1.2 Posters

1.2.1 A flexible dashboard for monitoring platform trials

Alessio Crippa, Karolinska Institutet, postdoc

Track(s): R Applications

Abstract:

The Data and Safety Monitoring Board (DSMB) is an essential component of a successful clinical trial. It consists of an independent group of experts who periodically review and evaluate the accumulating data from an ongoing trial to assess patient safety, study progress, and drug efficacy. Based on their evaluation, a recommendation to continue, modify, or stop the trial is delivered to the trial’s sponsor. It is essential to provide the DSMB with effective visualization tools for regularly monitoring the live data from the trial. We designed and developed an interactive dashboard using flexdashboard for R as a tool to assist the DSMB in evaluating the results of the ProBio study, a clinical platform trial for improving treatment decisions in patients with metastatic castration-resistant prostate cancer. We will focus on the customized layout for displaying the most relevant variables and on the interactive tools that proved particularly useful for assessing the accumulating data. We will also cover the connection to the data sources, the automated generation process, and the access permissions that allow DSMB members to view the dashboard.
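
The ProBio dashboard itself is not public; as a rough sketch of the flexdashboard skeleton such a monitoring report builds on (page, column, and chunk contents here are illustrative placeholders, not ProBio code):

    ---
    title: "DSMB monitoring (sketch)"
    output: flexdashboard::flex_dashboard
    ---

    Safety
    =====================================

    Column
    -------------------------------------

    ### Adverse events over time

    ```{r}
    # placeholder; the real dashboard would read the live trial data here
    plot(cumsum(rpois(52, 2)), type = "s", xlab = "week", ylab = "events")
    ```

Rendering this file with rmarkdown produces a static HTML dashboard; interactive tables and plots can be added with packages such as DT and plotly.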

Coauthor(s): Andrea Discacciati, Erin Gabriel, Martin Eklund.

1.2.2 PRDA package: Enhancing Statistical Inference via Prospective and Retrospective Design Analysis

Angela Andreella, University of Padua

Track(s): R Life Sciences

Abstract:

There is a growing recognition of the importance of power analysis and of calculating an appropriate sample size when planning a research experiment. However, power analysis is not the only relevant aspect of the design of an experiment. Other inferential risks, such as the probability of estimating the effect in the wrong direction or the average overestimation of the actual effect, are also important. Evaluating these inferential risks alongside statistical power, in what Gelman and Carlin (2014) defined as Design Analysis, may help researchers make informed choices both when planning an experiment and when evaluating study results. We introduce the PRDA (Prospective and Retrospective Design Analysis) package, which allows researchers to carry out a Design Analysis under different experimental scenarios (Altoè et al., 2020). Considering a plausible effect size (or its prior distribution), researchers can evaluate either the inferential risks for a given sample size or the sample size required to obtain a given statistical power. Previously, PRDA functions were limited to mean differences between groups, considering Cohen’s d in the Null Hypothesis Significance Testing (NHST) framework. Here, we present newly developed features that include other effect sizes (such as Pearson’s correlation) as well as Bayes factor hypothesis testing.
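
PRDA’s own interface is documented on CRAN; to make the quantities concrete, the sketch below simulates the three risks of a retrospective design analysis for a two-group comparison from scratch (plausible effect d = 0.3, n = 20 per group). This is an illustration of the idea, not PRDA code.

    # design analysis by simulation: power, Type S error (wrong sign among
    # significant results) and Type M error (average exaggeration)
    set.seed(42)
    d <- 0.3; n <- 20
    res <- t(replicate(10000, {
      g1 <- rnorm(n, mean = d); g2 <- rnorm(n, mean = 0)
      c(est = mean(g1) - mean(g2), p = t.test(g1, g2)$p.value)
    }))
    sig <- res[, "p"] < 0.05
    mean(sig)                        # statistical power
    mean(res[sig, "est"] < 0)        # Type S: significant but wrong direction
    mean(abs(res[sig, "est"])) / d   # Type M: exaggeration ratio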

Coauthor(s): Anna Vesely, Claudio Zandonella Callegher, Massimiliano Pastore, Gianmarco Altoè.

1.2.3 Automate flexdashboard with GitHub

Binod Jung Bogati, Data Analyst Intern at VIN

Track(s): R Dataviz & Shiny

Abstract:

flexdashboard is a great tool for building interactive dashboards in R. We can host them for free on GitHub Pages, RPubs, and many other places.

A hosted flexdashboard is static, so every time the data change we need to update and publish it manually. If we want auto-updates we could integrate Shiny; however, that may not be suitable for every case.

To overcome this, we have a solution: GitHub Actions. It is a feature from GitHub which automates our tasks in a convenient way.

With the help of GitHub Actions, we can automate our flexdashboard (R Markdown) updates. The workflow builds a container that runs our R scripts, and we can trigger it on every push to GitHub or on a schedule, every X minutes/hours/days/months.

If you want to learn more about GitHub Actions, and how to automate updates to your flexdashboard, please do come and join me.
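
A minimal sketch of the render script such a workflow might call on each trigger (the file names are hypothetical; the workflow itself only needs to check out the repository, set up R, run this script, and publish the rendered HTML):

    # render_dashboard.R: re-read the latest data and re-render the dashboard
    library(rmarkdown)

    render(
      input       = "dashboard.Rmd",  # flexdashboard source in the repo
      output_file = "index.html",     # page served by GitHub Pages
      output_dir  = "docs"
    )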

1.2.4 EasyReporting: a Bioconductor package for Reproducible Research implementation

Dario Righelli, Department of Statistics, University of Padua, Post-Doc

Track(s): R Applications, R Life Sciences

Abstract:

EasyReporting is a novel R/Bioconductor package for speeding up Reproducible Research (RR) implementation when analyzing data, implementing workflows, or developing other packages. It provides an S4 class that helps developers integrate an RR layer inside their software products, and helps data analysts speed up report production without having to learn the rmarkdown language. With minimal additional effort from the developer, the end user obtains an rmarkdown file containing all the source code generated during the analysis, divided into code chunks (CC) ready for compilation. Moreover, EasyReporting also makes it possible to add natural-language comments and textual descriptions to the final report, producing an enriched document that incorporates input data, source code, and output results. Once compiled, the final document can be attached to the publication of the analysis as supplementary material, helping the interested community to entirely reproduce the computational part of the work. Unlike previously proposed solutions, which usually require significant effort from the final user and can discourage them from including RR in their scripts, our approach is versatile and easy to incorporate: it allows the developer/analyst to automatically create and store an rmarkdown document, and it also provides methods for its compilation.
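
EasyReporting’s actual classes and methods are described in its Bioconductor vignette; purely to illustrate the underlying idea (an S4 object that accumulates comments and code chunks into an rmarkdown file), here is a hand-rolled toy sketch with hypothetical names, not the EasyReporting API:

    # toy S4 "report recorder"; NOT the EasyReporting interface
    setClass("ToyReport", representation(file = "character"))

    setGeneric("addComment", function(rpt, msg) standardGeneric("addComment"))
    setMethod("addComment", "ToyReport", function(rpt, msg) {
      cat(msg, "\n\n", file = rpt@file, append = TRUE)  # free-text section
    })

    setGeneric("addCodeChunk", function(rpt, code) standardGeneric("addCodeChunk"))
    setMethod("addCodeChunk", "ToyReport", function(rpt, code) {
      cat("```{r}\n", code, "\n```\n\n", sep = "",      # rmarkdown code chunk
          file = rpt@file, append = TRUE)
    })

    rpt <- new("ToyReport", file = "analysis_report.Rmd")
    addComment(rpt, "## Exploratory step")
    addCodeChunk(rpt, "summary(cars)")
    rmarkdown::render(rpt@file)  # compile the accumulated report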

Coauthor(s): Claudia Angelini.

1.2.5 NewWave: a scalable R package for the dimensionality reduction of single-cell RNA-seq data

Federico Agostinis, Università degli studi di Padova, Fellowship

Track(s): R Life Sciences

Abstract:

The fast development of single-cell sequencing technologies in recent years has generated a gap between the throughput of the experiments and the capability of analyzing the generated data. One recent method for dimensionality reduction of single-cell RNA-seq data is zinbwave, which optimizes a zero-inflated negative binomial likelihood to find biologically meaningful latent factors and remove batch effects. zinbwave performs very well but has some scalability issues due to its large memory usage. To address this, we developed an R package with a new software architecture extending zinbwave. In this package, we implement mini-batch stochastic gradient descent and the possibility of working with HDF5 files. We decided to use a negative binomial model following the observation that droplet sequencing technologies do not induce zero inflation in the data. Thanks to these improvements and the possibility of massively parallelizing the estimation process using PSOCK clusters, we are able to speed up the computations with the same or even better results than zinbwave. This type of parallelization can be used on multiple hardware setups, ranging from simple laptops to dedicated server clusters. This, paired with the ability to work with out-of-memory data, enables us to analyze datasets with millions of cells.
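
NewWave’s own interface is in its package documentation; purely to illustrate the PSOCK-cluster pattern mentioned above, here is a minimal sketch with base R’s parallel package, where the per-gene fitting function is a stand-in rather than NewWave internals:

    library(parallel)

    fit_gene <- function(y) mean(y)  # stand-in for a per-gene estimation step

    counts <- matrix(rpois(1000 * 20, lambda = 5), nrow = 1000)  # genes x cells

    cl  <- makePSOCKcluster(4)                 # 4 worker R processes
    res <- parApply(cl, counts, 1, fit_gene)   # spread genes across workers
    stopCluster(cl)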

Coauthor(s): Chiara Romualdi, Gabriele Sales, Davide Risso.

1.2.6 orf: Ordered Random Forests

Gabriel Okasa, Research Assistant and PhD Candidate at the Swiss Institute for Empirical Economic Research, University of St. Gallen, Switzerland

Track(s): R Machine Learning & Models

Abstract:

The R package ‘orf’ is a software implementation of the Ordered Forest estimator as developed in Lechner and Okasa (2019). The Ordered Forest flexibly estimates the conditional class probabilities of models involving categorical outcomes with an inherent ordering structure, known as ordered choice models. In addition to what common machine learning algorithms offer, the Ordered Forest enables estimation of marginal effects together with statistical inference, and thus provides output comparable to that of standard econometric models. Accordingly, the ‘orf’ package provides generic R functions to estimate, predict, plot, print, and summarize the estimation output of the Ordered Forest, along with various options for forest-specific tuning parameters. Finally, computational speed is ensured as the core forest algorithm relies on the fast C++ forest implementation from the ranger package (Wright and Ziegler 2017).
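
A minimal usage sketch, assuming the interface described in the package documentation (orf() on a covariate matrix and an ordered numeric outcome, with margins() for marginal effects); consult the package reference for the authoritative details:

    library(orf)

    data(odata)                   # example data shipped with the package
    Y <- as.numeric(odata[, 1])   # ordered categorical outcome
    X <- as.matrix(odata[, -1])   # covariates

    fit  <- orf(X, Y)             # fit the Ordered Forest
    summary(fit)
    pred <- predict(fit)          # conditional class probabilities
    eff  <- margins(fit)          # marginal effects with inference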

Coauthor(s): Michael Lechner.

1.2.7 Power Supply health status monitoring dashboard

Marco Calderisi, Kode srl, CTO

Track(s): R Dataviz & Shiny

Abstract:

The Primis project dashboard allows analyzing the health status of power supply boards on two levels: (1) analysis of a specific board, to check its status and the presence of any anomalies; (2) analysis of multiple boards within a single power supply, to check whether the set of boards reveals abnormal behavior and whether some boards behave in a distinctly different way from the others. The analysis algorithms and the web application were created using the R programming language, and in particular the Shiny library. The application is therefore divided into two parts that reflect these two types of analysis, called respectively “Product View” (analysis and diagnostics of a specific board) and “Product Comparison” (comparative analysis of multiple boards of the same power supply). Both analyses can be carried out over an arbitrary time interval, selectable through a dedicated application menu. The analysis is carried out by means of: (1) univariate analysis, focusing on a specific parameter of one or more channels and displaying aggregate information on the status of the board over the entire observation period; (2) multivariate analysis, i.e. the application of multivariate algorithms that provide an overall assessment of the board, taking all variables into account simultaneously.
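
The application code is not public; as a rough sketch of the described structure (two views plus a selectable time interval), assuming nothing beyond base Shiny:

    library(shiny)

    ui <- navbarPage(
      "Primis dashboard (sketch)",
      tabPanel("Product View",
               dateRangeInput("period", "Observation period"),
               plotOutput("board_status")),
      tabPanel("Product Comparison",
               dateRangeInput("period_cmp", "Observation period"),
               plotOutput("board_comparison"))
    )

    server <- function(input, output) {
      output$board_status     <- renderPlot(plot(1:10))  # placeholder panels
      output$board_comparison <- renderPlot(plot(10:1))
    }

    shinyApp(ui, server)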

Coauthor(s): Jacopo Baldacci, Caterina Giacomelli, Ilaria Ceppa, Davide Massidda, Matteo Papi, Gabriele Galatolo, Francesca Giorgolo, Ferdinando Giordano, Alessandro Iovene.

1.2.8 Predicting first-year ICT student dropout with R models

Natalja Maksimova, Virumaa College of Tallinn University of Technology, lecturer

Track(s): R Machine Learning & Models

Abstract:

The aim of this study is to find out how first-year ICT student dropout can be predicted at one Estonian college, Virumaa College of Tallinn University of Technology (TalTech), and possibly to apply methods to decrease the dropout rate. We apply three machine learning approaches using R tools: logistic regression, decision trees, and Naive Bayes. The models are computed on the basis of data from the TalTech study information system. As a result, we propose a methodical approach that may be put into practice at other institutions. All applied methods yield high predictive performance, with accuracies above 85%. At the same time, we identified both influential and non-influential factors in predicting ICT students’ dropout.
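
A hedged sketch of the modelling step: the data frame below is simulated (the real study uses TalTech records), and the rpart and e1071 packages supply the standard tree and Naive Bayes implementations:

    library(rpart)   # decision trees
    library(e1071)   # naiveBayes()

    set.seed(1)
    # simulated stand-in: one row per student, binary outcome `dropout`
    students <- data.frame(
      dropout   = factor(sample(c("yes", "no"), 200, replace = TRUE)),
      admission = runif(200, 1, 5),
      credits   = rpois(200, 20)
    )

    idx   <- sample(nrow(students), 0.7 * nrow(students))
    train <- students[idx, ]; test <- students[-idx, ]

    m_log  <- glm(dropout ~ ., data = train, family = binomial)
    m_tree <- rpart(dropout ~ ., data = train, method = "class")
    m_nb   <- naiveBayes(dropout ~ ., data = train)

    acc <- function(pred) mean(pred == test$dropout)
    acc(ifelse(predict(m_log, test, type = "response") > 0.5, "yes", "no"))
    acc(predict(m_tree, test, type = "class"))
    acc(predict(m_nb, test))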

Coauthor(s): Olga Dunajeva.

1.2.9 Benchmark Percentage Disjoint Data Splitting in Cross Validation for Assessing the Skill of Machine Learning Algorithms

Olalekan Joseph Akintande, University of Ibadan, Ph.D. Student

Track(s): R Machine Learning & Models

Abstract:

The controversies surrounding dataset splitting techniques, and the folklore about what has been or should be done, remain an open debate. Several authors (bloggers, researchers, and data scientists) in machine learning and similar research areas have proposed various arbitrary percentage disjoint dataset splitting (DDS) options for validating the skill of machine learning algorithms, and by extension the appropriate percentage DDS based on cross-validation techniques. In this work, we propose benchmarks on which the percentage DDS procedure should be based. These benchmarks are founded on various training sizes (m) and serve as the basis and justification for choosing an appropriate percentage DDS for assessing the skill of ML algorithms in related fields, building on the concept of cross-validation.
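
Illustrative only: the kind of experiment this debate concerns, measuring how test accuracy moves with the training percentage, averaged over repeated random splits (using the built-in iris data and an rpart tree as a stand-in classifier):

    set.seed(1)

    split_acc <- function(p, reps = 50) {        # p = training fraction
      mean(replicate(reps, {
        idx  <- sample(nrow(iris), p * nrow(iris))
        fit  <- rpart::rpart(Species ~ ., data = iris[idx, ], method = "class")
        pred <- predict(fit, iris[-idx, ], type = "class")
        mean(pred == iris$Species[-idx])         # test-set accuracy
      }))
    }

    sapply(c(0.5, 0.6, 0.7, 0.8, 0.9), split_acc)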

Coauthor(s): O.E. Olubusoye.

1.2.10 Integrating professional software engineering practices in medical research software

Patricia Ryser-Welch, Newcastle University, Population Health Science Institute, Research Associate

Track(s): R Applications

Abstract:

Health data sets are getting bigger, more complex, and are increasingly being linked up with other data sources. With this trend there is an increasing risk of patient identification and disclosure. Two different ways of mitigating this risk are to use a federated analysis approach or to use a data safe haven.

DataSHIELD (www.datashield.ac.uk) is an established federated data analysis tool used in the medical sciences. The software has a variety of built-in methods to reduce the risk of disclosure. Here we describe the steps we are taking to apply modern software engineering methodologies to DataSHIELD. The upcoming Medical Devices legislation requires that software undergoes more rigorous testing. While this legislation does not directly apply to software used for research, we think it is important that the ideas behind it filter down to research software. For us these principles include testing that functions run, as well as testing that they produce the correct answers. Testing against a static, publicly available standard data set is also an important aspect. This work is being done in a continuous integration framework using Microsoft Azure. Additionally, all our software is developed as open source.
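
A minimal sketch of the kind of test this implies, using the testthat package; the function and reference values here are placeholders, not DataSHIELD code:

    library(testthat)

    # placeholder for a disclosure-aware analysis function under test
    safe_mean <- function(x) { stopifnot(length(x) >= 5); mean(x) }
    reference <- c(2, 4, 6, 8, 10)  # stands in for a static public test dataset

    test_that("the function runs", {
      expect_silent(safe_mean(reference))
    })

    test_that("the function produces the correct answer", {
      expect_equal(safe_mean(reference), 6)
    })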

In addition to the protection DataSHIELD provides on its own, we are also integrating it into our Trusted Research Environment as part of Connected Health Cities North East and North Cumbria. This will give an extra level of protection to data that may automatically flow from multiple data sources. Additionally, as the analysis can be done in a federated way, the data does not need to leave its data controller’s environment. This opens up the possibility of analyses happening across trusts and regions.

Coauthor(s): http://www.datashield.ac.uk.

1.2.11 Dealing with changing administrative boundaries: The case of Swiss municipalities

Tobias Schieferdecker, Daily dealings with data at cynkra

Track(s): R Applications

Abstract:

Switzerland’s municipalities are frequently merged or reorganized in an attempt to reduce costs and increase efficiency. These mergers create a substantial problem for data analysis. It is often desirable to study a municipality over time, but in order to create a time series for a region of interest, its borders should stay constant. Our goal is to provide R functions that allow easy and consistent handling of these mergers. We create a mapping table of municipality IDs for a specified period of time that allows us to track the mergers over time. We also provide weights, such as population and area of the municipalities, to facilitate the construction of weighted time series. Various other municipality mutations are also taken into account. A toy version of such a mapping table is sketched below.
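
Purely illustrative (the IDs and weights are invented), this shows how a mapping table with weights lets a series recorded under old IDs be re-expressed in post-merger boundaries:

    # toy mapping: municipalities 101 and 102 merged into the new unit 201
    mapping <- data.frame(
      old_id = c(101, 102, 103),
      new_id = c(201, 201, 103),
      weight = c(0.4, 0.6, 1.0)   # e.g. population shares within the new unit
    )

    # a series recorded under the old IDs
    series <- data.frame(old_id = c(101, 102, 103), value = c(10, 20, 7))

    # aggregate the series to the current (post-merger) boundaries
    merged <- merge(series, mapping, by = "old_id")
    aggregate(value ~ new_id, data = merged, FUN = sum)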

We are creating two R packages: an infrastructure package that handles the task of keeping the data up to date; and a user package that contains the functions to deal with the mergers.

Coauthor(s): Thomas Knecht, Kirill Müller, Christoph Sax.

1.2.12 badDEA: An R package for measuring firms’ efficiency adjusted by undesirable outputs

Yann Desjeux, INRAE, France

Track(s): R Applications

Abstract:

Growing concerns about the detrimental effects of human production activities on the environment, e.g. air, soil, and water pollution, have triggered the development of new performance indicators (including productivity and efficiency measures) accounting for such undesirable impacts. Firms can now be benchmarked not only in terms of economic performance, but also in terms of the environmental performance linked to production. In the performance benchmarking literature, and more specifically that on the non-parametric Data Envelopment Analysis (DEA) approach, several methodologies have been developed to treat these impacts as undesirable (or bad) outputs. Related empirical applications in the literature, performed with various software packages, show that conclusions differ depending on how these undesirable outputs are introduced. However, none of these methodologies is routinely available in R. In this context, we developed the badDEA package to provide a single, consistent framework where users (students, researchers, practitioners) can find the major methodologies proposed in the literature for computing efficiency measures adjusted for undesirable outputs. In this presentation, we will describe the aim, structure, and options of the badDEA package, unfolding all the methodologies in their different variants and providing a promising tool for decision-making.
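
For intuition about the kind of computation such a package wraps, here is a self-contained sketch of one textbook variant, a directional distance function with an undesirable output, solved as a linear program with lpSolve; the data are invented and this is generic DEA code, not badDEA itself:

    library(lpSolve)

    x <- c(2, 4, 6)    # input
    y <- c(4, 9, 10)   # good output
    b <- c(1, 3, 5)    # undesirable (bad) output
    n <- length(x)

    ddf <- function(k) {
      gy <- y[k]; gb <- b[k]       # direction: scale firm k's own outputs
      obj <- c(1, rep(0, n))       # variables: beta, lambda_1..lambda_n
      mat <- rbind(c(-gy, y),      # sum(lambda*y) - beta*gy >= y_k
                   c( gb, b),      # sum(lambda*b) + beta*gb  = b_k (weak disposability)
                   c(  0, x))      # sum(lambda*x)           <= x_k
      lp("max", obj, mat, c(">=", "=", "<="), c(y[k], b[k], x[k]))$objval
    }

    sapply(1:n, ddf)   # beta = 0 means the firm is on the frontier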

Coauthor(s): K Hervé Dakpo, Yann Desjeux, Laure Latruffe.