1.3 Regular talks

1.3.1 Design Patterns For Big Shiny Apps

Alex Gold, Solutions Engineer, RStudio

Track(s): R Dataviz & Shiny, R Production

Abstract:

In about 20 minutes on the morning of January 27, 2020, one engineer launched over 500 individual cloud server instances for workshop attendees at RStudio::conf and managed them for the duration of the workshops — all from a Shiny app. The RStudio team manages a variety of production systems using Shiny apps including our workshop infrastructure and access to our sales demo server.

The Shiny apps are robust enough for these mission-critical activities because of an important lesson from web engineering: separation of concerns between front-end user interaction logic and back-end business logic. This design pattern can be implemented in R by creating user interfaces in Shiny and managing interactions with other systems with Plumber APIs and R6 classes.

This pattern keeps even complex Shiny apps understandable and maintainable. Moreover, it is broadly applicable to any app that has substantial interaction with outside systems. Session attendees will gain an understanding of the pattern, which can be useful for many large Shiny apps.
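
A minimal sketch of the pattern: the business logic lives in an R6 class that talks to a Plumber API, while the Shiny server function only handles user interaction. The class name, endpoint URL and fields below are hypothetical illustrations, not the apps discussed in the talk.

    library(shiny)
    library(R6)
    library(httr)

    # Back end: all interaction with the outside system lives in one R6 class.
    ServerFleet <- R6Class("ServerFleet",
      public = list(
        api_url = NULL,
        initialize = function(api_url) {
          self$api_url <- api_url
        },
        launch = function(n) {
          # POST to a (hypothetical) Plumber endpoint that starts n instances
          resp <- POST(paste0(self$api_url, "/launch"), body = list(n = n), encode = "json")
          content(resp)
        }
      )
    )

    # Front end: the Shiny app knows nothing about how instances are launched.
    fleet <- ServerFleet$new("http://localhost:8000")

    ui <- fluidPage(
      numericInput("n", "Instances to launch", 1, min = 1),
      actionButton("go", "Launch"),
      verbatimTextOutput("status")
    )
    server <- function(input, output, session) {
      status <- eventReactive(input$go, fleet$launch(input$n))
      output$status <- renderPrint(status())
    }
    shinyApp(ui, server)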

Coauthor(s): Cole Arendt .

1.3.2 Using XGBoost, Plumber and Docker in production to power a new banking product

André Rivenæs, Data Scientist, PwC

Markus Mortensen, PwC

Track(s): R Machine Learning & Models, R Production

Abstract:

Buffer is a brand new and innovative banking product by one of the largest retail banks in Norway, Sparebanken Vest, and it is powered by R.

In fact, the product’s decision engine is written entirely in R. We determine whether a customer should get a loan, and how large a loan they should be allocated, by analyzing large amounts of data from various sources. An essential part is analyzing the customer’s invoices using machine learning (XGBoost).

In this talk, we will cover:

  • How we use ML and Bayesian statistics to estimate the probability of an invoice being repaid.
  • How we successfully put the decision engine in production, using e.g. Plumber, Docker, CircleCI and Kubernetes.
  • What we have learned from using R in production at scale.
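
As a rough illustration of the second bullet, a Plumber file along these lines could expose an XGBoost scoring model as an HTTP endpoint and then be baked into a Docker image. The model file, features and port are hypothetical and do not reflect Sparebanken Vest's actual engine.

    # plumber.R -- hypothetical invoice-scoring endpoint
    library(plumber)
    library(xgboost)

    model <- xgb.load("invoice_model.xgb")  # previously trained model (hypothetical file)

    #* Score one invoice
    #* @param amount:numeric Invoice amount
    #* @param days_overdue:numeric Days overdue
    #* @post /score
    function(amount, days_overdue) {
      newdata <- matrix(as.numeric(c(amount, days_overdue)), nrow = 1)
      prob <- predict(model, xgb.DMatrix(newdata))
      list(probability_repaid = prob)
    }

    # run_api.R -- the container's entry point, e.g. Rscript run_api.R
    # plumber::plumb("plumber.R")$run(host = "0.0.0.0", port = 8000)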

1.3.3 Astronomical source detection and background separation: a Bayesian nonparametric approach

Andrea Sottosanti, University of Padova

Track(s): R Machine Learning & Models, R Applications

Abstract:

We propose an innovative approach based on Bayesian nonparametric methods to the signal extraction of astronomical sources in gamma-ray count maps in the presence of strong background contamination. Our model simultaneously induces clustering on the photons using their spatial information and gives an estimate of the number of sources, while separating them from the irregular signal of the background component that extends over the entire map. From a statistical perspective, the signal of the sources is modeled using a Dirichlet process mixture, which allows us to discover and locate a possibly infinite number of clusters, while the background component is completely reconstructed using a new flexible Bayesian nonparametric model based on B-spline basis functions. The resulting model can then be thought of as a hierarchical mixture of nonparametric mixtures for flexible clustering of highly contaminated signals. We also provide a Markov chain Monte Carlo algorithm for inference on the posterior distribution of the model parameters, and a suitable post-processing algorithm to quantify the information coming from the detected clusters. Results on different datasets confirm the capacity of the model to discover and locate the sources in the analysed map, to quantify their intensities, and to estimate and account for the presence of the background contamination.
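
In generic notation, and only as a sketch of the model class described above (not the authors' exact specification), each photon position x could be modelled as a draw from a two-component density, with a Dirichlet process mixture for the sources and a smooth spline-based surface for the background:

    f(x) = \xi \int k(x \mid \theta)\, \mathrm{d}G(\theta) + (1 - \xi)\, f_{\mathrm{bkg}}(x),
    \qquad G \sim \mathrm{DP}(\alpha, G_0),
    \qquad f_{\mathrm{bkg}}(x) \propto \sum_{j} \beta_j B_j(x),

where k is a spatial kernel, the B_j are B-spline basis functions, and \xi is the proportion of photons attributed to sources.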

Coauthor(s): Mauro Bernardi, Alessandra R. Brazzale, Roberto Trotta, David A. van Dyk .

1.3.4 Creating drag-and-drop shiny applications using sortable

Andrie de Vries, Solutions engineer at RStudio, Author of “R for Dummies”

Track(s): R Dataviz & Shiny

Abstract:

Using the learnr package you can create interactive tutorials in your R Markdown documents. For a long time, you could only use the built-in question types, including R coding exercises and quizzes with single or multiple choice answers. Since the release of learnr version 0.10.0, it has been possible to create custom question types. The new framework allows you to define the appearance and behaviour of your own questions. The sortable package uses this capability to introduce new question types for ranking questions and bucketing questions. With ranking questions you can ask your students to arrange answer options in the correct order. With bucketing questions you can ask them to arrange answer options into different buckets. The sortable package achieves this by exposing an htmlwidget wrapper around the SortableJS JavaScript library. This library lets you sort objects in a Shiny app with dynamic drag-and-drop behaviour. For example, you can arrange items in a list, or drag-and-drop the order of Shiny tabs. During this presentation you will see how easy it is to add dynamic behaviour to your Shiny app, and how simple it is to use the new sorting and bucketing tasks in your tutorials.
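
For illustration, a ranking question built with sortable inside a learnr tutorial chunk might look roughly like the snippet below. The question text and answer options are invented, and the function names follow the sortable documentation as I recall it, so treat this as a sketch rather than the definitive API.

    # inside an R chunk of a learnr tutorial document
    library(learnr)
    library(sortable)

    question_rank(
      "Drag the steps of a data analysis into the right order:",
      learnr::answer(c("Import", "Tidy", "Model", "Communicate"), correct = TRUE),
      allow_retry = TRUE
    )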

Coauthor(s): Barret Schloerke, Kenton Russell .

1.3.5 High dimensional sampling and volume computation

Apostolos Chalkis, PhD in Computer Science

Track(s): R Machine Learning & Models

Abstract:

Sampling from multivariate distributions is a fundamental problem in statistics that plays an important role in modern machine learning and data science. Many important problems, such as convex optimization and multivariate integration, can be efficiently solved via sampling. This talk presents the CRAN package volesti, which offers the R community efficient C++ implementations of state-of-the-art algorithms for sampling and volume computation of convex sets. It scales up to hundreds or thousands of dimensions, depending on the problem, providing the most efficient implementations for sampling and volume computation to date. Thus, volesti allows users to solve problems in dimensions an order of magnitude higher than before. We present the basic functionality of volesti and show how it can be used to provide approximate solutions to intractable problems in combinatorics, financial modeling, bioinformatics and engineering. We highlight two well-known applications in finance: we show how volesti can be used to detect financial crises and to evaluate portfolio performance in large stock markets with hundreds of assets, giving real-life examples using public data.
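
A minimal usage sketch, assuming the generator and sampling functions exported by recent CRAN versions of volesti (gen_cube(), sample_points() and volume()); argument names may differ between releases, so treat this as an outline rather than the exact API.

    library(volesti)

    # An H-representation of the 100-dimensional hypercube [-1, 1]^100
    P <- gen_cube(100, 'H')

    # Uniform samples from the interior of P
    pts <- sample_points(P, n = 1000)

    # Approximate volume of P
    vol <- volume(P)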

Coauthor(s): Vissarion Fisikopoulos .

1.3.6 Fake News: AI on the battle ground

Ayomide Shodipo, Senior Developer Advocate & Media Developer Expert at Cloudinary

Track(s): R Machine Learning & Models, R Life Sciences, R Production, R World

Abstract:

Counterfeit products have been a longstanding and growing pain for companies around the globe. In addition to impacting company revenue, they damage brand reputation and customer confidence. The challenge was to build a solution for a global electronics brand that can identify fake products from just one picture taken on a smartphone.

In this session, we will look into the building blocks that make this AI solution work. We’ll find out that there is much more to it than just training a convolutional neural network.

We look at challenges like how to manage and monitor the AI model and how to build and improve the model in a way that fits your DevOps production chain.

Learn how we used Azure Functions, Cosmos DB and Docker to build a solid foundation. See how we used the Azure Machine Learning service to train the models. And find out how we used Azure DevOps to control, build and deploy this state-of-the-art solution.

1.3.7 From consulting to open-source and back

Christoph Sax, R-enthusiast, economist @cynkra

Track(s): R World

Abstract:

Open-source development is a great source of satisfaction and fulfillment, but someone has to pay the bills. A straightforward solution is to consult for customers and help them pick the right tools. As a small group of R enthusiasts, we try to align open-source development with consulting by supporting our clients in accomplishing their goals, contributing to the community along the way. It turns out that the benefits work both ways: in addition to funding, consulting work allows us to test our tools and to improve their usability in a practical setting. At the same time, involvement in open-source development sharpens our analytical skills and serves as a first stop for new customers. Ideally, consulting projects lead to new developments, which in turn lead to new consulting projects.

Coauthor(s): Kirill Müller .

1.3.8 Deduplicating real estate ads using Naive Bayes record linkage

Daniel Meister, Datahouse AG

Track(s): R Applications

Abstract:

In this talk, we demonstrate how we used a containerized R and PostgreSQL data pipeline to deduplicate 60 million real estate ads from Germany and Switzerland using a multi-step Naive Bayes record linkage approach. Real estate platforms publish millions of rental flat and condominium ads yearly. A given region or country of interest is normally covered by various competing platforms, leading to multiple published ads for a single real-world object. Because quantifying and modeling the real estate market requires unbiased input data, our aim was to deduplicate real estate ads using Naive Bayes record linkage. We used commercially available German and Swiss real estate ad data from 2012 to 2019, consisting of approximately 60 million individual records. After multiple data cleaning and preparation steps we employed a Naive Bayes weighting of 12-14 variables to calculate similarity scores between ads and determined a linkage threshold based on expert judgment. The deduplication pipeline consisted of three steps: linking ads based on identity comparisons, linking similar ads within small regional areas (municipalities), and linking similar ads within large regional areas (cantons, states). The pipeline was deployed as a containerized setup with in-memory calculations in R and out-of-memory calculations and data storage in PostgreSQL. Deduplication linked the roughly 60 million ads to around 14 million object groups (Germany: 10 million, Switzerland: 4 million). The distribution of similarity scores showed high separation power, and the resulting object groups displayed high homogeneity in geographic location and price distribution. Furthermore, yearly results corresponded well with published relocation rates. Using Naive Bayes record linkage to deduplicate real estate ads resulted in a sensible grouping of ads into object groups (rental flats, condominiums). We were able to combine similarities across different variables into a single similarity score. An advantage of the Naive Bayes approach is the high interpretability of the influence of individual variables. However, by manually determining the linkage threshold our results are heavily influenced by possible expert biases. The containerized R and PostgreSQL setup proved its portability and scaling capabilities. The same approach could easily be transferred to other domains requiring deduplication of multivariate datasets.
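
As a generic illustration of the scoring idea (not the production pipeline itself), a Naive Bayes / Fellegi-Sunter style similarity score can be built by summing per-variable log-likelihood-ratio weights for agreement between two ads and then thresholding. The variable names, weights and threshold below are invented.

    # Hypothetical agreement weights: log(P(agree | match) / P(agree | non-match))
    weights <- c(rooms = 2.1, area_m2 = 1.8, floor = 0.9, street = 3.4, price = 1.5)

    similarity_score <- function(ad_a, ad_b, weights) {
      agree <- mapply(function(v) isTRUE(ad_a[[v]] == ad_b[[v]]), names(weights))
      # agreeing variables add their weight, disagreeing variables subtract it
      sum(ifelse(agree, weights, -weights))
    }

    ad_a <- list(rooms = 3, area_m2 = 82, floor = 2, street = "Bahnhofstrasse", price = 2100)
    ad_b <- list(rooms = 3, area_m2 = 82, floor = 3, street = "Bahnhofstrasse", price = 2100)

    score <- similarity_score(ad_a, ad_b, weights)
    linked <- score > 5  # the threshold itself was set by expert judgment in the talk's pipeline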

1.3.9 {polite}: web etiquette for R users

Dmytro Perepolkin, Lund University

Track(s): R World, R Applications

Abstract:

Data is everywhere, but that does not mean it is freely available. What are the best practices and acceptable norms for accessing data on the web? How does one know when it is OK to scrape the content of a website, and how can it be done in such a way that it does not create problems for the data owner and/or other users? This talk will provide examples of using the {polite} package for safe and responsible web scraping. The three pillars of {polite} are seeking permission, taking slowly and never asking twice.
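
A minimal example of the polite workflow, using the package's bow() and scrape() functions on a hypothetical URL:

    library(polite)
    library(rvest)

    # Seek permission: bow() reads robots.txt and negotiates a crawl delay
    session <- bow("https://www.example.com/articles")

    # Take slowly, never ask twice: scrape() rate-limits and memoises requests
    page <- scrape(session)
    titles <- html_text(html_elements(page, "h2"))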

1.3.10 Hydrological Modelling and R

Emanuele Cordano, www.rendena100.eu

Track(s): R Applications

Abstract:

Eco-hydrological and biophysical models are increasingly used in hydrology, ecology and precision agriculture for better management of water resources and for climate change impact studies at various scales: local, watershed or regional. However, to satisfy the demands of researchers and stakeholders, user-friendly interfaces are needed. The integration of such models in the powerful software environment of R greatly eases their application, input data preparation, output elaboration and visualization. In this work we present new developments for an R interface (the open-source package geotopbricks, https://CRAN.R-project.org/package=geotopbricks, and related packages) to the GEOtop hydrological distributed model (www.geotop.org - GNU General Public License v3.0). This package aims to be a link between the work of environmental engineers, who develop hydrological models, and that of data and applied scientists, who can extract information from the model results. Applications related to the simulation of water cycle dynamics (model calibration, mapping, data visualization) in some alpine basins and under scenarios of climate change and variability are shown. In particular, we will present an application in which the model predicts winter snow conditions, which play a critical role in governing the spatial distribution of fauna in temperate ecosystems.

Coauthor(s): Giacomo Bertoldi .

1.3.11 GeneTonic: enjoy RNA-seq data analysis, responsibly

Federico Marini, Center for Thrombosis and Hemostasis (CTH) & Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI) - University Medical Center Mainz

Track(s): R Life Sciences

Abstract:

Interpreting the results from RNA-seq transcriptome experiments can be a complex task, where the essential information is distributed among different tabular and list formats - normalized expression values, results from differential expression analysis, and results from functional enrichment analyses.

The identification of relevant functional patterns, as well as their contextualization in the data and results at hand, are not straightforward operations if these pieces of information are not combined together efficiently.

Interactivity can play an essential role in simplifying how one accesses and digests an RNA-seq data analysis more comprehensively.

I introduce GeneTonic (https://github.com/federicomarini/GeneTonic), an application developed in Shiny and based on many essential elements of the Bioconductor project, that aims to reduce the barrier to understanding such data better, and to efficiently combine the different components of the analytic workflow.

For example, starting from bird’s eye perspective summaries (with interactive bipartite gene-geneset graphs, or enrichment maps), it is easy to generate a number of visualizations, where drill-down user actions enable further insight and deliver additional information (e.g., gene info boxes, geneset summary, and signature heatmaps).

The interpretation of complex datasets can be wrapped up in a single call to the GeneTonic main function, which also supports built-in R Markdown reporting, either to conclude an exploration session or to generate the output of the available functionality in batch, delivering an essential foundation for computational reproducibility.
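
In code, that single call looks roughly like the one below. The object names are placeholders for a DESeq2 dataset, a differential expression result table, an enrichment result table and an annotation data frame; the argument names follow my reading of the GeneTonic documentation and may evolve between releases.

    library(GeneTonic)

    GeneTonic(
      dds            = dds,           # DESeqDataSet with the expression data
      res_de         = res_de,        # differential expression results
      res_enrich     = res_enrich,    # functional enrichment results
      annotation_obj = annotation_df  # gene id / gene symbol annotation
    )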

1.3.12 A simple and flexible inactivity/sleep detection R package

Francesca Giorgolo, Kode s.r.l. - Data Scientist

Track(s): R Life Sciences

Abstract:

With the widespread usage of wearable devices, great amounts of data have become available and new fields of application have arisen, such as health monitoring and activity detection. Our work focused on inactivity and sleep detection from continuous raw tri-axial accelerometer data, recorded using different accelerometer brands with sampling frequencies below and above 1 Hz. The algorithm implemented is the SPT-window detection algorithm described in the literature, slightly modified to meet the flexibility requirement we imposed on ourselves. The R package developed provides functions to clean data, to identify inactivity/sleep windows and to visualize the results. The main function has a parameter to specify the measurement unit of the data, a threshold to distinguish low and high activity, and a parameter to handle non-wear periods, where a non-wear period is defined as a period of time in which all accelerometer readings are equal to zero. Other functions allow users to separate overlapping accelerometer signals, e.g. when one device is replaced by another, and to visualize the obtained results.

Coauthor(s): Ilaria Ceppa, Marco Calderisi, Davide Massidda, Matteo Papi, Gabriele Galatolo, Andrea Spinelli, Andrea Zedda, Jacopo Baldacci, Caterina Giacomelli .

1.3.13 progressr: An Inclusive, Unifying API for Progress Updates

Henrik Bengtsson, UCSF, Assoc Prof, CS/Stats, R since 2000

Track(s): R Production, R Applications

Abstract:

The ‘progressr’ package provides a minimal, unifying API for scripts and packages to report progress from anywhere, including when using parallel processing, to anywhere.

It is designed such that the developer can focus on what to report progress on without having to worry about how to present it. The end user has full control of how, where, and when to render these progress updates. Progress bars from popular progress packages are supported and more can be added.

The ‘progressr’ package is inclusive by design. Specifically, no assumptions are made about how progress is reported, i.e. it does not have to be a progress bar in the terminal. Progress can also be reported as audio (e.g. unique begin and end sounds with intermediate, non-intrusive step sounds), or via a local or online notification system.

Another novelty is that progress updates are controlled and signaled via R’s condition framework. Because of this, there is no need for progress-specific arguments and progress can be reported from nearly everywhere in R, e.g. in classical for and while loops, within map-reduce APIs like the ‘lapply()’ family of functions, ‘purrr’, ‘plyr’, and ‘foreach’. It also works with parallel processing via the ‘future’ framework, e.g. ‘future.apply’, ‘furrr’, and ‘foreach’ with ‘doFuture’. The package is compatible with Shiny applications.
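
A minimal progressr example: the developer signals progress with a progressor, and the end user decides how it is rendered via handlers().

    library(progressr)

    slow_sqrt <- function(xs) {
      p <- progressor(along = xs)        # what to report progress on
      lapply(xs, function(x) {
        Sys.sleep(0.1)
        p(sprintf("x = %g", x))          # signal one progress update
        sqrt(x)
      })
    }

    handlers("txtprogressbar")           # how the end user wants progress rendered
    with_progress(y <- slow_sqrt(1:25))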

1.3.14 varycoef: Modeling Spatially Varying Coefficients

Jakob Dambon, HSLU & UZH, Switzerland

Track(s): R Machine Learning & Models

Abstract:

In regression models for spatial data, it is often assumed that the marginal effects of covariates on the response are constant over space. In practice, this assumption might often be questionable. Spatially varying coefficient (SVC) models are commonly used to account for spatial structure within the coefficients. With the R package varycoef, we provide the framework to estimate Gaussian process-based SVC models. It is based on maximum likelihood estimation (MLE) and, in contrast to existing model-based approaches, our method scales better to data where both the number of spatial points is large and the number of spatially varying covariates is moderately sized, e.g., above ten. We compare our methodology to existing methods such as a Bayesian approach using the stochastic partial differential equation (SPDE) link, geographically weighted regression (GWR), and eigenvector spatial filtering (ESF), in both a simulation study and an application where the goal is to predict prices of real estate apartments in Switzerland. The results from both the simulation study and the application show that our proposed approach results in increased predictive accuracy and more precise estimates.
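
In generic notation, a Gaussian process-based SVC model of the kind estimated by varycoef can be sketched as follows; this is the standard formulation of such models rather than a verbatim quote of the package documentation:

    y(s_i) = \sum_{j=1}^{p} \beta_j(s_i)\, x_{ij} + \varepsilon_i, \qquad
    \beta_j(\cdot) \sim \mathcal{GP}\!\left(\mu_j,\, c_j(\cdot,\cdot)\right), \qquad
    \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2),

where the mean and covariance parameters of the coefficient processes are estimated by maximum likelihood.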

Coauthor(s): Fabio Sigrist, Reinhard Furrer .

1.3.15 FastAI in R: preserving wildlife with computer vision

Jędrzej Świeżewski, Data Scientist at Appsilon

Track(s): R Machine Learning & Models

Abstract:

In this presentation, we will discuss using the latest techniques in computer vision as an important part of “AI for Good” efforts, namely, enhancing wildlife preservation. We will present how to make use of the latest technical advancements in an R setup even if they are originally implemented in Python.

A topic rightfully receiving growing attention among Machine Learning researchers and practitioners is how to make good use of the power gained from advances in the tools. One of the avenues in these efforts is assisting wildlife conservation by employing computer vision to make observations of wildlife much more effective. We will discuss several such efforts during the talk.

One of the most promising frameworks for computer vision developed recently is Fast.ai, a wrapper around PyTorch, a Python framework used for computer vision among other things. While it incorporates the latest theoretical developments in the field (such as one-cycle policy training), it provides an easy-to-use framework, allowing a much wider audience to benefit from these tools, such as AI for Good initiatives run by people who are not formally trained in Machine Learning.

During the presentation we will show how to make use of a model trained using the Python’s fastai library within an R workflow with the use of the reticulate package. We will focus on use cases concerning classifying species of African wildlife based on images from camera traps.
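
A sketch of that workflow with reticulate; the Python module path follows the fastai v2 API and the file names are hypothetical, so adjust to the fastai version actually used.

    library(reticulate)

    # Load the Python fastai library into the R session
    fastai <- import("fastai.vision.all")   # module path as in fastai v2 (assumption)

    # A learner previously trained and exported in Python (hypothetical file)
    learn <- fastai$load_learner("camera_trap_species.pkl")

    # Classify a single camera-trap image from R
    pred <- learn$predict("images/trap_042.jpg")
    pred[[1]]   # predicted species label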

Coauthor(s): Marek Rogala .

1.3.16 The R Consortium 2020: adapting to rapid change and global crisis

Joseph Rickert, RStudio: R Community Ambassador, R Consortium’s Board of Directors

Track(s): R World

Abstract:

The COVID-19 pandemic has turned the world upside down, and like everyone else the R Community is learning how to adapt to rapid change in order to carry on important work while looking for ways to contribute to the fight against the pandemic. In this talk, I will report on continuing R Community work being organized through the R Consortium such as the R Hub, R User Group Support Program and Diversity and Inclusion Projects; and through the various working groups including the Validation Hub, R / Pharma, R / Medicine and R Business. Additionally, I will describe some of the recently funded ISC projects and report on the COVID-19 Data Forum, a new project that the R Consortium is organizing in partnership with Stanford’s Data Science Institute.

1.3.17 Powering Turing e-Atlas with R

Layik Hama, Alan Turing Institute

Track(s): R Applications, R Production, R Dataviz & Shiny

Abstract:

Turing e-Atlas is a research project under the Urban Analytics research theme at the Alan Turing Institute (ATI). The ATI is the UK's national institute for data science and Artificial Intelligence, based at the British Library in London.

The research is a grand vision towards which we have been taking baby steps under the banner of an e-Atlas. We believe R is positioned to play a foundational role in any scalable solution for analysing and visualizing large-scale datasets, especially geospatial datasets.

The application presented is built using RStudio's Plumber package, which relies on solid libraries for developing web applications. The front end is made up of Uber's various visualization packages, using Facebook's React JavaScript framework.
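
A stripped-down sketch of such a back end: a Plumber endpoint that reads a geospatial layer with sf and returns it as GeoJSON for a React-based front end to render. The file name and endpoint are hypothetical and the actual e-Atlas code may differ.

    # plumber.R -- hypothetical endpoint serving a geospatial layer
    library(plumber)
    library(sf)
    library(geojsonsf)

    #* Return a layer as GeoJSON for the front end to render
    #* @serializer contentType list(type="application/json")
    #* @get /layer
    function() {
      layer <- st_read("data/local_authorities.geojson", quiet = TRUE)
      sf_geojson(layer)
    }

    # plumber::plumb("plumber.R")$run(host = "0.0.0.0", port = 8000)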

Coauthor(s): Dr Nik Lomax, Dr Roger Beecham .

1.3.18 Using process mining principles to extract a collaboration graph from a version control system log

Leen Jooken, Hasselt University, PhD student

Track(s): R Production, R World

Abstract:

Knowledge management is an indispensable component of modern-day, fast-changing and flexible software engineering environments. A clear overview of how software developers collaborate can reveal interesting insights, such as the general structure of collaboration, crucial resources, and risks in terms of knowledge preservation that can arise when a programmer decides to leave the company. Version control system (VCS) logs, which keep track of what team members work on and when, contain the data to provide these insights. We present an R package that provides an algorithm which extracts and visualizes a collaboration graph from VCS log data. The algorithm is based on principles from graph theory, cartography and process mining. Its structure consists of four phases: (1) building the base graph, (2) calculating weights for nodes and edges, (3) simplifying the graph using aggregation and abstraction, and (4) extending it to include specific insights of interest. Each of these phases offers the user a lot of flexibility in deciding which parameters and metrics to include. This makes it possible for human experts to exploit their existing knowledge about the project and team to guide the algorithm in building the graph that best fits their specific use case, and hence provides the most accurate insights.
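
The first two phases can be illustrated with a generic igraph sketch (not the package's actual API, which is not named in the abstract): build a bipartite author-file graph from a VCS log, project it onto authors, and weight edges by the number of files two developers both touched.

    library(igraph)

    # Hypothetical VCS log: one row per (commit author, touched file)
    log <- data.frame(
      author = c("ann", "ann", "bob", "bob", "cat"),
      file   = c("R/model.R", "R/plot.R", "R/model.R", "R/api.R", "R/plot.R")
    )

    # Phase 1: bipartite base graph of authors and files
    g <- graph_from_data_frame(log, directed = FALSE)
    V(g)$type <- V(g)$name %in% log$file

    # Phase 2: project onto authors; edge weights count shared files
    collab <- bipartite_projection(g, which = "false")
    E(collab)$weight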

Coauthor(s): Gert Janssenswillen, Mathijs Creemers, Mieke Jans, Benoît Depaire .

1.3.19 Manifoldgstat: an R package for spatial statistics of manifold data

Luca Torriani, MOX, Department of Mathematics, Politecnico di Milano

Ilaria Sartori, Politecnico di Milano

Track(s): R Machine Learning & Models

Abstract:

The statistical analysis of data belonging to Riemannian manifolds is becoming increasingly important in many applications, such as shape analysis or diffusion tensor imaging. In many cases, the available data are georeferenced, making spatial dependence a non-negligible data characteristic. Modeling and accounting for it is typically not trivial, because of the non-linear geometry of the manifold. In this contribution, we present the Manifoldgstat R package, which provides a set of fast routines that allow users to efficiently analyze sets of spatial Riemannian data, based on state-of-the-art statistical methodologies. The package stems from the need for an efficient and reproducible environment for running extensive simulation studies and bagging algorithms for spatial prediction of symmetric positive definite matrices. The package implements three main algorithms (Pigoli et al, 2016, Menafoglio et al, 2019, Sartori & Torriani, 2019). The latter two are particularly computationally demanding, as they rely on Random Domain Decompositions of the geographical domain. To substantially improve performance, the package exploits dplyr and Rcpp to integrate R with C++ code, where template factories handle all run-time choices. In this communication, we shall review the characteristics of the three methodologies considered and the key points of their implementation.

Coauthor(s): Alessandra Menafoglio, Piercesare Secchi .

1.3.20 Voronoi Linkage for Spatially Misaligned Data

Luís G. Silva e Silva, Food and Agriculture Organization - FAO - Data Scientist

Track(s): R Dataviz & Shiny, R World

Abstract:

In studies of elections, voting outcomes are point-referenced at voting stations, while socioeconomic covariates are areal data available at census tracts. The misaligned spatial structure of these two data sources makes the regression analysis needed to identify socioeconomic factors that affect voting outcomes a challenge. Here we propose a novel approach to link these two sources of spatial data through Voronoi tessellation. Our proposal is to create a Voronoi tessellation with respect to the point-referenced data; with this outset, the spatial points become a set of mutually exclusive polygons named Voronoi cells. The extraction of data from the census tracts is proportional to the intersection area of each census tract polygon and Voronoi cell. Consequently, we use 100% of the available information and preserve the polygons' autocorrelation structure. When visualised through our Shiny app, the method provides a finer spatial resolution than municipalities and facilitates the identification of spatial structures at a more detailed level. The technique is applied to the 2018 Brazilian presidential election data. The tool provides deep access to Brazilian election results by enabling users to create general maps, plots, and tables by state and city.
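
A condensed sf-based sketch of the linkage step (object and column names are invented): build Voronoi cells around the voting stations, intersect them with the census tracts, and allocate each tract's covariate proportionally to the intersected area.

    library(sf)

    # stations: point sf object (voting stations); tracts: polygon sf with covariate `income`
    voro  <- st_voronoi(st_union(stations))
    voro  <- st_collection_extract(voro, "POLYGON")
    cells <- st_sf(cell_id = seq_along(voro), geometry = voro)

    tracts$tract_area <- st_area(tracts)
    pieces <- st_intersection(cells, tracts)

    # Allocate each tract's covariate proportionally to the intersected area
    pieces$share        <- as.numeric(st_area(pieces)) / as.numeric(pieces$tract_area)
    pieces$income_alloc <- pieces$income * pieces$share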

Coauthor(s): Lucas Godoy, Douglas Azevedo, Augusto Marcolin, Jun Yan .

1.3.21 Be proud of your code! Tools and patterns for making production-ready, clean R code

Marcin Dubel, Software Engineer at Appsilon Data Science

Track(s): R Production, R World

Abstract:

In this talk you'll learn the tools and best practices for writing clean, reproducible R code in a working environment, ready to be shared and put into production. Save yourself time on maintenance, adjustments and struggling with packages.

R is a great tool for fast data analysis. Its simplicity of setup, combined with powerful features and community support, makes it a perfect language for many subject-matter experts, e.g. in finance or bioinformatics. Yet often what started as a pet project or proof of concept begins to grow and expand, with additional collaborators working on it. It is then crucial that your project is well organised and reusable, with a defined environment, so that the code works every time and on any machine. Otherwise the solution won't be used by anyone but you. By following a few patterns and with appropriate tools, this won't be overwhelming or disruptive and will highlight the true value of the code.

Both Appsilon and I personally have taken part in many R projects for which the goal was to clean and organise the code as well as the project structure. We would like to share our experience, best practices and useful tools to share code shamelessly.

During the presentation I will show:

  • setting up the development environment with packrat, renv and Docker,
  • organising the project structure,
  • best practices in writing R code, automated with a linter,
  • sharing the code using git,
  • organising the workflow with drake,
  • optimising Shiny apps and data loading with plumber and a database,
  • preparing tests and continuous integration with CircleCI.
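
As a flavour of the tooling listed above, a project might be bootstrapped and orchestrated roughly like this; the functions are the standard ones from renv, lintr and drake, while the plan's targets and the clean_data() helper are invented.

    # one-time project setup: reproducible library and code style checks
    renv::init()          # create a project-local package library
    renv::snapshot()      # record exact package versions in renv.lock
    lintr::lint_dir("R")  # flag style and correctness issues

    # workflow organised with drake: targets are rebuilt only when needed
    library(drake)
    plan <- drake_plan(
      raw     = readRDS(file_in("data/raw.rds")),
      cleaned = clean_data(raw),   # clean_data() is a project function (hypothetical)
      report  = rmarkdown::render(knitr_in("report.Rmd"),
                                  output_file = file_out("report.html"))
    )
    make(plan)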

1.3.22 Going in the fast lane with R. How we use R within the biggest digital dealer program in EMEA.

Marco Cavaliere, Like Reply - Business Data Analyst

Track(s): R Production, R Machine Learning & Models

Abstract:

We are using R as the foundation of all data-related tasks in the biggest dealer digital program at FCA. From simple tasks such as dashboarding or reporting to more strategic capabilities such as predicting advertising ROI with TensorFlow or developing production-grade, data-driven microservices, we leverage the R ecosystem to deliver better results and increase data awareness for all project stakeholders.

1.3.23 R alongside Airflow, Docker and Gitlab CI

Matthias Bannert, Research Engineering Lead at ETH Zurich, KOF Swiss Economic Institute

Track(s): R Production

Abstract:

The KOF Swiss Economic Institute at ETH Zurich (Switzerland) regularly surveys more than 10K companies, computes economic indicators and forecasts, and monitors the national economy. Today, the institute updates its website in an automated fashion and operates automated data pipelines to partners such as regional statistical offices or the Swiss National Bank. At KOF, production is based on an open-source ecosystem to a large degree. More and more processes are being migrated to an environment that consists of the open-source components Apache Airflow, Docker, GitLab Continuous Integration, PostgreSQL and R. This talk shows not only how well R interfaces and works with all parts of this stack, from workflow automation to databases, but also how R's advantages shape the system: from RStudio Server to internal packages and our own internal mini-CRAN, the use of the R language is crucial in making the environment stable and convenient to maintain for a team with a software-carpentry type of background.

1.3.24 DaMiRseq 2.0: from high dimensional data to cost-effective reliable prediction models

Mattia Chiesa, Senior data scientist @ Centro Cardiologico Monzino IRCCS

Track(s): R Life Sciences

Abstract:

High-dimensional data generated by modern high-throughput platforms pose a great challenge in selecting a small number of informative variables for biomarker discovery and classification. Machine learning is an appropriate approach to derive general knowledge from data, identifying highly discriminative features and building accurate prediction models. To this end, we developed the R/Bioconductor package DaMiRseq, which (i) helps researchers to filter and normalize high-dimensional datasets arising from RNA-Seq experiments, by removing noise and bias, and (ii) exploits a custom machine learning workflow to select the minimum set of robust, informative features able to discriminate classes. Here, we present version 2.0 of the DaMiRseq package, an extension that provides a flexible and convenient framework for managing high-dimensional data such as omics data, large-scale medical histories, or even social media and financial data. Specifically, DaMiRseq 2.0 implements new functions that allow training and testing of several different classifiers and selection of the most reliable one, in terms of classification performance and number of selected features. The resulting classification model can be further used for any prediction purpose. This framework will give users the ability to build an efficient prediction model that can be easily replicated in further related settings.

Coauthor(s): Chiara Vavassori, Gualtiero I. Colombo, Luca Piacentini .

1.3.25 How to apply R in a hospital environment on standard available hospital-wide data

Mieke Deschepper, University Hospital Ghent, staff member, Strategic Policy Cell, Ph.D.

Track(s): R Life Sciences

Abstract:

Lots of data is registered within hospitals for financial, clinical and administrative purposes. Today, this data is barely used, due to a lack of awareness of the data's existence, of its possible applications, and of the skills needed to execute the analysis. In this presentation we show how R can be applied to this data, and what the possibilities are, using standard available hospital-wide data on a low-cost budget.

  1. Reporting with R: using R and R Markdown as tools for management reporting; using R for data handling (ETL); Shiny applications as an alternative for dashboarding.
  2. Using R as a statistical tool: performing regression models to gain insight into certain predictors.
  3. Using R as a data science tool: using R to perform machine learning analyses, e.g. random forests; using R for data wrangling and handling high-dimensional data.
  4. Requirements for all of the above.
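
For the reporting use case, a parameterised R Markdown report rendered from R is often the entry point; a minimal sketch, with invented file and parameter names:

    # monthly_report.Rmd declares `department` and `month` under `params:` in its YAML header
    rmarkdown::render(
      "monthly_report.Rmd",
      params      = list(department = "Cardiology", month = "2020-04"),
      output_file = "cardiology_2020-04.html"
    )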

1.3.26 Computer Algebra Systems in R

Mikkel Meyer Andersen, Assoc. Prof., Department of Mathematical Sciences, Aalborg University, Denmark

Track(s): R World, R Machine Learning & Models, R Applications

Abstract:

R's ability to do symbolic mathematics is largely restricted to finding derivatives. There are many tasks involving symbolic math that are of interest to R users, e.g. inversion of symbolic matrices, limits, and solving non-linear equations. Users must resort to other computer algebra systems (CAS) for such tasks, and many R users (especially outside of academia) do not readily have access to such software. There are also other indirect use cases of symbolic mathematics in R that can exploit other strengths of R, including Shiny apps with auto-generated mathematics exercises.

We maintain two packages enabling symbolic mathematics in R: Ryacas and caracas. Ryacas is based on Yacas (Yet Another Computer Algebra System) and caracas is based on SymPy (a Python library). Each has its advantages: Yacas is extensible and has a close integration with R, which makes auto-generated mathematics exercises easy to produce. SymPy is feature-rich and thus offers many possibilities.

In this talk we will discuss the two packages and demonstrate various use-cases including uses that help understanding statistical models and Shiny apps with auto-generated mathematics exercises.
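
For a flavour of the Ryacas side, symbolic results can be obtained by passing Yacas expressions through yac_str(); the Yacas syntax below follows the Ryacas documentation as I recall it, and a caracas session looks similar but uses SymPy's vocabulary.

    library(Ryacas)

    yac_str("D(x) Sin(x)^2")           # derivative of sin(x)^2 with respect to x
    yac_str("Limit(x, 0) Sin(x)/x")    # limit of sin(x)/x as x -> 0
    yac_str("Solve(x^2 - 4 == 0, x)")  # solve a non-linear equation symbolically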

Coauthor(s): Søren Højsgaard .

1.3.27 Interpretable and accessible Deep Learning for omics data with R and friends

Moritz Hess, Research Associate, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg

Track(s): R Life Sciences

Abstract:

Recently, generative Deep Learning approaches were shown to have a huge potential for e.g. retrieving compact, latent representations of high-dimensional omics data such as single-cell RNA-Seq data. However, there are no established methods to infer how these latent representations relate to the observed variables, i.e. the genes.

For extracting interpretable patterns from gene expression data that indicate distinct sub-populations in the data, we here employ log-linear models, applied to the synthetic data and corresponding latent representations, sampled from generative deep models, which were trained with single-cell gene expression data.

While omics data are routinely analyzed in R and powerful toolboxes tailored to omics data are available, there are no established and truly accessible approaches for Deep Learning applications here.

To close this gap, we here demonstrate how easily customizable Deep Learning frameworks, developed for the Julia programming language, can be leveraged in R, to perform accessible and interpretable Deep Learning with omics data.

Coauthor(s): Stefan Lenz, Harald Binder .

1.3.28 Elevating shiny module with {tidymodules}

Mustapha Larbaoui, Novartis, Associate Director, Scientific Computing & Consulting

Track(s): R Dataviz & Shiny

Abstract:

Shiny app developers have warmly welcomed the concept of Shiny modules as a way to simplify the app development process through the introduction of reusable building blocks. Shiny modules are similar in spirit to functions in R, except that each is implemented with paired UI and server code along with its own namespace. The {tidymodules} R package introduces a novel structure that harmonizes module development based on R6 (https://r6.r-lib.org/), an implementation of encapsulated object-oriented programming for R; knowledge of R6 is therefore a prerequisite for using {tidymodules} to develop Shiny modules. Some key features of this package are module encapsulation, reference semantics, a central module store and an innovative framework for enabling and facilitating cross-module communication. It does this through the creation of “ports”, both input and output, through which users may pass data and information using pipe operators. Because the connections are strictly specified, the module network may be visualized, showing how data move from one module to another. We feel the {tidymodules} framework will simplify the module development process and reduce code complexity through programming concepts like inheritance.

Coauthor(s): Doug Robinson, Xiao Ni, David Granjon .

1.3.29 APFr: Average Power Function and Bayes FDR for Robust Brain Networks Construction

Nicolò Margaritella, University of Edinburgh

Track(s): R Life Sciences

Abstract:

Brain functional connectivity is widely investigated in neuroscience. In recent years, the study of brain connectivity has been largely aided by graph theory. The link between time series recorded at multiple locations in the brain and a graph is usually an adjacency matrix. This converts a measure of the connectivity between two time series, typically a correlation coefficient, into a binary choice of whether the two brain locations are functionally connected or not. As a result, the choice of a threshold over the correlation coefficient is key. In the present work, we propose a multiple testing approach to the choice of a suitable threshold, using the Bayes false discovery rate (FDR) and a new estimator of the statistical power, called the average power function (APF), to balance the two types of statistical error. We show that the proposed APF behaves well in the case of independence of the tests and is reliable under several dependence conditions. Moreover, we propose a robust method for threshold selection using the 5% and 95% percentiles of the APF and FDR bootstrap distributions, respectively, to improve stability. In addition, we developed an R package called APFr which performs robust estimation of the APF and Bayes FDR and provides simple examples to improve usability. The package has attracted more than 3200 downloads since its publication online (June 2019) at https://CRAN.R-project.org/package=APFr.

Coauthor(s): Piero Quatto .

1.3.30 Flexible Meta-Analysis of Generalized Additive Models with metagam

Øystein Sørensen, Associate Professor, University of Oslo

Track(s): R Life Sciences, R Machine Learning & Models

Abstract:

Analyzing biomedical data from multiple studies has great potential in terms of increasing statistical power, enabling detection of associations of smaller magnitude than would be possible when analyzing each study separately. Restrictions due to privacy or proprietary data, as well as more practical concerns, can make it hard to share datasets, such that analyzing all data in a single mega-analysis might not be possible. Meta-analytic methods provide a way to overcome this issue by combining aggregated quantities like model parameters or risk ratios. However, most meta-analytic tools have focused on parametric statistical models, and software for meta-analyzing semi-parametric models like generalized additive models (GAMs) has not been developed. The metagam package attempts to fill this gap: it provides functionality for removing individual participant data from GAM objects such that they can be analyzed in a common location; furthermore, metagam enables meta-analysis of the resulting GAM objects, as well as various tools for visualization and statistical analysis. This talk will illustrate the use of the metagam package for analysis of the relationship between sleep quality and brain structure using data from six European brain imaging cohorts.
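
The workflow can be sketched as follows: each cohort fits a GAM locally with mgcv, strips the individual participant data, and shares the stripped object for meta-analysis. Function names follow my reading of the metagam documentation; the simulated data stand in for cohorts that cannot be shared.

    library(mgcv)
    library(metagam)

    # Simulated stand-ins for two cohorts whose raw data cannot leave the site
    simulate_cohort <- function(n) {
      age <- runif(n, 20, 80)
      data.frame(age = age, outcome = sin(age / 15) + rnorm(n, sd = 0.2))
    }

    # At each data site: fit a GAM, then remove individual participant data
    fits <- lapply(list(simulate_cohort(200), simulate_cohort(300)), function(d) {
      strip_rawdata(gam(outcome ~ s(age), data = d))
    })

    # At the central location: meta-analyse the shared, stripped fits
    meta <- metagam(fits, terms = "s(age)")
    plot(meta)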

Coauthor(s): Andreas Brandmaier .

1.3.31 EPIMOD: A computational framework for studying epidemiological systems.

Paolo Castagno, Ph.D.

Track(s): R Life Sciences, R Applications

Abstract:

Computational-mathematical models can be efficiently used to provide new insights into the drivers of disease spread, to investigate different explanations of an observed resurgence, and to predict the potential effects of different therapies. In this context, we present a new general modeling framework for the analysis of epidemiological and biological systems, characterized by features that make it easy to use even for researchers without advanced mathematical and computational skills. The implementation, an R package called “Epimod”, provides a friendly interface to the model creation and analysis techniques implemented in the framework. In detail, by exploiting the graphical formalism of Petri Nets it is possible to simplify the model creation phase, providing a compact graphical description of the system and an automatic derivation of the underlying stochastic or deterministic process. Then, four functions cover the Model Generation, Sensitivity Analysis, Model Calibration and Model Analysis phases. Finally, the Docker containerization of all analysis techniques ensures a high level of portability and reproducibility. We applied Epimod to model pertussis epidemiology, investigating alternative explanations of its resurgence and predicting the potential effects of different vaccination strategies.

Coauthor(s): Simone Pernice, Matteo Sereno, Marco Beccuti.

1.3.32 CorrelAidX - Building R-focused Communities for Social Good on the Local Level

Regina Siegers, CorrelAidX Coordination

Track(s): R World

Abstract:

Data scientists with their valuable skills have enormous potential to contribute to the social good. This is also true for the R community - and R users seem to be especially motivated to use their skills for the social good, as the overwhelmingly positive reception of Julien Cornebise’s keynote “AI for Good” at useR2019 (Cornebise 2019) has shown. However, specific strategies for putting the abstract goal “use data science for the social good” into practice are often missing, especially in volunteering contexts like the R community, where resources are often limited.

In our talk, we present formats that we have implemented on the local level to build R-focused, data-for-good communities across Europe. Originating from the German data4good network CorrelAid with its over 1600 members, we have established 9 local CorrelAidX groups in Germany, the Netherlands, and France.

The specific formats build on a three-pillared concept of community building, namely group-bonding, social entrepreneurship and outreach. We present multiple examples that illustrate how our local chapters operate to put data science for good into practice - using the formats of data dialogues, local data4good projects, and CorrelAidX workshops. Lastly, we also outline possibilities to implement such formats in cooperation between CorrelAidX chapters and R community groups such as R user groups or RLadies chapters.

Coauthor(s): Konstantin Gavras .

1.3.33 Interactive visualization of complex texts

Renate Delucchi Danhier, Post-Doc, TU Dortmund

Track(s): R Dataviz & Shiny

Abstract:

Hundreds of speakers may describe the same circumstance - e.g. explaining a fixed route to a goal - without producing two identical texts. The enormous variability of language and the complexity involved in encoding meaning pose a real difficulty for linguists analyzing text databases. In order to aid linguists in identifying patterns for comparative research, we developed an interactive Shiny app that enables quantitative analysis of text corpora without oversimplifying the structure of language. Route directions are an example of complex texts, in which speakers take cognitive decisions such as segmenting the route, selecting landmarks and organizing spatial concepts into sentences. In the data visualization, symbols and colors representing linguistic concepts are plotted onto coordinates that relate the information to fixed points along the route. Six interconnected layers of meaning represent the multi-layered form-to-meaning mapping characteristic of language. The Shiny app allows users to select and deselect information on these different layers, offering a holistic linguistic analysis well beyond the complexity attempted within traditional linguistics. The result is a kind of visual language in itself that deconstructs the interconnected layers of meaning found in natural language.

Coauthor(s): Paula González Ávalos .

1.3.34 CONNECTOR: a computational approach to study intratumor heterogeneity.

Simone Pernice, Ph.D student at Department of Computer Science of the University of Turin

Track(s): R Life Sciences

Abstract:

The literature offers a broad class of mathematical models that can be used for fitting cancer growth time series, but there is no global consensus or biological evidence to drive the choice of the correct model. The conventional perception is that mechanistic models enable a biological understanding of the systems under study. However, these models cannot capture the variability characterizing cancer progression, especially because of the irregularity and sparsity of the available data. For this reason, we propose CONNECTOR, an R package built on a model-based approach for clustering functional data. The method clusters and fits the data through a combination of a natural cubic spline basis with coefficients treated as random variables. Our approach is particularly effective when the observations are sparse and irregularly spaced, as growth curves usually are. CONNECTOR provides a tool set to guide the user through parameter selection, i.e., (i) the dimension of the spline basis, (ii) the dimension of the mean space and (iii) the number of clusters to fit, all of which must be chosen before fitting. The effectiveness of CONNECTOR is evaluated on growth curves of patient-derived xenografts (PDXs) of ovarian cancer. Genomic analyses of the PDXs allowed us to correlate the fitted and clustered PDX growth curves with cell population distributions.

Coauthor(s): Beccuti Marco, Sirovich Roberta, Cordero Francesca .

1.3.35 gWQS: An R Package for Linear and Generalized Weighted Quantile Sum (WQS) Regression

Stefano Renzetti, PhD Student at Università degli Studi di Milano

Track(s): R Machine Learning & Models, R Life Sciences

Abstract:

Weighted Quantile Sum (WQS) regression is a statistical model for multivariate regression in high-dimensional datasets commonly encountered in environmental exposures. The model constructs a weighted index estimating the mixture effect associated with all predictor variables on an outcome. The package gWQS extends WQS regression to applications with continuous, categorical and count outcomes. We provide four examples to illustrate the usage of the package.
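
A call to the package's main function follows the pattern below; the argument names follow my reading of the gWQS documentation and may differ between versions, and the data frame dat and the vector of mixture component names are placeholders.

    library(gWQS)

    # `exposures` is a character vector naming the mixture components in `dat`
    fit <- gwqs(
      outcome ~ wqs,            # `wqs` is the weighted index built by the model
      mix_name   = exposures,
      data       = dat,
      q          = 4,           # quartile scoring of each component
      validation = 0.6,         # fraction of data used for validation
      b          = 100,         # bootstrap samples for estimating the weights
      b1_pos     = TRUE,        # constrain the mixture effect to be positive
      family     = "gaussian"
    )
    # the returned object contains the estimated component weights and the overall mixture effect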

Coauthor(s): Chris Gennings, Paul C. Curtin .

1.3.36 Transparent Journalism Through the Power of R

Tatjana Kecojevic, SisterAnalyst.org; founder and director

Track(s): R World

Abstract:

This study examines the often-tricky process of delivering data literacy programmes to professionals with the most to gain from a deeper understanding of data analysis. As such, the author discusses the process of building and delivering training strategies to journalists in regions where press freedom is constrained by numerous factors, not least of all institutionalised corruption.

Stories that are supplemented with transparent procedural systems are less likely to be contradicted and challenged by vested-interest actors. Journalists are able to present findings supported by charts and infographics, but these are open to interpretation. Therefore, most importantly, the data and code of the applied analytical methodology should also be available for scrutiny, where they are less likely to be subverted or prohibited.

As part of creating an accessible programme geared to acquiring skills necessary for data journalism, the author takes a step-by-step approach to discussing the actualities of building online platforms for training purposes. Through the use of grammar of graphics in R and Shiny, a web application framework for R, it is possible to develop interactive applications for graphical data visualisation. Presenting findings through interactive and accessible visualisation methods in a transparent and reproducible way is an effective form of reaching audiences that might not otherwise realise the value of the topic or data at hand.

The resulting ‘R toolbox for journalists’ is an accessible open-source resource. It can also be adapted to accommodate the need to provide a deeper understanding of the potential for data proficiency to other professions.

The accessibility of R allows users to build support communities, which in the case of journalists is essential for information gathering. Establishing and implementing transparent channels of communication is the key to scrupulous journalism, and is why R is so applicable to this objective.

1.3.37 What’s New in ShinyProxy

Tobias Verbeke, Managing Director, Open Analytics

Track(s): R Dataviz & Shiny

Abstract:

Shiny is a nice technology for writing interactive R-based applications. It is broadly adopted, and the R community has collaborated on many interesting extensions. Until recently, though, deployments in larger organizations and companies required proprietary solutions. ShinyProxy fills this gap and offers a fully open-source alternative to run and manage Shiny applications at scale. In this talk we detail the ShinyProxy architecture and demonstrate how it meets the needs of organizations. We will discuss how it scales to thousands of concurrent users and how it offers authentication and authorization functionality using standard technologies (LDAP, ActiveDirectory, OpenID Connect, SAML 2.0 and Kerberos). Also, we will discuss the management interface and how it allows administrators to monitor application usage and collect usage statistics in event-logging databases. Finally, we will demonstrate that Shiny applications can now be easily embedded in broader applications and (responsive) websites using the ShinyProxy API. Learn how academic institutions, governmental organizations and industry roll out Shiny apps with ShinyProxy, and how you can do this too. See https://shinyproxy.io.