1.1 Lightning talks
1.1.1 An enriched disease risk assessment model based on historical blood donors records
Andrea Cappozzo, PhD student at University of Milan-Bicocca
Track(s): R Applications
Abstract:
Historically, the medical literature has largely focused on determining risk factors at an illness-specific level. Nevertheless, recent studies suggested that identical risk factors may cause the appearance of different diseases in different patients (Meijers & De Boer, 2019).
Thanks to the collaboration between Heartindata, a group of data scientists offering their passion and skills for social good, and Avis Milano, the Italian blood donor organization, an enriched disease risk assessment model has been developed. Multiple risk factors and causes of donation drop-out are analyzed jointly from AVIS longitudinal records, with the final aim of providing a broader and clearer overview of the interplay between risk factors and associated diseases in the blood donor population.
Coauthor(s): Edoardo Michielon, Alessandro De Bettin, Chiara D’Ignazio, Luigi Noto, Davide Drago, Alberto Prospero, Francesca De Chiara, Sergio Casartelli.
1.1.2 rdwd: R interface to German Weather Service data
Berry Boessenkool, R trainer & consultant
Track(s): R Applications
Abstract:
rdwd is an R package to handle data from the German Weather Service (DWD). It allows users to easily select, download and read observational data from over 6k weather stations. Both current data and historical records (partially dating back to the 1860s) are handled. For about a year now, gridded data from radar measurements can be read as well.
1.1.3 tv: Show Data Frames in the Browser
Christoph Sax, R-enthusiast, economist @cynkra
Track(s): R World
Abstract:
The tv package displays data frames live during data analysis. It modifies the print method of data frames, tibbles or data tables so that they also appear in a browser or in the view pane of RStudio.
This is similar in spirit to the View() function in RStudio, but it also works in other development environments and has several advantages. Changes in a data frame are shown immediately and next to the script and the console output, rather than on top of them. The display keeps the position and the width of columns if a modified data frame is shown in tv. It is updated asynchronously, without interrupting the analysis workflow.
Coauthor(s): Kirill Müller.
1.1.4 Predicting the Euro 2020 results using tournament rank probabilities scores from the socceR package
Claus Ekstrøm, Statistician at University of Copenhagen. Longtime R hacker.
Track(s): R Applications
Abstract:
The 2020 UEFA European Football Championship will be played this summer. Football championships are the source of almost endless predictions about the winner and the results of the individual matches, and we will show how the recently developed tournament rank probability score can be used to compare predictions.
Different statistical models form the basis for predicting the result of individual matches. We present an R framework for comparing different prediction models and for comparing predictions about the Euro results. Everyone is encouraged to contribute their own function to make predictions for the result of the Euro 2020 championship.
Each contributor will be shown how to provide two functions: a function that predicts the final score for a match between two teams with different skill levels, and a function that updates the skill levels based on the result of a given match. By supplying these two functions to the R framework, the prediction results can be compared and the best football predictor can be determined when Euro 2020 finishes.
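To make the two-function contract concrete, here is a minimal sketch in Python rather than R. The function names, the log-linear skill model, and the learning rate are all assumptions made for illustration; they are not the interface of the socceR package or of the framework described in the talk.

```python
import math

def predict_score(skill_home, skill_away, base_goals=1.3):
    """Return expected goals for each side (hypothetical log-linear skill model)."""
    diff = skill_home - skill_away
    return base_goals * math.exp(diff), base_goals * math.exp(-diff)

def update_skill(skill_home, skill_away, goals_home, goals_away, rate=0.05):
    """Shift both skills toward the observed result; total skill is conserved."""
    exp_home, exp_away = predict_score(skill_home, skill_away)
    # How much the observed goal difference deviates from the predicted one.
    surprise = (goals_home - goals_away) - (exp_home - exp_away)
    return skill_home + rate * surprise, skill_away - rate * surprise

# Hypothetical teams "A" and "B": A starts stronger, then loses 0-2.
skills = {"A": 0.2, "B": 0.0}
expected = predict_score(skills["A"], skills["B"])
skills["A"], skills["B"] = update_skill(skills["A"], skills["B"], 0, 2)
```

A framework built on this contract only needs to call the first function before each match and the second one after it, which is what makes competing prediction models interchangeable.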
1.1.5 Differential Enriched Scan 2 (DEScan2): an R pipeline for epigenomic analysis.
Dario Righelli, Department of Statistics, University of Padua, Post-Doc
Track(s): R Life Sciences
Abstract:
We present DEScan2, an R/Bioconductor package for the differential enrichment analysis of epigenomic sequencing data. Our method consists of three steps: peak calling, peak consensus across samples, and peak signal quantification. The peak caller is a standard moving scan window that compares the counts in a sliding window with those in a larger region outside the window, using a Poisson likelihood and providing a z-score for each peak. However, the package can work with any external peak caller: to this end, we provide additional functionalities to load peaks from bed files and handle them as internal optimized structures. The consensus step aims to determine whether a peak is a “true peak” based on its replicability across samples: we developed a filtering step that removes peaks not present in at least a user-given number of samples. A further threshold can be applied to the peak z-scores. Finally, the third step produces a count matrix where each column is a sample and each row a previously filtered peak. The value of each matrix cell is the number of reads for the peak in the sample. Furthermore, our package provides several functionalities for handling common genomic data structures, for instance the possibility to split the data by chromosome to speed up the computations by parallelizing them on multiple CPUs.
Coauthor(s): Koberstein John, Gomes Bruce, Zhang Nancy, Angelini Claudia, Peixoto Lucia, Risso Davide.
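The scan-window idea in the abstract above can be sketched as follows, in Python rather than R. This is a simplified illustration, not the DEScan2 algorithm: it assumes read counts are already binned, estimates the background rate from neighbouring bins, and uses a normal approximation to the Poisson to obtain a z-score. The function name, window sizes, and toy counts are hypothetical.

```python
import math

def window_z_scores(counts, window=3, background=15):
    """Score each sliding window against the mean rate of a larger surrounding region."""
    scores = []
    for start in range(len(counts) - window + 1):
        end = start + window
        observed = sum(counts[start:end])
        # Background region: up to `background` bins on each side, window excluded.
        lo, hi = max(0, start - background), min(len(counts), end + background)
        outside = counts[lo:start] + counts[end:hi]
        rate = max(sum(outside) / len(outside), 1e-9)  # mean reads per bin
        expected = rate * window
        # Normal approximation to the Poisson: the variance equals the mean.
        scores.append((observed - expected) / math.sqrt(expected))
    return scores

# Toy binned coverage with one enriched region around bins 7 to 9.
counts = [2, 1, 3, 2, 2, 1, 2, 14, 18, 15, 2, 1, 3, 2, 2, 1, 2, 3]
z = window_z_scores(counts)
best = max(range(len(z)), key=z.__getitem__)  # start bin of the top-scoring window
```

The sketch assumes the series is longer than the window, so the background region is never empty.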
1.1.6 Ultra fast penalized regressions with R package {bigstatsr}
Florian Privé, Postdoc at Aarhus University
Track(s): R Machine Learning & Models
Abstract:
In this talk, I introduce the penalized linear and logistic regressions implemented in the R package {bigstatsr}. These implementations use data stored on disk to handle very large matrices. They automatically perform a procedure similar to cross-validation to choose the two hyper-parameters, λ and α, of the elastic net regularization, in parallel. They employ an early stopping criterion to avoid fitting very expensive models, making these implementations on average 10 times faster than {glmnet}. However, package {bigstatsr} does not implement all the many models and options provided by the excellent package {glmnet}; some are areas of future development.
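For readers unfamiliar with the two hyper-parameters, the elastic net objective can be written as below. This is the parameterization used by {glmnet}; that {bigstatsr} follows exactly the same convention is an assumption here. The symbol L stands for the average loss (squared error for linear regression, negative log-likelihood for logistic regression).

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\, \beta}\;
    L(\beta_0, \beta)
    + \lambda \left[ \frac{1 - \alpha}{2}\, \lVert \beta \rVert_2^2
                     + \alpha\, \lVert \beta \rVert_1 \right],
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Here λ controls the overall strength of the penalty, while α mixes the ridge penalty (α = 0) and the lasso penalty (α = 1).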
1.1.7 Supporting Twitter analytics application with graph-databases and the aRangodb package
Gabriele Galatolo, Kode Srl, Software Developer & Data Scientist
Track(s): R Applications
Abstract:
The importance of finding efficient ways to model and store unstructured data has grown enormously in the last decade, in particular with the strong expansion of social-media services. Among storage tools, an increasingly important class is that of graph-oriented databases, where relationships between data are considered first-class citizens. To help analysts and data scientists interact with this paradigm in a simple way, last year we developed the package aRangodb, an interface to the graph-oriented database ArangoDB. To show the capabilities of the package and of the underlying way of modeling data using graphs, we present Tweetmood, a tool to analyze and visualize tweets from Twitter. In this talk, we will present some of the most significant features of the package applied in the Tweetmood context, such as functionalities to traverse the graph, and some examples in which the user can process those graphs to obtain new information that can easily be stored using the functions and tools available in the package.
Coauthor(s): Francesca Giorgolo, Ilaria Ceppa, Marco Calderisi, Davide Massidda, Matteo Papi, Andrea Spinelli, Andrea Zedda, Jacopo Baldacci, Caterina Giacomelli.
1.1.8 Reproducible Data Visualization with CanvasXpress
Ger Inberg, Freelance Analytics Developer
Track(s): R Dataviz & Shiny
Abstract:
canvasXpress was developed as the core visualization component for bioinformatics and systems biology analysis at Bristol-Myers Squibb. It supports a large number of visualizations to display scientific and non-scientific data. canvasXpress also includes a simple and unobtrusive user interface to explore complex data sets, a sophisticated and unique mechanism to keep track of all user customization for Reproducible Research purposes, as well as an ‘out of the box’ broadcasting capability to synchronize selected data points in all canvasXpress plots in a page. Data can be easily sorted, grouped, transposed, transformed or clustered dynamically. The fully customizable mouse events as well as the zooming, panning and drag-and-drop capabilities are features that make this library unique in its class.
1.1.9 Design your own quantum simulator with R
Indranil Ghosh, final-year postgraduate student, Department of Physics, Jadavpur University, Kolkata, India
Track(s): R Applications
Abstract:
The main idea of the project is to use the R ecosystem to write code for a quantum simulator that can simulate different quantum algorithms. I will start by giving a brief introduction to the linear algebra needed for quantum computation and showing how to write your own R code from scratch to implement it. Then I will dive into implementing simple quantum circuits, starting with initializing qubits and ending with a measurement. I will also implement simple quantum algorithms, concluding with a brief introduction to quantum game theory and its simulation with R.
1.1.10 What are the potato eaters eating
Keshav Bhatt, R-fan and independent researcher
Track(s): R Applications, R Dataviz & Shiny
Abstract:
Although stereotypes can be quite useful, they are often not correct. For instance, the Dutch are stereotyped as being potato eaters. While this might have been historically correct, it is not currently accurate. The Dutch eat potatoes sparingly, and this paper uses data to disprove the stereotype. To get an impression of Dutch food habits, a popular local website was scraped. Besides being popular, the website hosts user-generated content, making it a good proxy for Dutch taste buds. As was already apparent on the website, lasagna is the most popular dish. Detailed NLP analysis of more than 50,000 recipes showed that potato-based dishes are in fact nowhere near the top. This vindicated my belief. Moreover, it shows that the Dutch kitchen is globalizing. Tomato, a hallmark of Southern Europe, is more popular than the Dutch potato. Also observed is the popularity of many herbs in the recipes, which are not a traditional component of the Dutch kitchen. The world is changing, and so are our kitchens. This trend will also be explored for other countries.
1.1.11 dm: working with relational data models in R
Kirill Müller, Clean code, tidy data. Consulting for cynkra, coding in the open.
Track(s): R Applications, R Production, R World
Abstract:
Storing all data related to a problem in a single table or data frame (“the dataset”) can result in many repetitive values. Separation into multiple tables helps data quality but requires “merge” or “join” operations. {dm} is a new package that fills a gap in the R ecosystem: it makes working with multiple tables just as easy as working with a single table.
A “data model” consists of tables (both the definition and the data), and primary and foreign keys. The {dm} package combines these concepts with data manipulation powered by the tidyverse: entire data models are handled in a single entity, a “dm” object.
Three principal use cases for {dm} can be identified:
When you consume a data model, {dm} helps access and manipulate a dataset consisting of multiple tables (database or local data frames) through a consistent interface.
When you use a third-party dataset, {dm} helps normalize the data to remove redundancies as part of the cleaning process.
To create a relational data model, you can prepare the data using R and familiar tools and seamlessly export to a database.
The presentation revolves around these use cases and shows a few applications. The {dm} package is available on GitHub and will be submitted to CRAN in early February.
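The notion of a data model as tables plus primary and foreign keys can be illustrated with a small sketch, in Python rather than R. It does not use or mimic the {dm} API: the dictionary layout, the helper names, and the toy airline data are all hypothetical and serve only to show why keeping the keys together with the tables makes checks and joins mechanical.

```python
# Hypothetical mini data model: two tables linked by a primary and a foreign key.
model = {
    "tables": {
        "airlines": [{"carrier": "AA", "name": "American"},
                     {"carrier": "DL", "name": "Delta"}],
        "flights": [{"flight": 1, "carrier": "AA"},
                    {"flight": 2, "carrier": "DL"},
                    {"flight": 3, "carrier": "AA"}],
    },
    "primary_keys": {"airlines": "carrier", "flights": "flight"},
    # (child table, child column) -> parent table
    "foreign_keys": {("flights", "carrier"): "airlines"},
}

def check_keys(model):
    """Return a list of key violations (empty if the model is consistent)."""
    problems = []
    for table, key in model["primary_keys"].items():
        values = [row[key] for row in model["tables"][table]]
        if len(values) != len(set(values)):
            problems.append(f"duplicate primary key in {table}.{key}")
    for (child, column), parent in model["foreign_keys"].items():
        parent_key = model["primary_keys"][parent]
        known = {row[parent_key] for row in model["tables"][parent]}
        for row in model["tables"][child]:
            if row[column] not in known:
                problems.append(f"{child}.{column}={row[column]!r} missing in {parent}")
    return problems

def join(model, child, column):
    """Follow a foreign key and merge each child row with its parent row."""
    parent = model["foreign_keys"][(child, column)]
    parent_key = model["primary_keys"][parent]
    lookup = {row[parent_key]: row for row in model["tables"][parent]}
    return [{**lookup[row[column]], **row} for row in model["tables"][child]]

problems = check_keys(model)
flat = join(model, "flights", "carrier")
```

Because the join is derived from the declared keys, the caller never has to repeat the join columns, which is the convenience the abstract attributes to handling the whole model as one object.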
1.1.12 Explaining black-box models with xspliner to make deliberate business decisions
Krystian Igras, Data Scientist and Software Engineer at Appsilon
Track(s): R Machine Learning & Models
Abstract:
A vast majority of the state of the art ML algorithms are black boxes, meaning it is difficult to understand their inner workings. The more that algorithms are used as decision support systems in everyday life, the greater the necessity of understanding the underlying decision rules. This is important for many reasons, including regulatory issues as well as making sure that the model learned sensible features. You can achieve all that with the xspliner R package that I have created.
One of the most promising methods to explain models is building surrogate models. This can be achieved by inferring Partial Dependence Plot (PDP) curves from the black box model and building Generalised Linear Models based on these curves. The advantage of this approach is that it is model agnostic, which means you can use it regardless of what methods you used to create your model.
From this presentation, you will learn what PDP curves and GLMs are and how you can calculate them based on black box models. We will take a look at an interesting business use case in which we’ll find out whether the original black box model or the surrogate one is a better decision system for our needs. Finally, we will see an example of how you can explain your models using this approach with the xspliner package for R (already available on CRAN!).
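As a rough illustration of the first ingredient, the following Python sketch computes a partial dependence curve for one feature of an opaque model. It covers only the PDP step; fitting a GLM on transformed features is not shown, and nothing here reflects the xspliner API. The stand-in model, data, and grid are invented for the example.

```python
def black_box(row):
    """Stand-in for an opaque model: nonlinear in x0, linear in x1."""
    x0, x1 = row
    return x0 ** 2 + 0.5 * x1

def partial_dependence(model, data, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            changed = list(row)
            changed[feature] = value  # overwrite one feature, keep the others
            total += model(changed)
        curve.append(total / len(data))
    return curve

data = [(-1.0, 2.0), (0.0, 4.0), (1.0, 6.0), (2.0, 8.0)]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
pdp_x0 = partial_dependence(black_box, data, feature=0, grid=grid)
```

The resulting curve recovers the quadratic shape of the first feature, shifted by the average contribution of the second. A surrogate model would then use such a curve as a transformation of the feature inside an interpretable linear model.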
1.1.13 Using open-access data to derive genome composition of emerging viruses
Liam Brierley, MRC Skills Development Fellow, University of Liverpool
Track(s): R Life Sciences
Abstract:
Outbreaks of new viruses continue to threaten global health, including pandemic influenza, Ebola virus, and the novel coronavirus ‘nCoV-2019’. Advances in genome sequencing allow access to virus RNA sequences on an unprecedented scale, representing a powerful tool for epidemiologists to understand new viral outbreaks.
We use NCBI’s GenBank, a curated open-access repository containing >200 million genetic sequences (3 million viral sequences) directly submitted by users, representing many individual studies. However, the resulting breadth of data and inconsistencies in metadata present consistent challenges.
We demonstrate our approach using R to address these challenges and a need for reproducibility as data increases. Firstly, we use rentrez to programmatically search, filter, and obtain virus sequences from GenBank. Secondly, we use taxize to resolve pervasive problems of naming conflicts, as virus names are often recorded differently between entries, partly because virus classification is complex and regularly revised. We successfully resolve 428 mammal and bird RNA viruses to species level before extracting sequences.
Obtaining genome sequences of a large inventory of viruses allows us to estimate genomic composition biases, which show promise in predicting virus epidemiology. Ultimately, this pathway will allow better quantification of future epidemic threats.
Coauthor(s): Anna Auer-Fowler, Maya Wardeh, Matthew Baylis, Prudence Wong.
1.1.14 A principal component analysis based method to detect biomarker captation from vibrational spectra
Marco Calderisi, Kode srl, CTO
Track(s): R Life Sciences
Abstract:
BRAIKER is a microfluidics-based biosensor designed to detect biomarkers. The device is responsive to changes of mass and viscosity over its surface. When selected markers react with the sensor, a variation of resonant acoustic frequencies (called harmonics) is produced. A serious problem when examining the data produced by biosensors is the subjectivity of the standard method for evaluating the pattern of harmonics. In our research, a method based on principal component analysis has been applied to vibrational data. An R-Shiny application was developed to present data visualizations and multivariate analyses of vibrational spectra. The Shiny application allows users to clean and explore data with interactive data visualisation tools. Principal component analysis is applied to analyse simultaneously the full set of frequencies for multiple experimental runs, reducing the multivariate data set to a small number of components that account for a proportion of variance close to that of the original data. Functionalised and non-functionalised resonating foils of the biosensor can be classified in order to validate the capability of the device to detect biomarkers, lowering the limit of detection (LOD) and increasing sensitivity and resolution.
Coauthor(s): Francesca Giorgolo, Ilaria Ceppa, Davide Massidda, Matteo Papi, Gabriele Galatolo, Andrea Spinelli, Andrea Zedda, Jacopo Baldacci, Caterina Giacomelli, Marco Cecchini, Matteo Agostini.
1.1.15 An innovative way to support your sales force
Matilde Grecchi, Head of Data Science & Innovation @ZucchettiSpa
Track(s): R Production, R Dataviz & Shiny, R Machine Learning & Models, R Applications
Abstract:
This talk explains the web application built in Shiny and deployed in production to support the sales force of Zucchetti. It gives an overview of the steps followed, from data ingestion to modeling, from model validation to building the Shiny web app, and from deployment in production to continuous learning thanks to feedback from the sales force and redemption of customers. All the code is written in R using RStudio. The app is deployed with ShinyProxy.io.
1.1.16 ptmixed: an R package for flexible modelling of longitudinal overdispersed count data
Mirko Signorelli, Dept. of Biomedical Data Sciences, Leiden University Medical Center
Track(s): R Machine Learning & Models, R Life Sciences
Abstract:
Overdispersion is a commonly encountered feature of count data, and it is usually modelled using the negative binomial (NB) distribution. However, not all overdispersed distributions are created equal: while some are severely zero-inflated, others exhibit heavy tails. Mounting evidence from many research fields suggests that NB models often cannot fit heavy-tailed or zero-inflated counts sufficiently well. It has been proposed to solve this problem by using the more flexible Poisson-Tweedie (PT) family of distributions, of which the NB is a special case. However, current methods based on the PT can only handle cross-sectional datasets, and no extension for correlated data is available. To overcome this limitation we propose a PT mixed-effects model that can be used to flexibly model longitudinal overdispersed counts. To estimate this model we develop a computational pipeline that uses adaptive quadratures to accurately approximate the likelihood of the model, and numeric optimization methods to maximize it. We have implemented this approach in the R package ptmixed, which is published on CRAN. Besides showcasing the package’s functionalities, we will present an assessment of the accuracy of our estimation procedure, and provide an example application where we analyse longitudinal RNA-seq data, which often exhibit high levels of zero-inflation and heavy tails.
Reference: Signorelli, M., Spitali, P., Tsonaka, R. (2020, in press). Poisson-Tweedie mixed-effects model: a flexible approach for the analysis of longitudinal RNA-seq data. To appear in Statistical Modelling. arXiv preprint: arXiv:2004.11193
Coauthor(s): Roula Tsonaka, Pietro Spitali.
1.1.17 One-way non-normal ANOVA in reliability analysis using doex
Mustafa CAVUS, PhD Student @Eskisehir Technical University
Track(s): R Production, R Life Sciences, R Applications
Abstract:
One-way ANOVA is used for testing equality of several population means in statistics, and current packages in R provide functions to apply it. However, violations of its assumptions, namely non-normality and variance heterogeneity, limit its use and in some cases make it inapplicable. doex provides alternative statistical methods to solve this problem. It has several tests based on generalized p-value, parametric bootstrap and fiducial approaches for cases of variance heterogeneity and non-normality. Moreover, it provides newly proposed methods for testing equality of mean lifetimes under different failure rates.
This talk introduces the doex package, which provides several methods for testing equality of population means without relying on the strict assumptions of ANOVA. An illustrative example is given for testing equality of mean product lifetimes under different failure rates.
Coauthor(s): Berna YAZICI.
1.1.18 Keeping on top of R in Real-Time, High-Stakes trading systems
Nicholas Jhirad, Senior Data Scientist, CINQ ICT (on Contract to Pinnacle Sports)
Track(s): R Production
Abstract:
Visibility is the key to production. For R to work inside that environment, we need ubiquitous logging. I’ll share insights from our experience building a production-grade R stack and monitoring all of our R applications via syslog, the ‘rsyslog’ package (on CRAN) and Splunk.
Coauthor(s): Aaron Jacobs.
1.1.19 Towards more structured data quality assessment in the process mining field: the DaQAPO package
Niels Martin, Postdoctoral researcher Research Foundation Flanders (FWO) - Hasselt University
Track(s): R Applications
Abstract:
Process mining is a research field focusing on the extraction of insights on business processes from process execution data embedded in files called event logs. Event logs are a specific data structure originating from information systems supporting a business process, such as an Enterprise Resource Planning System or a Hospital Information System. As a research field, process mining has predominantly focused on the development of algorithms to retrieve process insights from an event log. However, consistent with the “garbage in - garbage out” principle, the reliability of an algorithm’s outcomes strongly depends upon the data quality of the event log. It has been widely recognized that real-life event logs typically suffer from a multitude of data quality issues, stressing the need for thorough data quality assessment. Currently, event log quality is often judged on an ad-hoc basis, entailing the risk that important issues are overlooked. Hence, there is a need for a more structured data quality assessment approach within the process mining field. To meet this need, the DaQAPO package has been developed; its name is an acronym for Data Quality Assessment of Process-Oriented data. It offers an extensive set of functions to automatically identify common data quality problems in process execution data. In this way, it is the first R package which supports systematic data quality assessment for event data.
Coauthor(s): Niels Martin (Research Foundation Flanders FWO - Hasselt University), Greg Van Houdt (Hasselt University), Gert Janssenswillen (Hasselt University).
1.1.20 Analyzing Preference Data with the BayesMallows Package
Øystein Sørensen, Associate Professor, University of Oslo
Track(s): R Machine Learning & Models
Abstract:
BayesMallows is an R package for analyzing preference data in the form of rankings with the Mallows rank model, and its finite mixture extension, in a Bayesian framework. The model is grounded on the idea that the probability density of an observed ranking decreases exponentially with the distance to the location parameter. It is the first Bayesian implementation that allows wide choices of distances, and it works well with a large number of items to be ranked. BayesMallows handles non-standard data: partial rankings and pairwise comparisons, even in cases including non-transitive preference patterns. The Bayesian paradigm allows coherent quantification of posterior uncertainties of estimates of any quantity of interest. These posteriors are fully available to the user, and the package comes with convenient tools for summarizing and visualizing the posterior distributions.
This talk will focus on how the BayesMallows package can be used to analyze preference data, in particular how the Bayesian paradigm allows endless possibilities in answering questions of interest with the help of visualization of posterior distributions. Such posterior summaries can easily be communicated with scientific collaborators and business stakeholders who may not be machine learning experts themselves.
Coauthor(s): Marta Crispino, Qinghua Liu, Valeria Vitelli.
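The core idea of the Mallows rank model, a probability that decays exponentially with the distance to a central ranking, can be sketched for a handful of items by brute-force enumeration. The Python code below is an illustration only and is unrelated to the BayesMallows implementation: it assumes the Kendall distance, a fixed scale parameter, and no Bayesian inference, and some formulations rescale the scale parameter by the number of items.

```python
import math
from itertools import permutations

def kendall_distance(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    n = len(r1)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )

def mallows_pmf(rho, alpha):
    """Probability of every ranking of len(rho) items, by brute-force enumeration."""
    weights = {
        r: math.exp(-alpha * kendall_distance(r, rho))
        for r in permutations(sorted(rho))
    }
    z = sum(weights.values())  # normalizing constant
    return {r: w / z for r, w in weights.items()}

# rho[i] is the rank given to item i in the central (location) ranking.
rho = (1, 2, 3, 4)
pmf = mallows_pmf(rho, alpha=1.0)
```

Enumeration is only feasible for very small numbers of items, since the number of rankings grows factorially; handling many items is exactly where a dedicated package is needed.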
1.1.21 Predicting Business Cycle Fluctuations Using Text Analytics
Sami Diaf, Researcher at the University of Hamburg
Track(s): R Machine Learning & Models
Abstract:
The use of computational linguistics has proved crucial in studying macroeconomic forecasts and understanding the essence of such exercises.
Combining machine learning algorithms with text mining pipelines helps to dissect potential patterns of forecast errors and to investigate the role of ideology in such outcomes.
The Priority Program “Exploring the Experience-Expectation Nexus” builds up, from a large database of German business cycle reports, advanced topic models and predictive analytics to investigate the role of ideology in the production of macroeconomic forecasts. The pipelines call for advanced data processing, predicting business fluctuations from text covariates, measuring ideological stances of forecasters and explaining what influences forecast errors.
1.1.22 Flexible deep learning via the JuliaConnectoR
Stefan Lenz, Statistician at the Institute of Medical Biometry and Statistics (IMBI), Faculty of Medicine and Medical Center – University of Freiburg
Track(s): R Machine Learning & Models
Abstract:
For deep learning in R, frameworks from other languages, e.g. from Python, are widely used. Julia is another language which offers computational speed and a growing ecosystem for machine learning, e.g. with the package “Flux”. Integrating functionality of Julia in R is especially promising due to the many commonalities of Julia and R. We take advantage of these in the design of our “JuliaConnectoR” R package, which aims at a tight integration of Julia in R. We would like to present our package, together with some deep learning examples. The JuliaConnectoR can import Julia functions, also from whole packages, and make them directly callable in R. Values and data structures are translated between the two languages. This includes the management of objects holding external resources such as memory pointers. The possibility to pass R functions as arguments to Julia functions makes the JuliaConnectoR a truly functional interface. Such callback functions can, e.g., be used to interactively display the learning process of a neural network in R while it is trained in Julia. Among other things, this feature sets the JuliaConnectoR apart from the other R packages for integrating Julia in R, “XRJulia” and “JuliaCall”. This becomes possible with an optimized communication protocol, which also allows a highly efficient data transfer, leveraging the similarities in the binary representation of values in Julia and R.
Coauthor(s): Harald Binder.
1.1.23 Time Series Missing Data Visualizations
Steffen Moritz, Institute for Data Science, Engineering, and Analytics, TH Köln
Track(s): R Dataviz & Shiny, R Applications
Abstract:
Missing data is a quite common problem for time series, which usually also complicates later analysis steps. In order to deal with this problem, visualizing the missing data is a very good start.
Visualizing the patterns in the missing data can provide more information about the reasons for the missing data and give hints on how to best proceed with the analysis.
This talk gives a short introduction to the new plotting functions introduced with version 3.1 of the imputeTS CRAN package.
Coauthor(s): Thomas Bartz-Beielstein.
1.1.24 effectclass: an R package to interpret effects and visualise uncertainty
Thierry Onkelinx, Statistician at the Research Institute for Nature and Forest
Track(s): R Dataviz & Shiny
Abstract:
The package classifies effects by comparing their confidence interval with a reference, a lower and an upper threshold, all of which are set by the user a priori. The null hypothesis is a good choice as reference. The lower and upper threshold define a region around the reference in which the effect is small enough to be irrelevant. These thresholds are ideally based on the effect size used in the statistical power analysis of the design. Otherwise they can be based on expert judgement.
The result is a ten-scale classification of the effect. Three classes exist for significant effects above the reference and three classes for significant effects below the reference. The remaining four classes split the non-significant effects. The most important distinction is between “no effect” and “unknown effect”.
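The logic of comparing a confidence interval with a reference and two thresholds can be sketched in a few lines of Python. This is a deliberately coarse illustration, not the effectclass implementation: it collapses the ten classes into four, the labels and the function name are hypothetical, and every non-significant interval that is not entirely inside the thresholds is lumped into "unknown effect".

```python
def classify(lcl, ucl, reference=0.0, lower=-1.0, upper=1.0):
    """Coarse four-way classification of a confidence interval [lcl, ucl]."""
    if not (lower < reference < upper) or lcl > ucl:
        raise ValueError("need lower < reference < upper and lcl <= ucl")
    if lcl > reference:
        return "significant, above reference"
    if ucl < reference:
        return "significant, below reference"
    # The interval contains the reference: the effect is not significant.
    if lower < lcl and ucl < upper:
        return "no effect"       # every plausible value is too small to matter
    return "unknown effect"      # relevant effects cannot be ruled out

labels = [
    classify(0.5, 2.0),    # interval entirely above the reference
    classify(-3.0, -0.2),  # interval entirely below the reference
    classify(-0.4, 0.3),   # narrow interval around the reference
    classify(-2.0, 2.5),   # wide interval covering both thresholds
]
```

The last two cases show why the distinction matters: both intervals are non-significant, but only the narrow one supports the conclusion that the effect is negligible.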
effectclass provides the ggplot2 add-ons stat_effect() and scale_effect() to visualise the effects as points with shapes depending on the classification. It also provides stat_fan(), which displays the uncertainty as multiple overlapping intervals with different confidence probabilities. stat_fan() is inspired by Britton, Fisher & Whitley (1998).
More details on the package website: https://effectclass.netlify.com/
Britton, E.; Fisher, P. & J. Whitley (1998). The Inflation Report Projections: Understanding the Fan Chart. Bank of England Quarterly Bulletin.