Reproducible Workflows

RStudio/Posit Solutions Engineering (Lisa Anders)

RStudio, PBC

Outline

  • The data science workflow
  • New tools and tricks: Development, Sharing, and Production
  • Environment Management Strategies: Using renv and Public Package Manager
  • Case Study

Data Science Workflow

Develop -> Share -> Productionize

Illustration by Allison Horst, from Hadley Wickham’s talk “The Joy of Functional Programming (for Data Science)”.

New tools and tricks: Development

tidymodels - The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.

vetiver - Vetiver provides fluent tooling to version, deploy, and monitor a trained model.

pins - The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues.
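As a minimal sketch of that publish-and-read cycle, assuming the pins package is installed: the temporary board and the pin name "mtcars-sample" are choices made for this illustration, not something from the talk.

```r
library(pins)

# A temporary board keeps the sketch free of side effects; in practice this
# would be a shared board that colleagues can also reach.
board <- board_temp()

# Publish a small data frame under a name chosen for this example
pin_write(board, head(mtcars), name = "mtcars-sample", type = "rds")

# Anyone with access to the same board reads the object back by name
shared <- pin_read(board, "mtcars-sample")
```

Swapping `board_temp()` for a shared board is the only change needed to make the pin visible to other projects.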

plumber APIs - Plumber allows you to create a web API by merely decorating your existing R source code with roxygen2-like comments.
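To illustrate what "decorating" looks like, here is a small sketch. The function `add_numbers` and the `/add` route are hypothetical names invented for this example; the `#*` lines are the plumber annotations.

```r
# An existing, ordinary R function (hypothetical example)
add_numbers <- function(a, b) {
  # Query-string parameters arrive as character strings, so convert first
  as.numeric(a) + as.numeric(b)
}

#* Return the sum of two numbers
#* @param a First number
#* @param b Second number
#* @get /add
function(a, b) {
  add_numbers(a, b)
}
```

Saved as a file such as plumber.R (an assumed file name), this could be served locally with `plumber::pr_run(plumber::pr("plumber.R"), port = 8000)`, provided the plumber package is installed.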

New tools and tricks: Sharing

Quarto - A new open-source scientific and technical publishing system that works with R, Python, Julia, JavaScript, and many other languages.

shinyuieditor - A visual tool for building the UI portion of a Shiny application that generates clean and human-readable code.

flexdashboard - Flexible, easy-to-specify row- and column-based layouts. Components are intelligently resized to fill the browser and adapted for display on mobile devices.

RStudio/Posit Connect - Hosting for analytics, dashboards, APIs, and pinned datasets while working in an enterprise environment.

New tools and tricks: Productionize

Version control - GitHub is an Internet hosting service for software development and version control using Git. Other options include Bitbucket, GitLab, and Azure DevOps.

renv - renv helps manage library paths (and other project-specific state) to help isolate your project’s dependencies.

RStudio/Posit Package Manager - In addition to providing standard mirrors of CRAN, Bioconductor, and PyPI, it lets you track changes over time or freeze packages to specific versions, which helps ensure reproducibility and eases collaboration.

Environment Management Strategies

Great data science work should be reproducible and collaborative.

Example: Using renv, git, and Public Package Manager

Step 1: Use pre-compiled packages

  • Go to Public Package Manager

  • Click on Get Started -> Setup -> Distribution and select your OS -> Select Latest or Freeze and follow the instructions below the calendar.

  • For example:

options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest"))
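A quick way to confirm the setting took effect is to read the option back. In this sketch the label `PPM` is an arbitrary name chosen for illustration; the URL is the one shown above.

```r
# URL taken from the example above; "PPM" is just a label chosen for this sketch
ppm_url <- "https://packagemanager.rstudio.com/all/latest"
options(repos = c(PPM = ppm_url))

# Check which repository install.packages() will now use
configured <- getOption("repos")
print(configured)
```

The setting only lasts for the current session. Placing the same `options()` call in the project's `.Rprofile` is one way to apply it on every startup; that placement is a suggestion rather than part of the original steps.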

Example: Using renv, git, and Public Package Manager

Step 2: Use environment tracking

options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest")) # We've already done this

## Set up a new version controlled R project and install renv
install.packages("renv")
library(renv)

## Initialize your project with renv
renv::init()

## After creating an R script and loading a couple of libraries, take a snapshot of the project
renv::snapshot()

## Repeat a couple times, changing the packages being called so that we see something interesting when we run:
renv::history()

## Optionally, revert the lockfile to an earlier snapshot (pass a commit from renv::history() via `commit =`):
renv::revert()
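By default, renv::snapshot() decides what to record by scanning the project's files for package usage. The sketch below shows that discovery step on its own, assuming renv is installed. The temporary directory and the analysis.R file name are made up for illustration.

```r
# Create a throwaway project directory containing one script
project_dir <- tempfile("renv-demo-")
dir.create(project_dir)
writeLines(
  c("library(jsonlite)", "dplyr::glimpse(mtcars)"),
  file.path(project_dir, "analysis.R")
)

# Ask renv which packages the script refers to.
# This only parses the file; nothing is installed or loaded.
deps <- renv::dependencies(project_dir, quiet = TRUE)
unique(deps$Package)
```

Packages that no file in the project refers to are not picked up this way, which is why the snapshot changes as the scripts change.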

Example: Using renv, git, and Public Package Manager

Step 3: Easy collaboration

options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest")) # We've already done this

## Send a colleague the link to your project's git repository; after cloning it, they restore your environment with:
renv::restore()
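For scripted or non-interactive use, the restore step can be wrapped in a small guard. The helper `restore_project()` below is hypothetical and written only for this sketch. It assumes the cloned project contains the renv.lock file created by renv::snapshot().

```r
# Hypothetical helper: restore only when the project actually ships a lockfile
restore_project <- function(project_dir = getwd()) {
  lockfile <- file.path(project_dir, "renv.lock")
  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", project_dir, "; nothing to restore.")
    return(invisible(FALSE))
  }
  # prompt = FALSE skips the interactive confirmation
  renv::restore(project = project_dir, prompt = FALSE)
  invisible(TRUE)
}
```

Calling `restore_project()` from the root of the cloned repository installs the package versions recorded in the lockfile into the project library.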

Case Study

Where to go from here

What They Forgot to Teach You About R : https://rstats.wtf/

Happy Git with R : https://happygitwithr.com/

Get started with renv in the RStudio IDE: https://docs.posit.co/ide/user/ide/guide/environments/r/renv.html

Vetiver

Using Public Package Manager : https://support.rstudio.com/hc/en-us/articles/360046703913-FAQ-for-RStudio-Public-Package-Manager
