2024-04-18
Lisa Anders - Posit Solutions Engineer and self-professed R nerd learning to love Python. Engineer turned data scientist turned Posit admin and excited to share lessons learned the hard way to make things easier for others. Also awkward at writing intros in the third person!
Illustration from Hadley Wickham’s talk “The Joy of Functional Programming (for Data Science)” by Allison Horst
Imagine a world where code doesn’t disappear and is backed up in version control, environments are explicit and easily reproduced, and secrets are responsibly managed.
Version control your code
Reproducible environments
Keep secrets secret
Illustration from Jenny Bryan and Happy Git with R
Who here has heard of git? Is anyone already using version control?
Git was developed by the one and only Linus Torvalds (of Linux fame) and initially released in 2005 (19 years ago!).
Git itself is an open source utility, but it is most commonly used through a managed service (these are all pretty much the same under the hood). Here is a more complete list of version control software options:
Today’s focus will be on GitHub
Linux penguin from Wikipedia
Think of a process with some document (“mythesis”):
Horrible to self-manage, right? But having software do this for you is marvelous!
Git does not do “file locking” out of the box, but some newer features now make this possible:
DEMO (google drive)
First, you’ll need to set up your credentials. You can do this either with a PAT or an SSH key.
PAT:
Copy the generated PAT to your clipboard. Provide this PAT next time a Git operation asks for your password OR store the PAT explicitly.
For example, you can store a PAT for a GitHub Enterprise deployment like so:
Add your email if needed:
Check that it stored with:
SSH:
From terminal:
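# Generate a new ed25519 key pair
ssh-keygen -t ed25519 -C "your_email@example.com"
# Start the ssh-agent and add the new key to it
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Confirm the key files exist
ls -al ~/.ssh
# Copy the public key to the clipboard (Git Bash on Windows)
clip < ~/.ssh/id_ed25519.pub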
Follow here to add it to your GitHub account.
Follow here to remove / add a password.
You may also need to configure some global options:
git config --global user.name ""
git config --global user.email ""
You can check that SSH keys exist, and what they are named, with:
ls ~/.ssh
ssh-keygen -p
From RStudio:
You’ll copy / update code using the SSH method from git.
This can also be done directly in the terminal.
Troubleshooting from R:
library(usethis)
library(gitcreds)
library(gh)
library(credentials)
usethis::gh_token_help()
usethis::git_sitrep()
gh::gh_whoami()
# Remove any pre-existing credentials:
gitcreds::gitcreds_delete()
Troubleshooting from command line:
Troubleshooting git when working in cloud environments:
On a cloud instance, check that it is writing to somewhere persistent, i.e. into /home/ and not into /opt/. Find where your library is:
.libPaths()
Change where git credentials are stored:
Illustration by Allison Horst
Teams standardize onto a branching philosophy:
Centralized workflow: Teams use only a single repository and commit directly to the main branch.
Feature branching: Teams use a new branch for each feature and don’t commit directly to the main branch.
GitFlow: An extreme version of feature branching in which development occurs on the develop branch, moves to a release branch, and merges into the main branch.
Personal branching: Similar to feature branching, but rather than develop on a branch per feature, it’s per developer. Every user merges to the main branch when they complete their work.
DEMO (personal repo, project repo)
Branching strategies from GitLab
As a reviewer:
Demo
Illustration by Flourish Australia
Great data science work should be reproducible and collaborative. There are many ways to achieve this, but it’s easy to fall into some bad strategies.
Environments Strategy Maps https://solutions.rstudio.com/environments/reproduce/
Step 1: Use pre-compiled packages
Go to Public Package Manager
Click on Get Started -> Setup -> Distribution and select your OS -> Select Latest or Freeze and follow the instructions below the calendar.
For example:
Step 2: Use environment tracking
# Set up a new version controlled R project and install renv:
install.packages("renv")
library(renv)
# Initialize your project with renv and take a snapshot:
renv::init()
renv::snapshot()
# Update all packages, or revert back to an earlier snapshot:
renv::update()
renv::revert()
# History is saved into version control:
renv::history()
renv https://rstudio.github.io/renv, Package Manager https://rstudio.com/products/package-manager/ and Public Package Manager https://packagemanager.rstudio.com/
Step 3: Easy collaboration
# Have your colleague configure their repository to match yours:
options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest"))
# Send a colleague the link to your project on git; they'll restore your environment with:
renv::restore()
Follow the same steps as with R.
Step 1: Use pre-compiled packages
# Configure pip to use packages from public package manager:
pip config set global.index-url https://packagemanager.posit.co/pypi/latest/simple
pip config set global.trusted-host packagemanager.posit.co
Step 2: Use environment tracking
# Create and activate a virtual environment:
python -m venv .venv
. .venv/bin/activate
# Take a snapshot of the environment:
pip freeze > requirements.txt
Step 3: Easy collaboration
# Send a colleague the link to your project on git; they'll restore your environment with:
pip install -r requirements.txt
venv https://docs.python.org/3/library/venv.html, pip requirements.txt https://pip.pypa.io/en/stable/reference/requirements-file-format/, Package Manager https://rstudio.com/products/package-manager/ and Public Package Manager https://packagemanager.rstudio.com/
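One way to sanity-check the snapshot before sharing it is to confirm that every line in requirements.txt pins an exact version. The sketch below is illustrative only: the helper unpinned_requirements is hypothetical, the example contents are made up, and it handles just the simple name==version lines that pip freeze typically writes, not the full requirements file format.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Hypothetical contents, in the style of `pip freeze` output.
example = """\
pandas==2.2.2
requests
numpy==1.26.4  # pinned
"""
print(unpinned_requirements(example))  # prints ['requests']
```

An empty result suggests a colleague running pip install -r requirements.txt should get the same versions you froze.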
Avoid exposing secret keys, passwords, and sensitive config parameters directly in your code. This keeps those pieces more secure and less likely to escape into the wild.
What is a secret?
How many times are secrets exposed?
usethis has a function for creating and editing the .Renviron file:
library(usethis)
# Edit the global .Renviron file
usethis::edit_r_environ()
# Edit the project specific .Renviron file
usethis::edit_r_environ(scope = "project")
Add the variables to that file in the format variable_name = "variable_value" and save it. Restart the session so the new environment variables are loaded, either with Ctrl+Shift+F10 or through the RStudio IDE with Session -> Restart R.
Saved variables can be accessed with Sys.getenv("variable_name").
Same as with R, environment variables can be referenced and used in your code:
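As a minimal sketch of that pattern in Python, the snippet below reads a secret from the environment with the standard library's os.environ instead of hardcoding it. The variable name API_KEY and the helper get_secret are hypothetical, chosen only for illustration.

```python
import os


def get_secret(name: str) -> str:
    """Return the environment variable `name`, failing loudly if it is unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical variable name. In real use it is set outside the code
# (for example in the shell or in a .env file that git ignores);
# the dummy default here only keeps the demo runnable.
os.environ.setdefault("API_KEY", "dummy-value-for-demo")
api_key = get_secret("API_KEY")
```

Raising an error when the variable is missing tends to surface configuration problems earlier than silently continuing with an empty value.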
The .gitignore file is powerful! Tell git to ignore certain files that exist locally, including by pattern, such as *.html.
Here’s an example .gitignore:
# History files
.Rhistory
.Rapp.history
# Session Data files
.RData
# Example code in package build process
*-Ex.R
# Output files from R CMD build
/*.tar.gz
# Output files from R CMD check
/*.Rcheck/
# RStudio files
.Rproj.user/
# produced vignettes
vignettes/*.html
vignettes/*.pdf
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth
# knitr and R markdown default cache directories
/*_cache/
/cache/
# Temporary files created by R markdown
*.utf8.md
*.knit.md
# Shiny token, see https://shiny.rstudio.com/articles/shinyapps.html
rsconnect/
# Deployment details from rsconnect-python
rsconnect-python/
# Temporary files
.DS_Store
__pycache__
.ipynb_checkpoints
rmarkdown-notebook/flights.csv
.venv
venv
.env
.Rprofile
/.luarc.json
/.quarto/
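To build intuition for how the simple wildcard entries above behave, the sketch below matches file names against a few of them with Python's fnmatch module. This is an approximation under stated assumptions: fnmatch handles only basic wildcards, not full .gitignore semantics (leading slashes, trailing slashes for directories, negation), and the helper is_ignored and the candidate file names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A few simple entries taken from the text and the example .gitignore.
PATTERNS = ["*.html", "*.knit.md", ".Rhistory", ".env"]


def is_ignored(path: str) -> bool:
    """Approximate check: does the file name match any simple pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in PATTERNS)


# Hypothetical file names, for illustration only.
for candidate in ["report.html", "analysis.R", "docs/notes.knit.md", ".env"]:
    print(candidate, is_ignored(candidate))
```

For an authoritative answer inside a real repository, git check-ignore -v <path> reports which rule, if any, causes a file to be ignored.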
Starting with version 1.6, RStudio Connect allows Environment Variables. The variables are encrypted on disk and in memory.
Adding a variable through the Posit Connect UI is easy peasy:
Imagine a world where code doesn’t disappear and is backed up in version control, environments are explicit and easily reproduced, and secrets are responsibly managed. Wouldn’t that be great?
Version control:
Environment management:
Secrets: