It’s Dangerous to go Alone! Take This

Lisa Anders, Posit Solutions Engineer

2024-04-18

Intro

Lisa Anders - Posit Solutions Engineer and self-professed R nerd learning to love Python. Engineer turned data scientist turned Posit admin and excited to share lessons learned the hard way to make things easier for others. Also awkward at writing intros in the third person!

Illustration from Hadley Wickham’s talk “The Joy of Functional Programming (for Data Science)” by Allison Horst

It’s Dangerous to go Alone! Take This

Imagine a world where code doesn’t disappear and is backed up in version control, environments are explicit and easily reproduced, and secrets are responsibly managed.

Version control your code

Reproducible environments

Keep secrets secret

Version control your code

Illustration from Jenny Bryan and Happy Git with R

Version control: What is this thing called “git”?

Who here has heard of git? Is anyone already using version control?

Git was developed by the one and only Linus Torvalds (of Linux fame) and was initially released in 2005 (19 years ago!).

Git itself is an open source utility, but it is most commonly used through a hosting service (the Git-based ones are all pretty much the same under the hood). Here is a broader list of version control and related tooling options:

  • Azure DevOps
  • Bitbucket
  • GitHub
  • GitLab
  • Subversion (SVN), a separate version control system rather than a Git host
  • Jenkins, a CI/CD automation server that works alongside version control

Today’s focus will be on GitHub.

Linux penguin from Wikipedia

Version control: But really, what is it anyway?

Think of a process with some document (“mythesis”):

  • Make a copy (“mythesis001”) and make your changes
  • Compare your copy to the original document (“Updated with better abstract!”)
  • Ask someone to review the differences (“LGTM!” - your advisor)
  • Pull in those changes

Horrible to self-manage, right? But having software do this for you is marvelous!
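The copy, compare, review cycle above boils down to computing a diff between two versions of a file. As a small illustration (not how git is implemented internally), Python’s standard difflib module can show what changed between an original and a revised copy. The thesis text here is invented for the example.

```python
import difflib
from itertools import islice

# Two hypothetical versions of the same document, one line per list item
original = [
    "Title: My Thesis",
    "Abstract: This is a draft abstract.",
    "Chapter 1: Introduction",
]
revised = [
    "Title: My Thesis",
    "Abstract: This is a better abstract!",
    "Chapter 1: Introduction",
]


def summarize_changes(before, after):
    """Return only the removed ('-') and added ('+') lines between two versions."""
    diff = difflib.unified_diff(
        before, after, fromfile="mythesis", tofile="mythesis001", lineterm=""
    )
    changes = []
    # The first two diff lines are the file headers, so skip them by position
    for line in islice(diff, 2, None):
        if line.startswith("@@"):
            continue  # hunk marker, not a content line
        if line.startswith(("+", "-")):
            changes.append(line)
    return changes


changes = summarize_changes(original, revised)
for line in changes:
    print(line)
# -Abstract: This is a draft abstract.
# +Abstract: This is a better abstract!
```

This is the same kind of output a reviewer sees in a pull request: unchanged lines fade into the background and only the edits need attention.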

What Git doesn’t do by default is “file locking”, although some newer features now make this possible:

DEMO (google drive)

Version control: Credentials (PAT)

First, you’ll need to set up your credentials. You can do this with either a personal access token (PAT) or an SSH key.

PAT:

usethis::create_github_token()

Copy the generated PAT to your clipboard. Provide this PAT next time a Git operation asks for your password OR store the PAT explicitly.

gitcreds::gitcreds_set()

For example, you can store a PAT for a GitHub Enterprise deployment like so:

gitcreds::gitcreds_set("https://github.acme.com")

Add your email if needed:

git config --global user.email "lisa.anders@posit.co"

Check that it stored with:

gitcreds::gitcreds_get()

Version control: Credentials (SSH)

From terminal:

  • Generate a new key with: ssh-keygen -t ed25519 -C "your_email@example.com"
  • Start the ssh-agent in the background and add your SSH key to it: eval "$(ssh-agent -s)" && ssh-add ~/.ssh/id_ed25519
  • Or find an existing key with: ls -al ~/.ssh
  • Copy the public key to your clipboard: clip < ~/.ssh/id_ed25519.pub (clip is the Windows command; on macOS use pbcopy instead)

Follow the GitHub documentation to add the key to your GitHub account.

Follow the GitHub documentation to remove or add a key passphrase.

You may also need to configure some global options:

git config --global user.name ""
git config --global user.email ""

You can check which SSH keys exist (and their file names) with ls, and change a key’s passphrase with ssh-keygen -p:

ls ~/.ssh
ssh-keygen -p

Version control: Credentials (SSH)

From RStudio:

  • Go to Tools -> Global Options -> Git / SVN
  • Create SSH Key
  • Approve the key and add a password (if appropriate)
  • View Public Key
  • Copy the public key into the “SSH and GPG keys” section of your GitHub profile settings.

You’ll then clone and update code using the SSH URL from GitHub.

This can also be done directly in the terminal.

Version control: Oh no things have gone off the rails!

Troubleshooting from R:

library(usethis)
library(gitcreds)
library(gh)
library(credentials)

usethis::gh_token_help()
usethis::git_sitrep()
gh::gh_whoami()

# Remove any pre-existing credentials:
gitcreds::gitcreds_delete()

Troubleshooting from command line:

# Check the git configuration:
git config --list 

# Check if an existing repository was cloned with ssh or https:
git remote show origin

# Check which SSH keys exist (ls) and change a key's passphrase (ssh-keygen -p):
ls ~/.ssh
ssh-keygen -p
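A common source of confusion is a repository cloned over HTTPS when you set up SSH keys, or the other way around. The sketch below classifies a remote URL (as reported by git remote show origin) by its form. The function name and the example URLs are hypothetical, and only the common URL shapes are handled.

```python
def remote_protocol(url):
    """Classify a git remote URL as 'https', 'ssh', or 'other'."""
    url = url.strip()
    if url.startswith("https://"):
        return "https"
    if url.startswith("ssh://"):
        return "ssh"
    # scp-like SSH syntax, e.g. git@github.com:owner/repo.git
    if "://" not in url and "@" in url and ":" in url.split("@", 1)[1]:
        return "ssh"
    return "other"


# Placeholder URLs for illustration only
examples = [
    "https://github.com/example-user/example-repo.git",
    "git@github.com:example-user/example-repo.git",
    "ssh://git@github.com/example-user/example-repo.git",
]
for url in examples:
    print(remote_protocol(url), url)
```

If the protocol does not match the credentials you configured (PAT for HTTPS, key for SSH), switching the remote URL is usually simpler than reconfiguring credentials.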

Version control: Oh no things have gone off the rails!

Troubleshooting git when working in cloud environments:

On a cloud instance, check that it is writing to somewhere persistent, i.e. into /home/ and not into /opt/. Find where your library is:

.libPaths()

Change where git credentials are stored:

git config --global credential.helper 'store --file ~/.my-credentials'

Version control: so what are branches?

Illustration by Allison Horst

Version control workflow

Teams standardize on a branching strategy:

  • Centralized workflow: Teams use only a single repository and commit directly to the main branch.

  • Feature branching: Teams use a new branch for each feature and don’t commit directly to the main branch.

  • GitFlow: An extreme version of feature branching in which development occurs on the develop branch, moves to a release branch, and merges into the main branch.

  • Personal branching: Similar to feature branching, but rather than develop on a branch per feature, it’s per developer. Every user merges to the main branch when they complete their work.

DEMO (personal repo, project repo)

Branching strategies from GitLab

Version control best practices

  • When in doubt, work in a branch
  • Write meaningful commit messages and commits (“Made ABC do 123”)
  • Make small changes and merge often
  • Feedback from reviews is invaluable

As a reviewer:

  • Clearly distinguish between required changes, suggested changes, and just good ol’ recommendations or thoughts. It’s easy for reviews to come across as micro-managing, and this can help in distinguishing between what changes are needed vs what is just knowledge sharing.

Version control advanced topic: pipelines

Demo

https://solutions.posit.co/operations/deploy-methods/ci-cd/

How I sleep knowing my code is safely backed up

Illustration by Flourish Australia

Questions?

Reproducible environment management

Great data science work should be reproducible and collaborative. There are many ways to achieve this, but it’s easy to fall into some bad strategies.

Environment management example: Using renv, git, and Public Package Manager

Step 1: Use pre-compiled packages

  • Go to Public Package Manager

  • Click on Get Started -> Setup -> Distribution and select your OS -> Select Latest or Freeze and follow the instructions below the calendar.

  • For example:

options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest"))

Environment management example: Using renv, git, and Public Package Manager

Step 2: Use environment tracking

# Set up a new version controlled R project and install renv:
install.packages("renv")
library(renv)

# Initialize your project with renv and take a snapshot:
renv::init()
renv::snapshot()

# Update all packages, or revert back to an earlier snapshot:
renv::update()
renv::revert()

# History is saved into version control:
renv::history()
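The lockfile that renv::snapshot() writes (renv.lock) is plain JSON, which is what makes the environment explicit and easy to review in version control. As a rough sketch in Python, assuming the lockfile keeps package records under a top-level "Packages" key, the pinned versions can be listed like this. The lockfile contents and version numbers below are made up for illustration, and a real renv.lock carries more fields.

```python
import json

# A trimmed, hypothetical lockfile for illustration only
example_lockfile = """
{
  "R": {"Version": "4.3.2"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.4", "Source": "Repository"},
    "renv": {"Package": "renv", "Version": "1.0.5", "Source": "Repository"}
  }
}
"""


def pinned_versions(lockfile_text):
    """Return {package_name: version} from renv-style lockfile JSON."""
    lock = json.loads(lockfile_text)
    packages = lock.get("Packages", {})
    return {name: record.get("Version") for name, record in sorted(packages.items())}


for name, version in pinned_versions(example_lockfile).items():
    print(f"{name}=={version}")
```

Because the file is plain text, a change in package versions shows up as an ordinary diff in a commit or pull request.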

Environment management example: Using renv, git, and Public Package Manager

Step 3: Easy collaboration

# Have your colleague configure their repository to match yours: 
options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/latest")) 

# Send a colleague the link to your project on git; they'll restore your environment with:
renv::restore()

Environment management example: Using venv, git, and Public Package Manager

Follow the same steps as with R.

Step 1: Use pre-compiled packages

# Configure pip to use packages from public package manager: 
pip config set global.index-url https://packagemanager.posit.co/pypi/latest/simple
pip config set global.trusted-host packagemanager.posit.co

Step 2: Use environment tracking

# Create and activate your virtual environment:
python -m venv .venv
. .venv/bin/activate

# Take a snapshot of the environment: 
pip freeze > requirements.txt

Step 3: Easy collaboration

# Send a colleague the link to your project on git, they'll restore your environment with:
pip install -r requirements.txt
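After restoring, it can be useful to confirm that the active environment really matches the snapshot. The following is a minimal sketch, assuming requirements.txt contains simple name==version pins as produced by pip freeze. The helper names are hypothetical, and package name normalization (hyphens versus underscores) is not handled.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines, skipping blanks, comments, and unpinned entries."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_versions():
    """Collect {name: version} for packages installed in the active environment."""
    found = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except KeyError:
            continue
        if name:
            found[name.lower()] = dist.version
    return found


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Hypothetical pins and installed versions, for illustration only
example_pins = parse_pins("pandas==2.2.1\nrequests==2.31.0  # pinned\n")
example_installed = {"pandas": "2.2.1", "requests": "2.30.0"}
print(find_mismatches(example_pins, example_installed))
# {'requests': ('2.31.0', '2.30.0')}
```

In a real project, the check would be something like find_mismatches(parse_pins(open("requirements.txt").read()), installed_versions()), run from inside the activated virtual environment.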

Questions?

Keep secrets secret

Avoid exposing secret keys, passwords, and sensitive config parameters directly in your code. This keeps those pieces more secure and less likely to escape into the wild.

What is a secret?

How many times are secrets exposed?

Secrets and RStudio

usethis has a function for creating and editing the .Renviron file:

library(usethis)

# Edit the global .Renviron file
usethis::edit_r_environ()

# Edit the project specific .Renviron file
usethis::edit_r_environ(scope = "project")

Add the variables to that file in the format variable_name = "variable_value" and save it. Then restart the session so the new environment variables are loaded, either with Ctrl+Shift+F10 or through the RStudio IDE with Session -> Restart R.

Saved variables can be accessed with:

variable_name <- Sys.getenv("variable_name")

Secrets and Python

As with R, environment variables can be referenced and used in your code:

import os

# Setting a new environment variable for the current process
# (demo only: in practice, set it outside your code so the secret is never committed)
os.environ["API_KEY"] = "YOUR_API_KEY"

# Retrieving the environment variable
api_key = os.environ["API_KEY"]
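Indexing os.environ directly raises a bare KeyError when a variable is missing, which can be confusing for a colleague running the project for the first time. One possible pattern, sketched here with hypothetical helper names and a placeholder variable, is to fail with a clear message and to mask the value whenever it is printed or logged.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; define it outside of your code."
        )
    return value


def mask(secret, visible=4):
    """Hide all but the last few characters so the value is safe to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder value for demonstration only; real secrets should never appear in code
os.environ.setdefault("EXAMPLE_API_KEY", "not-a-real-key")
print(mask(require_env("EXAMPLE_API_KEY")))
```

Masking is a habit worth keeping even in notebooks, since printed output often ends up committed or shared along with the code.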

Secrets and git

The .gitignore file is powerful! It tells git to ignore certain files that exist locally.

  • Listing a file name explicitly is usually the right approach, but wildcards can be used to exclude a whole type of file. For example: *.html.

Here’s an example .gitignore:

# History files
.Rhistory
.Rapp.history

# Session Data files
.RData

# Example code in package build process
*-Ex.R

# Output files from R CMD build
/*.tar.gz

# Output files from R CMD check
/*.Rcheck/

# RStudio files
.Rproj.user/

# produced vignettes
vignettes/*.html
vignettes/*.pdf

# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth

# knitr and R markdown default cache directories
/*_cache/
/cache/

# Temporary files created by R markdown
*.utf8.md
*.knit.md

# Shiny token, see https://shiny.rstudio.com/articles/shinyapps.html
rsconnect/

# Deployment details from rsconnect-python
rsconnect-python/

# Temporary files
.DS_Store
__pycache__
.ipynb_checkpoints

rmarkdown-notebook/flights.csv

.venv
venv
.env
.Rprofile

/.luarc.json

/.quarto/
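To build intuition for how simple wildcard entries behave, the sketch below uses Python’s fnmatch module as a rough stand-in for git’s matching. This is an approximation only: real .gitignore rules also cover directory entries, leading-slash anchoring, and negation, so git check-ignore -v <path> remains the authoritative check. The file names are invented for the example.

```python
from fnmatch import fnmatchcase

# A few simple entries of the kind shown in the example .gitignore
patterns = ["*.html", ".Rhistory", ".env"]


def is_ignored(filename, patterns):
    """Return True if the file name matches any simple wildcard pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


for name in ["report.html", ".env", "analysis.R"]:
    status = "ignored" if is_ignored(name, patterns) else "tracked"
    print(f"{name}: {status}")
# report.html: ignored
# .env: ignored
# analysis.R: tracked
```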

Secrets and Posit Connect

Starting with version 1.6, RStudio Connect (now Posit Connect) supports environment variables. The variables are encrypted on disk and in memory.

Adding a variable through the Posit Connect UI is easy peasy:

Imagine a world

Imagine a world where code doesn’t disappear and is backed up in version control, environments are explicit and easily reproduced, and secrets are responsibly managed. Wouldn’t that be great?

TLDR

Version control:

Environment management:

Secrets:

Questions?
