---
config:
theme: 'default'
themeVariables:
'git0': '#006eff8e'
'git1': '#ffcc00ff'
---
gitGraph
commit id: "previous work"
branch feature/my-analysis
checkout feature/my-analysis
commit id: "add analysis"
commit id: "add figures"
checkout main
merge feature/my-analysis
commit id: "next commit"
1 Version Control
Git is a distributed version control system that tracks changes to files over time, allowing you to review history, revert mistakes, and collaborate without overwriting each other’s work. For data science teams, version control is essential for reproducibility (knowing exactly which code produced which results), collaboration (multiple people working on the same project safely), and auditability (a clear record of what changed and why). GitHub is the most widely used platform for hosting Git repositories and adds features like pull requests, code review, and issue tracking on top of Git’s core capabilities.
1.1 GitHub Setup
1.1.1 Creating a GitHub Account
Visit github.com and sign up for a free account. The free tier includes unlimited public and private repositories and is sufficient for most team workflows.
1.1.2 Adding an SSH Key
SSH keys let you authenticate with GitHub without entering your username and password on every push or pull. To generate a key:
ssh-keygen -t ed25519 -C "your_email@example.com"
Accept the default file location (~/.ssh/id_ed25519) and optionally set a passphrase. Then print the public key so you can copy it:
cat ~/.ssh/id_ed25519.pub
In GitHub, go to Settings → SSH and GPG keys → New SSH key, paste the output, give it a descriptive title (e.g., your machine name), and save.
Verify the connection works:
ssh -T git@github.com
You should see a message like Hi username! You've successfully authenticated.
1.1.3 Cloning a Repository via SSH
When cloning a repository, prefer the SSH URL over HTTPS. On any GitHub repository page, click Code and select the SSH tab to get the URL.
git clone git@github.com:org/repo.git
Using SSH avoids repeated password prompts and works with SSH agent forwarding for server-based workflows.
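If a repository was originally cloned over HTTPS, its remote can be pointed at the SSH URL without re-cloning. The following is a minimal sketch run in a throwaway repository; the org/repo URLs are the same placeholders used above, not a real repository.

```shell
set -eu

# Work in a throwaway repository so nothing real is modified.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Simulate a clone that was made over HTTPS (placeholder URL).
git remote add origin https://github.com/org/repo.git

# Point the existing remote at the SSH URL instead.
git remote set-url origin git@github.com:org/repo.git

# Confirm which URL the remote now uses.
remote_url="$(git remote get-url origin)"
echo "$remote_url"
```

In an existing project, only the git remote set-url and git remote get-url lines are needed.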
1.1.4 Pulling and Pushing
After cloning, your local repository is linked to the remote (called origin by default). To sync your local branch with the latest changes from the remote:
git pull
After committing changes locally, push them to the remote:
git push
Git records which remote branch each local branch tracks, so these short-form commands work once that tracking relationship is established. Cloning sets it up for the default branch.
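To see which remote branch a local branch tracks, you can ask Git directly. The sketch below is self-contained: a local bare repository stands in for GitHub, and the file name and commit messages are illustrative only.

```shell
set -eu

work="$(mktemp -d)"

# A bare repository stands in for the GitHub remote.
git init -q --bare "$work/remote.git"

git init -q "$work/local"
cd "$work/local"
# Name the initial branch "main" regardless of the local Git defaults.
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# The first push with -u records the upstream branch.
git remote add origin "$work/remote.git"
git push -q -u origin main

# Show which remote branch the current branch tracks.
upstream="$(git rev-parse --abbrev-ref --symbolic-full-name '@{upstream}')"
echo "$upstream"

# A bare "git push" now works because the upstream is known.
echo "y <- 2" >> analysis.R
git commit -q -am "Add second variable"
git push -q
```

Running git branch -vv gives the same information for every local branch at once.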
1.2 Branching and Merging
Branches let multiple people work on different features or fixes simultaneously without interfering with each other. The main branch (or master in older repositories) typically represents the stable, production-ready state of the project.
1.2.1 Internal Organizational Workflow
In most team workflows, the main branch is protected — direct pushes are disabled and changes must go through a pull request (PR) with at least one reviewer. Developers create short-lived branches for each piece of work, then open a PR when ready.
Step 1: Create a branch from main
git checkout main
git pull
git checkout -b feature/my-analysis
Step 2: Make changes, stage, and commit
git add analysis.R writeup.qmd
git commit -m "Add initial exploratory analysis for Q1 data"
Step 3: Push the branch to GitHub
git push -u origin feature/my-analysis
The -u flag sets the upstream tracking branch so future git push and git pull commands work without specifying the remote and branch name.
Step 4: Open a pull request
On GitHub, navigate to the repository. A banner will appear prompting you to open a PR from your recently pushed branch. Click Compare & pull request, add a description, assign reviewers, and submit.
Step 5: Review, merge, and delete
A reviewer approves the PR and merges it into main via the GitHub UI. After merging, delete the branch on GitHub (there is a button on the merged PR page). Locally, clean up with:
git checkout main
git pull
git branch -d feature/my-analysis
Code review is easier when diffs contain only meaningful changes. If one contributor reformats a file by hand and another does not, a PR’s diff may be dominated by whitespace changes that obscure the actual logic. Using a formatter like Air (see Section 12.2) on every save means diffs reflect intent, not style preferences.
Branch naming conventions help the team understand what a branch is for at a glance. Common prefixes:
- feature/: new functionality (e.g., feature/survival-analysis)
- fix/: bug fixes (e.g., fix/date-parsing-error)
- analysis/: exploratory or one-off analyses (e.g., analysis/q2-cohort)
- docs/: documentation updates
Keep branches short-lived — ideally merged within a few days. Long-running branches diverge significantly from main, making merges painful and increasing the likelihood of conflicts.
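One way to spot branches that have outlived their purpose is to list those already merged into main. The sketch below builds a throwaway repository with one merged and one unmerged branch; the branch and file names are illustrative.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# A branch that gets merged into main (here as a fast-forward).
git checkout -q -b feature/my-analysis
echo "y <- 2" >> analysis.R
git commit -q -am "Extend analysis"
git checkout -q main
git merge -q --no-edit feature/my-analysis

# A branch that is still in progress.
git checkout -q -b docs/readme
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"
git checkout -q main

# Branches fully merged into main can be removed with "git branch -d".
merged="$(git branch --merged main --format='%(refname:short)' | grep -v '^main$')"
echo "$merged"
```

Only feature/my-analysis is reported, because docs/readme still holds a commit that main does not have.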
1.2.2 Contributing to External Repositories
When contributing to a repository you don’t have write access to — such as an open-source tool or a public project from another team — the fork workflow is used instead. A fork is a personal copy of the repository under your GitHub account.
flowchart TD
U["upstream
ORIGINAL_ORG/repo"]
O["origin
YOUR_USERNAME/repo"]
L["local clone"]
U -->|"fork"| O
O -->|"clone"| L
L -->|"push"| O
O -->|"pull request"| U
U -->|"fetch upstream"| L
Step 1: Fork the repository
On the GitHub repository page, click Fork (top right) and choose your account as the destination.
Step 2: Clone your fork
git clone git@github.com:YOUR_USERNAME/repo.git
cd repo
Step 3: Add the original repository as upstream
git remote add upstream git@github.com:ORIGINAL_ORG/repo.git
Step 4: Branch, work, commit, and push to your fork
git checkout -b fix/typo-in-readme
# ... make changes ...
git add README.md
git commit -m "Fix typo in installation section"
git push -u origin fix/typo-in-readme
Step 5: Open a pull request to the upstream repository
On your fork’s GitHub page, click Contribute → Open pull request. This creates a PR from your fork’s branch into the original repository’s main branch.
Step 6: Keep your fork in sync with upstream
As the original repository receives new commits, your fork will fall behind. Sync it with:
git fetch upstream
git checkout main
git merge upstream/main
git push origin main
In the fork workflow, origin refers to your fork and upstream refers to the original repository. Keeping these straight avoids accidentally pushing to or pulling from the wrong remote.
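The sync step can be rehearsed entirely on a local machine. In the sketch below, two bare repositories stand in for the upstream project and your fork, and a "maintainer" clone exists only to add commits to upstream; all names are illustrative. It uses git merge --ff-only, an optional safeguard that refuses to merge if your local main has diverged from upstream.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# A bare repository stands in for ORIGINAL_ORG/repo on GitHub.
git init -q --bare upstream.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/main

# A maintainer's working copy, used only to put commits into upstream.
git init -q maintainer
cd maintainer
git symbolic-ref HEAD refs/heads/main
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
echo "# repo" > README.md
git add README.md
git commit -q -m "Initial commit"
git remote add origin "$work/upstream.git"
git push -q origin main
cd "$work"

# "Fork" upstream (a server-side copy), clone the fork, and add upstream.
git clone -q --bare upstream.git fork.git
git clone -q fork.git local
git -C local remote add upstream "$work/upstream.git"

# Upstream moves ahead after the fork was made.
cd "$work/maintainer"
echo "See INSTALL for setup." >> README.md
git commit -q -am "Document installation"
git push -q origin main

# Bring the new upstream commit into the local clone, then into the fork.
cd "$work/local"
git fetch -q upstream
git checkout -q main
git merge -q --ff-only upstream/main
git push -q origin main

# Show both remotes and where they point.
git remote -v
```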
1.3 Git in Positron
Positron includes a built-in Source Control panel (the branching icon in the left sidebar, or Ctrl+Shift+G / Cmd+Shift+G) that provides a graphical interface for the most common Git operations.
Viewing changes
The Source Control panel lists all files with uncommitted changes. Files are grouped into Staged Changes and Changes (unstaged). Click any file to open a diff view showing exactly what was added or removed.
Staging files
Hover over a file and click the + icon to stage it, or click the + next to the Changes heading to stage all modified files at once. To unstage, click the - icon next to a staged file.
Committing
Type a commit message in the text box at the top of the Source Control panel and click the Commit button (or press Ctrl+Enter / Cmd+Enter). This is equivalent to git commit -m "your message".
Branching
The current branch name appears in the status bar at the bottom of the window. Click it to open the branch picker, where you can switch to an existing branch or create a new one. This is equivalent to git checkout or git checkout -b.
Pushing and pulling
The Source Control panel has Push and Pull buttons in its toolbar (the ... menu or the sync icon in the status bar). The sync button performs a pull followed by a push in one action.
The integrated terminal in Positron (Terminal → New Terminal) is always available for Git operations that the UI doesn’t expose — such as adding a remote, rebasing, cherry-picking, or any command requiring flags. The UI and terminal work on the same repository state, so you can mix both freely.
1.4 Merge Conflicts
A merge conflict occurs when two branches have made changes to the same lines of the same file, and Git cannot automatically determine which version to keep. Conflicts most commonly arise during git merge, git rebase, or when accepting a pull request that conflicts with recent changes to main.
---
config:
theme: 'default'
themeVariables:
'git0': '#006eff8e'
'git1': '#ffcc00ff'
---
gitGraph LR:
commit id: "shared history"
branch feature-a
checkout feature-a
commit id: "edit analysis.R"
checkout main
branch feature-b
checkout feature-b
commit id: "also edit analysis.R"
checkout main
merge feature-a
merge feature-b id: "conflict!" type: HIGHLIGHT
1.4.1 What conflict markers look like
When a conflict occurs, Git edits the affected file to mark the conflicting regions:
<<<<<<< HEAD
flu_clean <- flu |> filter(cases > 0, !is.na(county))
=======
flu_clean <- flu |> filter(!is.na(county), cases >= 1)
>>>>>>> feature/update-filter-logic
- Everything between <<<<<<< HEAD and ======= is the version from your current branch.
- Everything between ======= and >>>>>>> is the version from the branch being merged in.
1.4.2 Resolving conflicts
- Run git merge <branch> (or let a PR trigger the conflict). Git will report which files have conflicts.
- Open each conflicted file. Positron highlights conflict regions with inline buttons: Accept Current Change, Accept Incoming Change, Accept Both Changes, or Compare Changes. Click the appropriate option or edit the file manually to produce the correct final version.
- After resolving all conflicts in a file, save it and stage it:
git add path/to/resolved-file.R
- Once all conflicted files are resolved and staged, complete the merge:
git commit
Git will pre-populate a commit message describing the merge; you can accept it as-is.
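The resolution steps can be practiced safely in a throwaway repository. The sketch below deliberately creates a conflict, lists the conflicted files, and completes the merge; the file content and branch name are illustrative, and the resolution is written with echo where you would normally edit the file.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo 'threshold <- 0' > analysis.R
git add analysis.R
git commit -q -m "Add threshold"

# Two branches change the same line in different ways.
git checkout -q -b feature-a
echo 'threshold <- 1' > analysis.R
git commit -q -am "Raise threshold to 1"
git checkout -q main
echo 'threshold <- 5' > analysis.R
git commit -q -am "Raise threshold to 5"

# The merge stops with a conflict; "|| true" keeps the script running.
git merge feature-a >/dev/null 2>&1 || true

# List the files that still have unresolved conflicts.
conflicted="$(git diff --name-only --diff-filter=U)"
echo "Conflicted: $conflicted"

# Resolve by writing the intended final content (normally done in the editor),
# then stage the file and complete the merge with the pre-populated message.
echo 'threshold <- 5' > analysis.R
git add analysis.R
git commit -q --no-edit

remaining="$(git diff --name-only --diff-filter=U)"
echo "Remaining conflicts: ${remaining:-none}"
```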
Never leave conflict markers (<<<<<<<, =======, >>>>>>>) in committed code. The file will be syntactically broken and the code will not run. Always verify the conflict is fully resolved before staging.
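Git can check staged changes for leftover markers before you commit. The sketch below stages a file that still contains markers and runs git diff --cached --check, which exits with a non-zero status when it finds conflict markers (it also reports whitespace errors). The repository and file content are illustrative.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'x <- 1\n' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Simulate a file that was staged without resolving the conflict.
printf '<<<<<<< HEAD\nx <- 1\n=======\nx <- 2\n>>>>>>> feature/update-filter-logic\n' > analysis.R
git add analysis.R

# --check exits non-zero if the staged changes contain conflict markers.
if git diff --cached --check >/dev/null; then
  marker_status="clean"
else
  marker_status="markers found"
fi
echo "$marker_status"
```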
git status is your best friend during a conflict. It lists which files are in conflict (both modified), which are already resolved and staged, and what step to take next (commit or continue a rebase).
git status
1.5 .gitignore
A .gitignore file tells Git which files and directories to leave untracked. Anything listed there will not show up in git status, will not be staged by git add ., and will never be committed. Every project should have one.
1.5.1 Why it matters for data science
Data science projects routinely contain files that should never go into version control:
- Raw and processed data. Large files bloat a repository and slow every operation. Sensitive data files may have legal or compliance restrictions on where they can be stored. The project directory structure in Section 2.3 places a .gitignore directly inside data/ and output/ to block these files at the source.
- Credentials and secrets. .env files, config files with API keys or database passwords, and personal config files with paths to secure shared drives (see Section 2.3.2) should never be committed. Once a secret is in Git history, it is effectively public even if you delete it later.
- Derived outputs. Figures, rendered reports, and cached results can be regenerated from code. Committing them creates noise in diffs and false impressions that something changed.
- Environment and editor artifacts. Files like .DS_Store (macOS), .Rhistory, .RData, and IDE-specific folders communicate nothing useful to collaborators.
1.5.2 Syntax
A .gitignore at the root of the repository applies to the whole project. The syntax is straightforward:
# Lines starting with # are comments
# Ignore a specific file
.env
# Ignore all files with this extension
*.csv
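# Ignore everything inside a directory (data/raw/ would ignore the directory itself)
data/raw/*
# Ignore a pattern anywhere in the tree
**/.DS_Store
# Exception: track this one file even though *.csv and data/raw/* are ignored.
# This works only because the directory itself is not ignored: Git cannot
# re-include a file whose parent directory is excluded.
!data/raw/small-example.csv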
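Patterns without a / match at any depth below the .gitignore file. A leading / (or a / in the middle of a pattern) anchors the pattern to the directory containing the .gitignore, which is the repository root for a root-level file. A trailing / matches directories only.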
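When it is unclear why a file is or is not ignored, Git can report the rule responsible. The sketch below builds a throwaway repository with a small set of illustrative patterns and file names, asks git check-ignore which rule matches an ignored file, and lists the untracked files that remain visible.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q

# Illustrative patterns: ignore CSVs and the contents of data/raw,
# but keep one small example file.
cat > .gitignore <<'EOF'
*.csv
data/raw/*
!data/raw/small-example.csv
EOF

mkdir -p data/raw
touch data/raw/cases.csv data/raw/small-example.csv notes.txt

# Show which .gitignore line causes this path to be ignored.
git check-ignore -v data/raw/cases.csv

# List untracked files that are NOT ignored (what "git add ." would stage).
visible="$(git ls-files --others --exclude-standard)"
echo "$visible"
```

The second command is a quick preview of what git add . would pick up, which makes it a useful check after editing a .gitignore.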
1.5.3 A starting point for R and Python projects
GitHub maintains a collection of language-specific templates at github.com/github/gitignore. When you create a new repository on GitHub, you can select one of these as a starting point. For most data science work in R and Python, a reasonable .gitignore covers:
# R
.Rhistory
.RData
.Rproj.user/
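# Remove the next line if you want to commit the project file
# (.gitignore has no trailing comments: a # after a pattern becomes part of the pattern)
*.Rproj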
# Python
__pycache__/
*.py[cod]
.venv/
.env
# Data and outputs -- better handled with a .gitignore inside each folder
# (see project structure in the data organization chapter)
data/
output/
# OS artifacts
.DS_Store
Thumbs.db
# Credentials and personal config
# Ignore config.yml only if it contains paths or secrets
config.yml
*.key
*.pem
A .gitignore only affects untracked files. If you accidentally committed a file before adding it to .gitignore, it will continue to be tracked. To stop tracking it without deleting it locally:
git rm --cached path/to/file
Then commit the removal and add the file to .gitignore.
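The full sequence can be tried in a throwaway repository first. In the sketch below, the .env file and its placeholder content are illustrative; the point is that the file stays on disk while Git stops tracking it.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# A file that was committed by mistake.
echo "API_KEY=placeholder" > .env
git add .env
git commit -q -m "Add config"

# Stop tracking it but keep the local copy, then ignore it going forward.
git rm -q --cached .env
echo ".env" >> .gitignore
git add .gitignore
git commit -q -m "Stop tracking .env"

# Empty output means Git no longer tracks the file.
tracked="$(git ls-files .env)"
echo "Tracked: ${tracked:-no}"
ls -a
```

The file is still present in earlier commits (git log --oneline -- .env lists them), so a secret committed this way should still be rotated.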
Per-directory .gitignore files work alongside the root one. The project structure in Section 2.3 places a .gitignore inside data/ and another inside output/. This is a good pattern: the intent is visible right where the data lives, and it protects those folders even if the root .gitignore is later modified.
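One common form of a per-directory .gitignore ignores everything in the folder except the .gitignore itself. Whether Section 2.3 uses exactly these two lines is an assumption; the sketch below only shows the effect of the pattern, with an illustrative file name.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
mkdir data

# Ignore everything in this folder except the .gitignore itself,
# so the otherwise empty folder still exists in a fresh clone.
printf '*\n!.gitignore\n' > data/.gitignore

touch data/patients.csv

# Only data/.gitignore is visible to Git; the data file is not.
visible="$(git ls-files --others --exclude-standard)"
echo "$visible"
```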