2  Data Organization

Good data organization is foundational to reproducible, efficient research. Decisions made at the moment of data collection and storage ripple through every downstream step—analysis, visualization, sharing, and archiving. This chapter covers practical principles for organizing data in spreadsheets, naming files consistently, structuring project directories, and storing data in formats that work well with code.

2.1 Data Organization in Spreadsheets

Broman and Woo’s “Data Organization in Spreadsheets” (Broman and Woo 2018) offers twelve practical recommendations for anyone who stores data in Excel or similar tools. The core philosophy: structure your data so that it can be read directly into R or Python without manual cleanup.

2.1.1 Be Consistent

Consistency is the single most important principle. Pick conventions and stick to them within a project:

  • Use the same codes for categorical variables (e.g., always "M" and "F", not sometimes "Male")
  • Use the same missing value code throughout (e.g., NA)
  • Use the same variable names across files and time points
  • Use the same date format everywhere
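
When inconsistency has already crept in, recoding it away is a common first cleanup step. A minimal sketch in R with dplyr (the column name and codes are hypothetical):

library(dplyr)

# Hypothetical data with inconsistent sex codes
raw <- data.frame(sex = c("M", "Male", "F", "female", "F"))

# Collapse everything to the "M"/"F" convention
clean <- raw |>
  mutate(sex = case_when(
    sex %in% c("M", "Male", "male") ~ "M",
    sex %in% c("F", "Female", "female") ~ "F",
    TRUE ~ NA_character_
  ))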

2.1.2 Make It a Rectangle

Raw data should be a single, clean rectangle:

  • Rows are observations (subjects, samples, time points)
  • Columns are variables
  • One header row at the top with short, meaningful column names
  • No merged cells, no blank rows used for visual separation, no summary rows mixed in

This is consistent with the principles of tidy data (Wickham 2014), where each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

2.1.3 One Thing Per Cell

Each cell should contain exactly one value. Don’t cram multiple pieces of information into a single cell (e.g., "120/80" for blood pressure—use two columns instead). Don’t use a cell to store a note alongside a value.
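
If combined values do arrive, splitting them is straightforward in code. A minimal sketch using tidyr, with hypothetical column names:

library(tidyr)

# Hypothetical combined blood pressure readings
bp <- data.frame(blood_pressure = c("120/80", "135/90"))

# Split into one value per cell: bp_systolic and bp_diastolic
bp <- separate(bp, blood_pressure,
               into = c("bp_systolic", "bp_diastolic"),
               sep = "/", convert = TRUE)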

2.1.4 Fill in All Cells

Don’t leave cells blank to imply “same as above.” Every cell should contain an explicit value. Use a consistent missing data code (e.g., NA) rather than leaving cells empty—blank cells are ambiguous: was the value missing, not collected, or zero?
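
If you inherit a spreadsheet where blanks meant “same as above,” the repair can at least be scripted. A sketch using tidyr, with hypothetical columns:

library(tidyr)

# Hypothetical export where blank cells meant "same as above"
d <- data.frame(site = c("A", NA, NA, "B", NA), value = 1:5)

# Make the implied values explicit by filling downward
d <- fill(d, site, .direction = "down")

# When reading data, declare your missing-value code explicitly:
# d <- read.csv("data.csv", na.strings = "NA")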

2.1.5 Write Dates as YYYY-MM-DD

Use ISO 8601 format for all dates: 2024-03-15. This format:

  • Sorts correctly alphabetically
  • Is unambiguous across locales (03/04/2024 means different things in the US vs. Europe)
  • Is not mangled by Excel (unlike many other date formats)
Warning: Excel silently converts many text strings to dates (e.g., the human gene name SEPT1 becomes a date). Consider storing year, month, and day in separate columns if you’re collecting data in Excel, or prefix dates with an apostrophe to force text storage.
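
If you do store year, month, and day in separate columns, reassembling ISO 8601 dates in code is straightforward. A minimal sketch in base R (column names hypothetical):

# Hypothetical year/month/day columns from an Excel sheet
d <- data.frame(year = 2024, month = 3, day = c(5, 15))

# Reassemble as ISO 8601 dates; zero-padding keeps sorting correct
d$date <- as.Date(sprintf("%04d-%02d-%02d", d$year, d$month, d$day))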

2.1.6 Choose Good Names for Things

Apply consistent naming rules to columns, sheets, and files:

  • No spaces—use underscores (_) or hyphens (-)
  • No special characters (@, #, %, (, ))
  • Short but meaningful (participant_id, not p or the_participant_identifier_number)
  • All lowercase is a safe default
  • Avoid starting names with numbers
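
If column names arrive with spaces or special characters anyway, they can be normalized in code. A sketch using the janitor package (the messy names are hypothetical):

# Hypothetical messy column names straight from a spreadsheet
d <- data.frame(`Participant ID` = 1:2,
                `Enrollment Date` = c("2024-03-15", "2024-03-16"),
                check.names = FALSE)

# Convert to lowercase snake_case: participant_id, enrollment_date
d <- janitor::clean_names(d)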

2.1.7 No Calculations in Raw Data Files

Keep raw data raw. Don’t add sum rows, averages, or derived columns to the data file—do that in your analysis script. The raw data file is a permanent record of what was collected; calculations belong in code that can be inspected, corrected, and rerun.

2.1.8 Don’t Use Formatting as Data

Color-coding, bold text, or cell highlighting to indicate something (e.g., red = outlier, bold = reviewed) is invisible to code. Add an explicit column instead: is_outlier (TRUE/FALSE) or review_status (“reviewed”, “pending”).
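
The explicit-column fix is trivial to script. A minimal sketch in base R (the threshold and names are hypothetical):

# Hypothetical measurements; values over 200 were highlighted red in Excel
d <- data.frame(value = c(12, 250, 18))

# Replace the visual cue with an explicit, machine-readable flag
d$is_outlier <- d$value > 200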

2.1.9 Create a Data Dictionary

Every dataset should be accompanied by a data dictionary (also called a codebook) that documents:

  • Variable name
  • Description of what it measures
  • Units
  • Allowable values or range
  • How missing values are coded

A simple spreadsheet or CSV with one row per variable works fine.
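
For example, a data dictionary stored as a CSV might look like this (entries hypothetical):

variable,description,units,allowed_values,missing_code
participant_id,Unique participant identifier,,P001-P999,NA
sex,Sex assigned at birth,,"M, F",NA
bp_systolic,Systolic blood pressure,mmHg,70-250,NA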

2.1.10 Save Data in Plain Text

Save the analysis-ready version of your data as CSV (comma-separated values), not as .xlsx. CSV files:

  • Can be opened by any software
  • Are readable in version control diffs
  • Don’t have hidden formatting or macros
  • Are stable long-term

Keep the original Excel file if needed, but always export a clean CSV as the file you hand off to analysis.
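
The export step can itself be scripted so it is repeatable. A sketch in R, assuming the readxl and readr packages and hypothetical file names:

# Read the original Excel file (kept as the permanent record)
d <- readxl::read_excel("data/raw/enrollment.xlsx", sheet = 1)

# Hand off a clean CSV for analysis
readr::write_csv(d, "data/processed/enrollment.csv")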

2.1.11 Make Backups

Raw data should be treated as read-only and backed up in at least two places (e.g., local drive + cloud storage). Consider these tiers:

  • Primary: Your working copy
  • Local backup: External drive or NAS
  • Off-site/cloud: Institutional storage, OneDrive, or similar

Version control (see Chapter 1) is excellent for code and small text files but is not ideal for large binary data files.

2.2 File Naming

Jenny Bryan’s “How to Name Files” (Bryan 2015) lays out three principles that make file names work well for both humans and computers. See also her talk from NormConf 2022.

2.2.1 Three Principles

1. Machine readable

File names should be parseable by code without heroics:

  • No spaces (use _ or - instead)
  • No special characters (&, *, #, (, ), accented letters)
  • No case sensitivity issues—stick to lowercase
  • Use delimiters consistently so you can split on them: underscores between metadata fields, hyphens within a field
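
When names follow this convention, the metadata can be recovered by splitting on the delimiters. A minimal sketch in base R (the file names are hypothetical):

# Hypothetical names following a date_assay_sample.csv convention
files <- c("2024-03-15_pcr_sample-01.csv", "2024-03-16_pcr_sample-02.csv")

# Underscores separate metadata fields; hyphens stay within a field
meta <- do.call(rbind, strsplit(tools::file_path_sans_ext(files), "_"))
colnames(meta) <- c("date", "assay", "sample")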

2. Human readable

The name should tell you what’s in the file without opening it:

  • Include a descriptive “slug” that summarizes the content
  • 2024-03-15-enrollment-summary.csv is better than data_v3_FINAL.csv

3. Plays well with default ordering

Files should sort into a useful order automatically:

  • Use ISO 8601 dates (YYYY-MM-DD) at the start of the name so files sort chronologically
  • Left-pad numbers with zeros: fig-01.png, fig-02.png, …, fig-10.png (not fig-1.png, fig-2.png, fig-10.png)
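
Zero-padding is easy to automate. A one-line sketch in R:

# Generate zero-padded figure names that sort correctly
sprintf("fig-%02d.png", c(1, 2, 10))
#> [1] "fig-01.png" "fig-02.png" "fig-10.png"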

2.3 Project Directory Structure

2.3.1 Simple project-oriented workflow

A consistent folder structure makes projects navigable by collaborators (and your future self). One convention is shown below:

project-name/
├── data/
│   ├── .gitignore    # Don't commit data to version control.
│   ├── raw/          # original data -- never edited directly
│   └── processed/    # cleaned, analysis-ready files
├── R/                # scripts and functions
├── output/
│   ├── .gitignore    # Don't commit large output files.
│   ├── figures/
│   └── tables/
└── report.qmd

Key rules:

  • Raw data is sacred. The data/raw/ folder is read-only. Never overwrite or edit raw data files. Also, avoid committing large data files to version control; be deliberate about your .gitignore (see the example after this list).
  • All data manipulation is scripted. Cleaned data in data/processed/ is generated by a script, not by hand.
  • One project = one directory. Keep all files related to a project together.
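
One common pattern for the data/ .gitignore (an assumption here, not the only option) is to ignore everything in the folder except the .gitignore itself, so the folder stays in the repository but the data never does:

# data/.gitignore: ignore everything in this folder...
*
# ...except this file, so the folder itself stays in the repository
!.gitignore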

The ProjectTemplate (R) and Cookiecutter Data Science (Python) tools can scaffold these structures automatically.

2.3.2 Version controlling code with data on a shared drive

It’s not uncommon in public health settings to be version controlling code with collaborators as described in Chapter 1 while the data lives on a secure shared drive that cannot be moved or copied into the repository. This creates a tension between reproducibility and security that can be resolved with careful project organization and path management.

Imagine that you have a distributed research team working with sensitive data that:

  • Lives on a secure shared drive that cannot be moved
  • Is organized in nested folders, for example, by state and year (e.g., Lab/StateA/2023/, Lab/StateB/2024/, etc.)
  • Requires collaborative code development across Windows, Mac, and Linux machines
  • Needs version control for reproducibility and collaboration

Storing data in a database would be preferable to a shared drive, but many public health agencies don’t have the resources or infrastructure to set up and maintain databases. So how can we manage this?

2.3.2.1 The Conflict

Traditional approaches create impossible trade-offs:

❌ Option 1: One code repository without proper path management

  • Forces hardcoded paths: setwd("Z:/Lab/StateA/2024")
  • Breaks on different operating systems (Z:/ vs /Volumes/)
  • Breaks when different team members have different drive mappings
  • Violates project-oriented workflow principles
  • Code is not reproducible

❌ Option 2: Code repositories inside each data folder

  • Creates dozens of scattered git repositories
  • Makes code reuse nearly impossible
  • Version control nightmare (which repo has what function?)
  • Accidental data commits to version control
  • Multiple people doing git operations in the same shared folder causes conflicts. Not merge conflicts, but file locking and permission issues that can break everyone’s workflow.

❌ Option 3: Move data into code repository

  • Violates data security requirements
  • Data cannot leave the secure shared drive
  • Makes git repository enormous and slow
  • Not feasible

2.3.2.2 A Better Way: Centralized Code Repository with Configurable Paths

✅ One way to manage this is with a single centralized code repository with configurable paths:

  • One repository for all analysis code (easy to find, maintain, collaborate)
  • Reproducible across team members with different operating systems
  • Secure: data never leaves the shared drive or enters version control
  • Flexible: works with nested folder structures
  • Collaborative: proper git workflow without conflicts
  • Project-oriented: no setwd(); paths are configurable, not hardcoded

See stephenturner/demo-project-path-config for an example of one way to do this using configuration files. This can also be achieved with environment variables. How it works (see the sketch after this list):

  • Code lives in a git repository on each person’s local machine
  • Each team member has a personal config.yml file (not version controlled) with their specific paths
  • Code uses helper functions to construct paths dynamically
  • Data stays on the shared drive (read-only for most operations)
  • Outputs go to each person’s local directories
  • Everyone commits code changes, never data
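
A minimal sketch of this pattern using the config package in R; the paths and helper below are hypothetical, and the linked repository may differ in detail. Each person’s config.yml (listed in .gitignore) might contain:

default:
  data_root: "Z:/Lab"                  # this user's drive mapping
  output_dir: "C:/Users/me/outputs"    # this user's local output folder

The analysis code then builds paths from the config instead of hardcoding them:

# Read the active configuration from config.yml (requires the config package)
cfg <- config::get()

# Hypothetical helper: build paths into the nested state/year folders
data_path <- function(state, year, ...) {
  file.path(cfg$data_root, state, year, ...)
}

# Read from the shared drive; write locally; commit only code
d <- read.csv(data_path("StateA", "2023", "cases.csv"))
write.csv(d, file.path(cfg$output_dir, "summary.csv"), row.names = FALSE)

# Alternative: environment variables set in a per-user .Renviron
# data_root <- Sys.getenv("DATA_ROOT")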

2.4 File Formats

Format      Best for                              Avoid when
.csv        Tabular data, analysis input/output   Binary data, very wide datasets
.tsv        Tabular data with commas in values
.json       Nested/hierarchical data, configs     Large flat tables
.parquet    Large tabular data, columnar access   Simple small datasets
.xlsx       Data entry, sharing with non-coders   Final analysis-ready data

Plain text formats (CSV, TSV, JSON) are preferred for long-term storage and reproducibility. Proprietary binary formats (.xlsx, .sav, .sas7bdat) may become unreadable as software evolves.
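
As a sketch of writing the same small table in several of these formats in R (assuming the readr, jsonlite, and arrow packages are installed; file names hypothetical):

d <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

readr::write_csv(d, "d.csv")          # plain text, diff-friendly
readr::write_tsv(d, "d.tsv")          # tabs, safer when values contain commas
jsonlite::write_json(d, "d.json")     # nested/hierarchical data
arrow::write_parquet(d, "d.parquet")  # large data, fast columnar access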

2.5 Additional Best Practices

Use version control for code, not data. Git is designed for text files. Large or binary data files should live in dedicated storage (cloud drives, data repositories, databases) and be referenced in your code, not committed to the repository. Use a good .gitignore.

Document your cleaning steps. The gap between raw and processed data should be bridged entirely by a script that anyone can run. Comments in that script explaining why you made certain decisions are invaluable.
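
A skeleton of such a script (file and column names hypothetical) might look like:

# R/clean-data.R -- regenerates data/processed/ from data/raw/
# Run from the project root; raw files are never modified.

raw <- read.csv("data/raw/enrollment.csv")

# Why: pilot-phase test records are not part of the study sample
clean <- subset(raw, !grepl("^TEST", participant_id))

write.csv(clean, "data/processed/enrollment.csv", row.names = FALSE)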

Plan for sharing. Before a project ends, ask: could someone else reproduce this analysis from the raw data? Good file organization, consistent naming, and a clear README make the answer yes.

Avoid “final” in file names. analysis-final-FINAL-v3-USE-THIS-ONE.R is a sign that version control was not used. Use Git instead, and name the file simply analysis.R.