2  Data Organization

Good data organization is foundational to reproducible, efficient research. Decisions made at the moment of data collection and storage ripple through every downstream step—analysis, visualization, sharing, and archiving. This chapter covers practical principles for organizing data in spreadsheets, naming files consistently, structuring project directories, and storing data in formats that work well with code.

2.1 Data Organization in Spreadsheets

Broman and Woo’s “Data Organization in Spreadsheets” (Broman and Woo 2018) offers twelve practical recommendations for anyone who stores data in Excel or similar tools. The core philosophy: structure your data so that it can be read directly into R or Python without manual cleanup.

2.1.1 Be Consistent

Consistency is the single most important principle. Pick conventions and stick to them within a project:

  • Use the same codes for categorical variables (e.g., always "M" and "F", not sometimes "Male")
  • Use the same missing value code throughout (e.g., NA)
  • Use the same variable names across files and time points
  • Use the same date format everywhere
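
When inconsistency has already crept in, recoding it away is a common first cleanup step. A minimal sketch in R with dplyr (the column name and codes are hypothetical):

library(dplyr)

# Hypothetical data with inconsistent sex codes
raw <- data.frame(sex = c("M", "Male", "F", "female", "F"))

# Collapse everything to the "M"/"F" convention
clean <- raw |>
  mutate(sex = case_when(
    sex %in% c("M", "Male", "male") ~ "M",
    sex %in% c("F", "Female", "female") ~ "F",
    TRUE ~ NA_character_
  ))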

2.1.2 Make It a Rectangle

Raw data should be a single, clean rectangle:

  • Rows are observations (subjects, samples, time points)
  • Columns are variables
  • One header row at the top with short, meaningful column names
  • No merged cells, no blank rows used for visual separation, no summary rows mixed in

This is consistent with the principles of tidy data (Wickham 2014), where each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

2.1.3 One Thing Per Cell

Each cell should contain exactly one value. Don’t cram multiple pieces of information into a single cell (e.g., "120/80" for blood pressure—use two columns instead). Don’t use a cell to store a note alongside a value.
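
If combined values do arrive, splitting them is straightforward in code. A minimal sketch using tidyr, with hypothetical column names:

library(tidyr)

# Hypothetical combined blood pressure readings
bp <- data.frame(blood_pressure = c("120/80", "135/90"))

# Split into one value per cell: bp_systolic and bp_diastolic
bp <- separate(bp, blood_pressure,
               into = c("bp_systolic", "bp_diastolic"),
               sep = "/", convert = TRUE)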

2.1.4 Fill in All Cells

Don’t leave cells blank to imply “same as above.” Every cell should contain an explicit value. Use a consistent missing data code (e.g., NA) rather than leaving cells empty—blank cells are ambiguous: was the value missing, not collected, or zero?
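
If you inherit a spreadsheet where blanks meant “same as above,” the repair can at least be scripted. A sketch using tidyr, with hypothetical columns:

library(tidyr)

# Hypothetical export where blank cells meant "same as above"
d <- data.frame(site = c("A", NA, NA, "B", NA), value = 1:5)

# Make the implied values explicit by filling downward
d <- fill(d, site, .direction = "down")

# When reading data, declare your missing-value code explicitly:
# d <- read.csv("data.csv", na.strings = "NA")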

2.1.5 Write Dates as YYYY-MM-DD

Use ISO 8601 format for all dates: 2024-03-15. This format:

  • Sorts correctly alphabetically
  • Is unambiguous across locales (03/04/2024 means different things in the US vs. Europe)
  • Is not mangled by Excel (unlike many other date formats)
Warning: Excel silently converts many text strings to dates (e.g., the human gene name SEPT1 becomes a date). Consider storing year, month, and day in separate columns if you’re collecting data in Excel, or prefix dates with an apostrophe to force text storage.
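
If you do store year, month, and day in separate columns, reassembling ISO 8601 dates in code is straightforward. A minimal sketch in base R (column names hypothetical):

# Hypothetical year/month/day columns from an Excel sheet
d <- data.frame(year = 2024, month = 3, day = c(5, 15))

# Reassemble as ISO 8601 dates; zero-padding keeps sorting correct
d$date <- as.Date(sprintf("%04d-%02d-%02d", d$year, d$month, d$day))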

2.1.6 Choose Good Names for Things

Apply consistent naming rules to columns, sheets, and files:

  • No spaces—use underscores (_) or hyphens (-)
  • No special characters (@, #, %, (, ))
  • Short but meaningful (participant_id, not p or the_participant_identifier_number)
  • All lowercase is a safe default
  • Avoid starting names with numbers
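
If column names arrive with spaces or special characters anyway, they can be normalized in code. A sketch using the janitor package (the messy names are hypothetical):

# Hypothetical messy column names straight from a spreadsheet
d <- data.frame(`Participant ID` = 1:2,
                `Enrollment Date` = c("2024-03-15", "2024-03-16"),
                check.names = FALSE)

# Convert to lowercase snake_case: participant_id, enrollment_date
d <- janitor::clean_names(d)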

2.1.7 No Calculations in Raw Data Files

Keep raw data raw. Don’t add sum rows, averages, or derived columns to the data file—do that in your analysis script. The raw data file is a permanent record of what was collected; calculations belong in code that can be inspected, corrected, and rerun.

2.1.8 Don’t Use Formatting as Data

Color-coding, bold text, or cell highlighting to indicate something (e.g., red = outlier, bold = reviewed) is invisible to code. Add an explicit column instead: is_outlier (TRUE/FALSE) or review_status (“reviewed”, “pending”).
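
The explicit-column fix is trivial to script. A minimal sketch in base R (the threshold and names are hypothetical):

# Hypothetical measurements; values over 200 were highlighted red in Excel
d <- data.frame(value = c(12, 250, 18))

# Replace the visual cue with an explicit, machine-readable flag
d$is_outlier <- d$value > 200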

2.1.9 Create a Data Dictionary

Every dataset should be accompanied by a data dictionary (also called a codebook) that documents:

  • Variable name
  • Description of what it measures
  • Units
  • Allowable values or range
  • How missing values are coded

A simple spreadsheet or CSV with one row per variable works fine.
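
For example, a data dictionary stored as a CSV might look like this (entries hypothetical):

variable,description,units,allowed_values,missing_code
participant_id,Unique participant identifier,,P001-P999,NA
sex,Sex assigned at birth,,"M, F",NA
bp_systolic,Systolic blood pressure,mmHg,70-250,NA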

2.1.10 Save Data in Plain Text

Save the analysis-ready version of your data as CSV (comma-separated values), not as .xlsx. CSV files:

  • Can be opened by any software
  • Are readable in version control diffs
  • Don’t have hidden formatting or macros
  • Are stable long-term

Keep the original Excel file if needed, but always export a clean CSV as the file you hand off to analysis.
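
The export step can itself be scripted so it is repeatable. A sketch in R, assuming the readxl and readr packages and hypothetical file names:

# Read the original Excel file (kept as the permanent record)
d <- readxl::read_excel("data/raw/enrollment.xlsx", sheet = 1)

# Hand off a clean CSV for analysis
readr::write_csv(d, "data/processed/enrollment.csv")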

2.1.11 Make Backups

Raw data should be treated as read-only and backed up in at least two places (e.g., local drive + cloud storage). Consider these tiers:

  • Primary: Your working copy
  • Local backup: External drive or NAS
  • Off-site/cloud: Institutional storage, OneDrive, or similar

Version control (see Chapter 1) is excellent for code and small text files but is not ideal for large binary data files.

2.2 File Naming

Jenny Bryan’s “How to Name Files” (Bryan 2015) lays out three principles that make file names work well for both humans and computers. See also her talk from NormConf 2022.

2.2.1 Three Principles

1. Machine readable

File names should be parseable by code without heroics:

  • No spaces (use _ or - instead)
  • No special characters (&, *, #, (, ), accented letters)
  • No case sensitivity issues—stick to lowercase
  • Use delimiters consistently so you can split on them: underscores between metadata fields, hyphens within a field
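
When names follow this convention, the metadata can be recovered by splitting on the delimiters. A minimal sketch in base R (the file names are hypothetical):

# Hypothetical names following a date_assay_sample.csv convention
files <- c("2024-03-15_pcr_sample-01.csv", "2024-03-16_pcr_sample-02.csv")

# Underscores separate metadata fields; hyphens stay within a field
meta <- do.call(rbind, strsplit(tools::file_path_sans_ext(files), "_"))
colnames(meta) <- c("date", "assay", "sample")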

2. Human readable

The name should tell you what’s in the file without opening it:

  • Include a descriptive “slug” that summarizes the content
  • 2024-03-15-enrollment-summary.csv is better than data_v3_FINAL.csv

3. Plays well with default ordering

Files should sort into a useful order automatically:

  • Use ISO 8601 dates (YYYY-MM-DD) at the start of the name so files sort chronologically
  • Left-pad numbers with zeros: fig-01.png, fig-02.png, …, fig-10.png (not fig-1.png, fig-2.png, fig-10.png)
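
Zero-padding is easy to automate. A one-line sketch in R:

# Generate zero-padded figure names that sort correctly
sprintf("fig-%02d.png", c(1, 2, 10))
#> [1] "fig-01.png" "fig-02.png" "fig-10.png"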

2.3 Project Directory Structure

2.3.1 Simple project-oriented workflow

A consistent folder structure makes projects navigable by collaborators (and your future self). One convention is shown below:

project-name/
├── data/
│   ├── .gitignore    # Don't commit data to version control.
│   ├── raw/          # original data -- never edited directly
│   └── processed/    # cleaned, analysis-ready files
├── R/                # scripts and functions
├── output/
│   ├── .gitignore    # Don't commit large output files.
│   ├── figures/
│   └── tables/
└── report.qmd

Key rules:

  • Raw data is sacred. The data/raw/ folder is read-only. Never overwrite or edit raw data files. Also, avoid committing large data files to version control; be deliberate about your .gitignore (see the example after this list).
  • All data manipulation is scripted. Cleaned data in data/processed/ is generated by a script, not by hand.
  • One project = one directory. Keep all files related to a project together.
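
One common pattern for the data/ .gitignore (an assumption here, not the only option) is to ignore everything in the folder except the .gitignore itself, so the folder stays in the repository but the data never does:

# data/.gitignore: ignore everything in this folder...
*
# ...except this file, so the folder itself stays in the repository
!.gitignore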

The ProjectTemplate (R) and Cookiecutter Data Science (Python) tools can scaffold these structures automatically.

2.3.2 Version controlling code with data on a shared drive

It’s not uncommon in public health settings to be version controlling code with collaborators as described in Chapter 1 while the data lives on a secure shared drive that cannot be moved or copied into the repository. This creates a tension between reproducibility and security that can be resolved with careful project organization and path management.

Imagine that you have a distributed research team working with sensitive data that:

  • Lives on a secure shared drive that cannot be moved
  • Is organized in nested folders, for example, by state and year (e.g., Lab/StateA/2023/, Lab/StateB/2024/, etc.)
  • Requires collaborative code development across Windows, Mac, and Linux machines
  • Needs version control for reproducibility and collaboration

Storing data in a database would be preferable to a shared drive, but many public health agencies don’t have the resources or infrastructure to set up and maintain databases. So how can we manage this?

2.3.2.1 The Conflict

Traditional approaches create impossible trade-offs:

❌ Option 1: One code repository without proper path management

  • Forces hardcoded paths: setwd("Z:/Lab/StateA/2024")
  • Breaks on different operating systems (Z:/ vs /Volumes/)
  • Breaks when different team members have different drive mappings
  • Violates project-oriented workflow principles
  • Code is not reproducible

❌ Option 2: Code repositories inside each data folder

  • Creates dozens of scattered git repositories
  • Makes code reuse nearly impossible
  • Version control nightmare (which repo has what function?)
  • Accidental data commits to version control
  • Multiple people doing git operations in the same shared folder causes conflicts. Not merge conflicts, but file locking and permission issues that can break everyone’s workflow.

❌ Option 3: Move data into code repository

  • Violates data security requirements
  • Data cannot leave the secure shared drive
  • Makes git repository enormous and slow
  • Not feasible

2.3.2.2 A Better Way: Centralized Code Repository with Configurable Paths

✅ One way to manage this is with a single centralized code repository with configurable paths:

  • One repository for all analysis code (easy to find, maintain, collaborate)
  • Reproducible across team members with different operating systems
  • Secure: data never leaves the shared drive or enters version control
  • Flexible: works with nested folder structures
  • Collaborative: proper git workflow without conflicts
  • Project-oriented: no setwd(); paths are configurable, not hardcoded

See stephenturner/demo-project-path-config for an example of one way to do this using configuration files. This can also be achieved with environment variables. How it works (see the sketch after this list):

  • Code lives in a git repository on each person’s local machine
  • Each team member has a personal config.yml file (not version controlled) with their specific paths
  • Code uses helper functions to construct paths dynamically
  • Data stays on the shared drive (read-only for most operations)
  • Outputs go to each person’s local directories
  • Everyone commits code changes, never data
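
A minimal sketch of this pattern using the config package in R; the paths and helper below are hypothetical, and the linked repository may differ in detail. Each person’s config.yml (listed in .gitignore) might contain:

default:
  data_root: "Z:/Lab"                  # this user's drive mapping
  output_dir: "C:/Users/me/outputs"    # this user's local output folder

The analysis code then builds paths from the config instead of hardcoding them:

# Read the active configuration from config.yml (requires the config package)
cfg <- config::get()

# Hypothetical helper: build paths into the nested state/year folders
data_path <- function(state, year, ...) {
  file.path(cfg$data_root, state, year, ...)
}

# Read from the shared drive; write locally; commit only code
d <- read.csv(data_path("StateA", "2023", "cases.csv"))
write.csv(d, file.path(cfg$output_dir, "summary.csv"), row.names = FALSE)

# Alternative: environment variables set in a per-user .Renviron
# data_root <- Sys.getenv("DATA_ROOT")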

2.4 File Formats

Format      Best for                              Avoid when
.csv        Tabular data, analysis input/output   Binary data, very wide datasets
.tsv        Tabular data with commas in values
.json       Nested/hierarchical data, configs     Large flat tables
.parquet    Large tabular data, columnar access   Simple small datasets
.xlsx       Data entry, sharing with non-coders   Final analysis-ready data

Plain text formats (CSV, TSV, JSON) are preferred for long-term storage and reproducibility. Proprietary binary formats (.xlsx, .sav, .sas7bdat) may become unreadable as software evolves.
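
As a sketch of writing the same small table in several of these formats in R (assuming the readr, jsonlite, and arrow packages are installed; file names hypothetical):

d <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

readr::write_csv(d, "d.csv")          # plain text, diff-friendly
readr::write_tsv(d, "d.tsv")          # tabs, safer when values contain commas
jsonlite::write_json(d, "d.json")     # nested/hierarchical data
arrow::write_parquet(d, "d.parquet")  # large data, fast columnar access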

2.5 Additional Best Practices

Use version control for code, not data. Git is designed for text files. Large or binary data files should live in dedicated storage (cloud drives, data repositories, databases) and be referenced in your code, not committed to the repository. Use a good .gitignore.

Document your cleaning steps. The gap between raw and processed data should be bridged entirely by a script that anyone can run. Comments in that script explaining why you made certain decisions are invaluable.
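
A skeleton of such a script (file and column names hypothetical) might look like:

# R/clean-data.R -- regenerates data/processed/ from data/raw/
# Run from the project root; raw files are never modified.

raw <- read.csv("data/raw/enrollment.csv")

# Why: pilot-phase test records are not part of the study sample
clean <- subset(raw, !grepl("^TEST", participant_id))

write.csv(clean, "data/processed/enrollment.csv", row.names = FALSE)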

Plan for sharing. Before a project ends, ask: could someone else reproduce this analysis from the raw data? Good file organization, consistent naming, and a clear README make the answer yes.

Avoid “final” in file names. analysis-final-FINAL-v3-USE-THIS-ONE.R is a sign that version control was not used. Use Git instead, and name the file simply analysis.R.