5  Project Management

Data science projects in public health fail more often from unclear scope, misaligned expectations, and coordination breakdowns than from analytical errors. A rigorous analysis of the wrong question, delivered six months late to an audience that can’t act on it, is a failed project regardless of its technical quality. This chapter covers the non-technical practices that help projects succeed: defining scope before work begins, collaborating as a team, working with IT and data governance partners, and communicating findings to people who will actually use them.

For a broader treatment of managing data science work, Roger Peng, Brian Caffo, and Jeff Leek’s Executive Data Science (Peng, Caffo, and Leek 2015) is a concise, practical resource for both team leads and the analysts working alongside them.

5.1 Defining the Project

5.1.1 Start with the question

The most important work in a data science project often happens before any data is touched. Peng and Matsui’s The Art of Data Science (Peng and Matsui 2015) frames the entire analysis process around one deceptively simple task: state the question precisely. Vague questions produce vague analyses. “Can you look at overdose trends?” is not a question that will lead anywhere useful. “Has the age distribution of overdose deaths in our state shifted since 2019, and does the pattern differ by rural versus urban counties?” is a question that can be answered.

Before agreeing to take on a project, work with the requestor to articulate:

  • What specific question are we answering?
  • Who will use the findings, and what decision or action do they enable?
  • What would a satisfying answer look like – a number, a chart, a report, a recommendation?
  • What does success look like, and how will we know when we’ve reached it?

Getting these questions answered upfront is the difference between a project that delivers value and one that drifts.

5.1.2 Project charters

For anything beyond a quick turnaround request, a brief project charter documents the shared understanding between the data science team and its stakeholders before work begins. A charter doesn’t need to be a formal document – a shared notes page or a few paragraphs in a project tracking tool is sufficient. What matters is that the key parties have read it and agreed to it before anyone opens a dataset.

A useful project charter for public health data science covers:

  • The question: stated precisely, as described above.
  • Data sources and access: what data will be used, where it lives, and what approvals are needed (data use agreements, IRB review, IT provisioning).
  • Deliverables: what will actually be produced and in what format (report, dashboard, dataset, presentation slides).
  • Timeline: key milestones and a realistic completion date, with explicit acknowledgment of dependencies.
  • Stakeholders: who the primary requestor is, who the audience is, who has final say on scope.
  • Out of scope: what the project will explicitly not address; this is often as important as what it will.

The charter is also the right place to surface risks early. If the analysis depends on data that hasn’t been collected yet, or on linking two systems that have never been joined, that should be visible before commitments are made.

5.2 Managing Scope

Scope creep – the gradual expansion of a project beyond its original definition – is endemic to data science work and is usually well-intentioned. Stakeholders see early results and ask “can you also look at…?”, and analysts, wanting to be helpful, say yes. Over time, a focused project becomes an unfocused one.

Make scope changes explicit. When a new request comes in mid-project, treat it as a scope change. Acknowledge it, discuss whether it belongs in the current project or a future one, and adjust the timeline if it’s added. The charter is the reference point for these conversations.

Protect the last mile. The gap between “the analysis is done” and “the findings are in front of the people who need them” is consistently larger than anticipated. Time for writing, review, revision, and presentation needs to be built into project plans – not squeezed in at the end.

Know when to say “not yet.” Not every analytic question can be answered well with available data, methods, or time. It is better to scope a project to what can be answered defensibly than to produce a shaky analysis of a broader question. A clear, well-supported answer to a narrow question is more useful to a program than a hedged non-answer to a large one.

5.3 Collaborating as a Team

5.3.1 Version control

The foundation of collaborative data science work is version control. Chapter 1 covers Git and GitHub in depth – the mechanics of branches, pull requests, and conflict resolution. The project management implication is straightforward: a shared repository is the canonical record of a project’s code and analysis. Work that lives only on one person’s laptop is not team work; it is a liability.

For collaborative projects, adopt a simple branching convention (see Section 1.2) where changes are reviewed before being merged into the main branch. Even lightweight review (e.g. a colleague reading a pull request before it’s merged) catches errors, shares knowledge, and prevents the “only one person understands this code” problem that becomes acute when someone leaves.
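As a concrete sketch, the workflow might look like the following at the command line. The repository, file names, and branch name are hypothetical, and the final merge stands in for what would be a reviewed pull request on GitHub:

```shell
# Hypothetical repository, created under /tmp for illustration
rm -rf /tmp/demo-project && mkdir -p /tmp/demo-project && cd /tmp/demo-project
git init -q -b main
git config user.name "Analyst"              # local config so commits work anywhere
git config user.email "analyst@example.org"
echo 'x <- 1' > analysis.R
git add analysis.R && git commit -q -m "Add initial analysis script"

# One short-lived branch per change
git checkout -q -b update-figures
echo 'x <- 2' > analysis.R
git add analysis.R && git commit -q -m "Update figures"

# On GitHub this step would be a pull request; after review, merge into main
git checkout -q main
git merge -q --no-ff update-figures -m "Merge reviewed branch update-figures"
```

The `--no-ff` merge keeps a visible record that the change arrived as a reviewed unit rather than as loose commits on main.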

5.3.2 Reproducibility as a management goal

A reproducible analysis is one that can be re-run by a different person, on a different machine, at a future date, and produce the same result. In public health, this matters for accountability (can we explain exactly how this number was produced?), for updating results when new data arrives, and for knowledge transfer when team composition changes.

Reproducibility is a project management commitment. Building it in from the start costs far less than retrofitting it after a number is questioned.
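One lightweight way to make that commitment concrete is a single entry point that rebuilds every result from the raw data. The sketch below writes such a script and syntax-checks it; the script names, pipeline stages, and use of Quarto are illustrative assumptions, not a prescribed layout:

```shell
# Sketch of a single entry point that rebuilds every result from raw data.
# Stage names and tools (Rscript, Quarto) are assumptions for illustration.
cat > /tmp/run.sh <<'EOF'
#!/bin/sh
set -e                    # stop at the first failure rather than continuing
Rscript R/01_clean.R      # data/raw/       -> data/processed/
Rscript R/02_analyze.R    # data/processed/ -> output/tables/
Rscript R/03_figures.R    # data/processed/ -> output/figures/
quarto render report.qmd  # report assembled from the outputs above
EOF
sh -n /tmp/run.sh         # syntax check only; running it requires R and Quarto
```

If a new team member can clone the repository and run one script to regenerate every number in the report, the knowledge-transfer problem largely solves itself.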

5.3.3 Project structure

Consistent project organization makes it easier for team members to find things, onboard to a project quickly, and hand off work. A minimal structure that works well for most public health data science projects is shown below, but Chapter 2 covers project structure and data organization in more detail.

project-name/
├── data/
│   ├── .gitignore    # Don't commit data to version control.
│   ├── raw/          # original data -- never edited directly
│   └── processed/    # cleaned, analysis-ready files
├── R/                # scripts and functions
├── output/
│   ├── .gitignore    # Don't commit large output files.
│   ├── figures/
│   └── tables/
└── report.qmd

The most important convention: raw data is read-only. All transformations happen in code, so there is always a reproducible path from the original files to any result.
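The layout above can be scaffolded in a few commands. This is a sketch, run under a hypothetical `/tmp` path; the `.gitignore` pattern and the final `chmod` (which enforces the read-only convention at the filesystem level, once raw files are in place) are suggestions rather than requirements:

```shell
# Create the layout shown above (run once, at project start)
cd /tmp && rm -rf project-name
mkdir -p project-name/data/raw project-name/data/processed \
         project-name/R project-name/output/figures project-name/output/tables

# Keep data and large outputs out of version control:
# ignore everything in these folders except the .gitignore itself
printf '*\n!.gitignore\n' > project-name/data/.gitignore
printf '*\n!.gitignore\n' > project-name/output/.gitignore

# Placeholder for the Quarto report in the tree above
touch project-name/report.qmd

# Optional: once raw files are in place, remove write permission so
# they cannot be edited by accident -- all changes must go through code
chmod -R a-w project-name/data/raw
```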

5.4 Working with Your IT Team

IT is a necessary partner in any institutional public health setting. Server access, database credentials, software installation, network permissions, and data transfer approvals all flow through IT. Projects that treat IT as an afterthought routinely stall at the moment they are otherwise ready to move.

5.4.1 Start early

IT requests take time, and often more time than anticipated. Getting new software approved and installed on a shared server may take days or weeks. Database access for a sensitive dataset may require formal requests, security reviews, and management sign-off at multiple levels. If the analysis depends on accessing a new data source or running on infrastructure you haven’t used before, initiate those conversations at project kickoff.

5.4.2 Be specific about what you need

A clear, specific request is far easier to fulfill than a vague one. When approaching IT, be prepared to specify:

  • What software or tools are needed, including version numbers if relevant.
  • What data needs to be accessed, where it lives, and in what format.
  • What level of compute and storage the analysis requires.
  • Whether the work involves sensitive or regulated data (HIPAA, PII, restricted-use files) that requires special handling.
  • Any hard deadlines driven by reporting requirements or funding timelines.

Framing requests in terms of the program need (e.g., “this analysis supports the quarterly overdose surveillance report due to the state health officer in March”) helps IT understand the stakes and prioritize accordingly.

5.4.3 Secure data environments

Much of the data used in public health analytics is sensitive: vital records, case surveillance data, insurance claims, linked longitudinal datasets. Familiarize yourself with your organization’s data governance policies and the specific terms of any data use agreements governing the datasets you work with. Some analyses must be conducted within designated secure environments (managed servers, data enclaves, air-gapped systems) and outputs may need to pass a disclosure review before leaving.

These constraints shape what tools and workflows are actually feasible. It is better to understand them at project start than to build an analysis pipeline that cannot be used where the data actually lives.

5.5 Communicating Findings

A finding that is not communicated effectively is a finding that does not exist, for practical purposes. Public health data science ultimately serves program and policy decisions made by people who did not run the analysis and may not read a methods section.

5.5.1 Know your audience

Different stakeholders need different things:

  • Program staff doing day-to-day work need actionable findings at the right level of detail, often a summary with supporting data available for those who want to dig in.
  • Program directors and health officers need the bottom line and its implications, typically in a brief format (one page, one slide) with uncertainty communicated plainly.
  • Methodologically sophisticated collaborators (epidemiologists, biostatisticians, peer reviewers) need full methods, sensitivity analyses, and honest discussion of limitations.

One analysis rarely serves all three audiences from a single document. Plan to produce multiple outputs from the same underlying work.

5.5.2 Communicate uncertainty honestly

Public health decisions are made under uncertainty, and analysis should convey what is known, what is estimated, and with what confidence – not project false precision. An administrator told “overdose deaths increased by exactly 12.3%” who later learns the estimate carried substantial uncertainty will trust the team less the next time. One given “we estimate a 10–15% increase, with the uncertainty driven primarily by changes in reporting completeness” has something they can actually act on.

That said, excessive hedging is its own failure mode. When the data clearly shows something, say so clearly. Uncertainty does not mean ambiguity about everything.

5.5.3 Match the output format to the need

A written report, an interactive dashboard, a two-page brief, and a slide deck are all appropriate in different contexts. The analysis determines what can be said; the output format determines whether it is heard. See Chapter 4 for guidance on building dashboards for data that needs to be explored or updated regularly.

5.6 Further Reading

  • Executive Data Science (Peng, Caffo, and Leek 2015). A short, practical book on managing data science teams and working with executive stakeholders. Essential reading for team leads; useful context for analysts.
  • The Art of Data Science (Peng and Matsui 2015). Covers the full data analysis process from question formulation through communication. The epicycles-of-analysis framework is particularly useful for understanding why projects rarely proceed in a straight line, and why that is normal.