Appendix A — Additional Resources

This appendix points to external resources for topics that go deeper than what this book covers. The selections here are oriented toward public health practitioners building data science skills in R, with emphasis on free and openly available material.

A.1 Learning R

R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund) is the standard starting point for learning the tidyverse. It covers data import, transformation with dplyr, visualization with ggplot2, and the broader workflow of doing data analysis in R. The second edition is freely available online. Most DSTT participants will benefit from working through at least the first half.

What They Forgot to Teach You About R (Bryan and Hester) is not about R syntax — it is about working in R effectively: project-oriented workflows, path discipline, R startup files, and the habits that separate reproducible work from fragile one-off scripts. Short and practical.

The tidyverse style guide documents the naming, spacing, and code organization conventions used across the tidyverse. Following a consistent style makes code easier to read, review, and maintain. The styler package can apply many of these rules automatically.
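As a rough illustration of what automatic restyling looks like, the sketch below passes a deliberately messy snippet to styler's style_text(), which applies the tidyverse style by default. The snippet itself is made up for the example, and it assumes the styler package is installed.

```r
library(styler)

# A deliberately messy snippet (invented for illustration):
# "=" for assignment, no spaces, single quotes, everything on one line.
messy <- "x=c(1,2,3)\nif(mean(x)>1){print('big')}"

# style_text() returns the restyled code as a character vector,
# using the tidyverse style unless told otherwise.
tidy <- style_text(messy)
print(tidy)
```

For whole files or projects, style_file() and style_dir() apply the same rules in place, so it is worth committing your work to version control before running them.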

A.2 Going Deeper in R

Advanced R (Wickham) covers R’s object systems, functional programming, metaprogramming, and performance. It is aimed at people who already use R regularly and want to understand why things work the way they do, write more expressive code, and debug problems that go beyond “my package isn’t loading.”

R Packages (Wickham and Bryan) is the definitive guide to writing R packages. A package is the right container for analysis code that needs to be shared across projects or team members — it bundles functions, documentation, and tests in a standardized way that makes everything easier to reuse. Even teams that do not plan to publish to CRAN benefit from the structure packages impose. This book walks through the complete workflow using devtools and usethis.

Mastering Shiny (Wickham) covers building interactive web applications with Shiny. Shiny apps require a running R process (unlike Quarto dashboards) but support arbitrary interactivity — user-driven filtering, on-the-fly modeling, custom inputs. This book covers reactive programming, app architecture, and deployment.
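To make "user-driven filtering" concrete, here is a minimal sketch of a Shiny app, assuming the shiny package is installed. It uses the built-in mtcars data purely as a stand-in; the input and output IDs are arbitrary names chosen for the example.

```r
library(shiny)

# UI: one dropdown and one table that depends on it.
ui <- fluidPage(
  selectInput("cyl", "Cylinders", choices = sort(unique(mtcars$cyl))),
  tableOutput("summary")
)

server <- function(input, output, session) {
  # A reactive expression: re-evaluated whenever input$cyl changes.
  filtered <- reactive({
    mtcars[mtcars$cyl == as.numeric(input$cyl), ]
  })

  # The table re-renders whenever filtered() changes.
  output$summary <- renderTable({
    data.frame(n = nrow(filtered()), mean_mpg = mean(filtered()$mpg))
  })
}

# Build the app object; nothing is served until it is run.
app <- shinyApp(ui, server)

# In an interactive session, launch it with:
# runApp(app)
```

The key idea is that the server function never calls the filtering code directly. It declares dependencies, and Shiny decides what to recompute when an input changes.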

A.3 Visualization

ggplot2: Elegant Graphics for Data Analysis (Wickham, Navarro, and Pedersen) goes beyond the basics of geom_* functions to cover the grammar of graphics underlying ggplot2, how to extend it, and how to build custom themes and scales. Useful for teams that produce a lot of publication-quality figures and want consistent, polished output.

R Graphics Cookbook (Chang) is organized as a collection of recipes: a problem followed by a solution. Less conceptual than the ggplot2 book, more useful as a quick reference when you know what you want a plot to look like but not how to produce it.

A.4 Modeling and Statistics

Tidy Modeling with R (Kuhn and Silge) is the companion book to the tidymodels ecosystem, a collection of packages that bring tidyverse conventions to statistical modeling: splitting data, specifying model workflows, tuning hyperparameters, and evaluating performance. Tidymodels works with hundreds of model types through a consistent interface. The book and the tidymodels website are the primary resources for teams moving from exploratory analysis into predictive modeling.
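The steps named above (splitting, specifying a workflow, evaluating) can be sketched in a few lines. This is a toy example on the built-in mtcars data, assuming the tidymodels meta-package is installed; the choice of predictors and of a plain linear model is arbitrary and only there to show the shape of the interface.

```r
library(tidymodels)

set.seed(123)

# 1. Split the data into training and testing sets.
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test <- testing(split)

# 2. Specify the model independently of the data.
spec <- linear_reg() |>
  set_engine("lm")

# 3. Bundle preprocessing (here just a formula) and model into a workflow.
wf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(spec)

# 4. Fit on the training set only.
wf_fit <- fit(wf, data = train)

# 5. Predict on held-out data and compute a performance metric.
preds <- predict(wf_fit, new_data = test) |>
  bind_cols(test)

rmse(preds, truth = mpg, estimate = .pred)
```

Swapping in a different model type is mostly a matter of changing step 2; the rest of the pipeline stays the same, which is the consistency the book emphasizes.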

A.5 Public Health Data Science

The Epidemiologist R Handbook is a comprehensive, freely available reference manual written specifically for epidemiologists and public health practitioners using R. It covers data management, descriptive analyses, outbreak investigation, time series, spatial analysis, and reporting – all with worked examples using realistic public health data. For DSTT teams, this is the most directly applicable reference for day-to-day analytical work.

Building Reproducible Analytical Pipelines with R (Rodrigues) covers functional programming, package-based project structure, Docker, and Nix as a complete framework for building analyses that can be re-run reliably. It connects topics from the environments chapter (Chapter 11) and goes substantially deeper on each.

A.6 Databases from R

Most public health data lives in relational databases – SQL Server, PostgreSQL, Oracle, SQLite – not in CSV files on a shared drive. R can query these databases directly, returning results as data frames without any manual export step.

Posit’s database guide is the best starting point. It covers the overall architecture (DBI as the connection layer, specific driver packages per database type), connection setup in Positron and RStudio, and best practices for credentials and connection management.

DBI is the foundational package. It defines the interface that all R database drivers implement: dbConnect() to open a connection, dbGetQuery() to run a SQL query and return results, dbDisconnect() to close the connection. If you know the SQL you want to run, DBI is the direct path.
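A minimal sketch of that connect, query, disconnect cycle is shown below. It uses an in-memory SQLite database through the RSQLite driver so it runs without any server; the table and column names are invented for the example, and a real project would substitute the driver and connection details for its own database.

```r
library(DBI)

# Open a connection. ":memory:" creates a throwaway SQLite database;
# for SQL Server or PostgreSQL you would use a different driver here.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a small made-up table so there is something to query.
dbWriteTable(
  con, "cases",
  data.frame(county = c("A", "A", "B"), n = c(3L, 5L, 2L))
)

# Run SQL and get the result back as a data frame.
totals <- dbGetQuery(
  con,
  "SELECT county, SUM(n) AS total FROM cases GROUP BY county ORDER BY county"
)

# Pass user-supplied values as parameters rather than pasting them into SQL.
county_a <- dbGetQuery(
  con,
  "SELECT county, n FROM cases WHERE county = ?",
  params = list("A")
)

# Always close the connection when finished.
dbDisconnect(con)

totals
```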

dbplyr extends dplyr to work against database tables. Instead of writing SQL, you write normal dplyr code (filter(), mutate(), summarize()) and dbplyr translates it to SQL and runs it on the database. This is useful when the team knows dplyr well and the database queries are not highly complex. The query is built lazily: nothing is pulled into R until you call collect(), which returns the results as a local data frame.
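The following sketch shows the same idea end to end, again against an in-memory SQLite database so it can run anywhere. The visits table and its columns are made up for illustration; it assumes dplyr, dbplyr, DBI, and RSQLite are installed.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# A small made-up table standing in for a real database table.
DBI::dbWriteTable(
  con, "visits",
  data.frame(
    county = c("A", "A", "B", "B"),
    year = c(2023L, 2023L, 2023L, 2022L),
    n = c(10L, 5L, 7L, 4L)
  )
)

# tbl() creates a reference to the remote table; no rows are fetched yet.
visits <- tbl(con, "visits")

by_county <- visits |>
  filter(year == 2023) |>
  group_by(county) |>
  summarize(total = sum(n, na.rm = TRUE)) |>
  arrange(county)

# Inspect the SQL that dbplyr generated from the dplyr code.
show_query(by_county)

# collect() runs the query on the database and returns a local tibble.
result <- collect(by_county)

DBI::dbDisconnect(con)

result
```

Calling show_query() before collect() is a useful habit: it confirms that the filtering and aggregation happen on the database rather than in R.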

A.7 APIs and Public Data

httr2 is the package for talking to web APIs from R, introduced in Chapter 13. Beyond the basics covered there, its documentation includes a “Wrapping APIs” vignette that walks through building a reusable interface to an API, useful if your team pulls from the same source in many projects.
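A reusable interface usually starts as one small function that builds every request the same way. The sketch below assumes httr2 is installed; the base URL, endpoint, query parameters, and user-agent string are all hypothetical placeholders, and the request is only constructed, not sent.

```r
library(httr2)

# Hypothetical wrapper: every request to this (made-up) API shares the
# same base URL, user agent, and retry policy.
example_api <- function(path, ...) {
  request("https://api.example.org/v1") |>
    req_url_path_append(path) |>
    req_url_query(...) |>
    req_user_agent("dstt-example (contact: you@example.org)") |>
    req_retry(max_tries = 3)
}

# Build a request for a hypothetical "counties" endpoint.
req <- example_api("counties", state = "WA", year = 2023)

# Inspect the URL that would be called.
req$url

# To actually send it and parse a JSON response:
# resp <- req_perform(req)
# data <- resp_body_json(resp)
```

Keeping request construction separate from req_perform() makes the wrapper easy to inspect and test without network access.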

Analyzing US Census Data (Walker 2023) is the definitive guide to working with census data in R via the tidycensus package: ACS and decennial data, geography and mapping, and the margins-of-error handling that census estimates require. Freely available online. For public health teams, this is where rate denominators come from.

The Socrata developer documentation covers the API behind data.cdc.gov and many state open data portals, including the SoQL query language for filtering and aggregating on the server.
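As a sketch of what server-side filtering and aggregation look like, the example below builds a SoQL query with httr2 without sending it. The dataset identifier xxxx-xxxx and the column names (state, cases, year) are placeholders, not a real dataset; substitute the identifier and fields of the dataset you are working with.

```r
library(httr2)

# "xxxx-xxxx" is a placeholder dataset ID; the column names are hypothetical.
req <- request("https://data.cdc.gov/resource/xxxx-xxxx.json") |>
  req_url_query(
    `$select` = "state, sum(cases) as total_cases", # columns and aggregates
    `$where` = "year = 2023",                       # row filter
    `$group` = "state",                             # aggregation level
    `$limit` = 1000L                                # cap on rows returned
  )

# Inspect the URL before sending anything.
req$url

# With a real dataset ID, the request could be sent and parsed like this:
# resp <- req_perform(req)
# rows <- resp_body_json(resp, simplifyVector = TRUE)
```

Pushing the filter and the aggregation into the query means only the summarized rows cross the network, which matters for large surveillance datasets.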

A.8 Migrating from Legacy Tools

R for Excel Users (Lowndes and Horst) is a free, workshop-format course aimed at analysts whose current workflow is Excel, covering the transition to R, RStudio/Positron, and reproducible project habits. A good group-study resource for teams working through the migration described in Chapter 14.

haven reads SAS, SPSS, and Stata files into R, preserving value labels and special missing values. Its documentation covers the details of labelled data that matter when converting legacy datasets to open formats.
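To show what "value labels and special missing values" look like once they are in R, the sketch below constructs a labelled vector and tagged missing values by hand, which is the same structure haven produces when it reads a legacy file. The variable names and codes are invented; it assumes haven is installed.

```r
library(haven)

# A coded variable with value labels, as it would arrive from SPSS or Stata.
sex <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

# Convert to a regular R factor that uses the labels as levels.
sex_factor <- as_factor(sex)
levels(sex_factor)

# Tagged missing values keep the distinction between kinds of missingness
# (here "r" and "d" are hypothetical codes, e.g. refused vs. don't know).
income <- c(52000, tagged_na("r"), 61000, tagged_na("d"))

is.na(income) # both tagged values count as missing
na_tag(income) # but the tag is still recoverable

# Reading real files uses the same structures, for example:
# df <- read_dta("path/to/file.dta")
# df <- read_sav("path/to/file.sav")
# df <- read_sas("path/to/file.sas7bdat")
```

Converting with as_factor() before writing to an open format such as CSV or Parquet keeps the human-readable labels rather than the bare numeric codes.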

A.9 Online Courses

The Johns Hopkins Data Science Specialization on Coursera is a 10-course sequence by Roger Peng, Jeff Leek, and Brian Caffo covering the complete applied data science workflow in R: tooling, data acquisition, exploratory analysis, reproducible research, statistical inference, regression, machine learning, and building data products. Individual courses can be audited for free. The courses most useful on their own for public health practitioners are:

R Programming: the core language, data structures, control flow, and functions. A solid foundation if you are new to R or coming from another language.
Exploratory Data Analysis: visualization and summarization as an investigative tool, not a reporting tool.
Reproducible Research: the rationale and practice of literate programming in R, closely aligned with the approach in Chapter 4.

The Johns Hopkins Executive Data Science Specialization (also by Peng, Leek, and Caffo) is a shorter, less technical sequence aimed at managers who lead or commission data science work. It pairs well with Chapter 18 and is useful for team leads who want to communicate more effectively with analysts.

A.10 Version Control

Happy Git and GitHub for the useR (Bryan) is the companion reference to Chapter 1. It covers installation and configuration in detail, the full range of workflows for collaborating on GitHub, and a substantial troubleshooting section for the inevitable moments when Git does something unexpected. Required reading for anyone who wants to move past the basics.

A.11 Quarto

The Quarto documentation at quarto.org is comprehensive and well maintained. Key sections beyond what this book covers:

Quarto Projects — organizing multiple documents with shared configuration, a prerequisite for generating many parameterized reports at once.
GitHub Actions for Quarto — automating rendering and publishing on push, so reports update without manual intervention.
Quarto Extensions — community-contributed output formats, shortcodes, and filters that extend what Quarto can produce.