11  Databases from R

Spreadsheets and CSV files work well for small datasets, but many public health agencies store their surveillance data in relational databases: SQL Server, Oracle, PostgreSQL, or similar systems. These databases are managed by IT, shared across teams, and updated continuously as new data arrives. The data never leaves the server; analysts connect, run a query, and pull back only the rows and columns they need.

This chapter covers how to work with databases from R. The examples use DuckDB, an in-process analytical database that requires no server setup and installs as an R package. DuckDB is practical for local analysis of large files, but the same patterns (DBI, dbplyr) apply to institutional databases like SQL Server and PostgreSQL — only the connection call changes. The chapter ends with connection patterns for those remote systems.

Note

Getting access to an institutional database involves IT, data stewards, and sometimes a data use agreement. That process is covered in Chapter 15. This chapter covers what to do once access is in hand.

11.1 DBI: The Common Interface

The DBI package defines a consistent interface for interacting with databases from R. Every database backend (DuckDB, SQL Server, PostgreSQL, SQLite) provides its own driver package, but the functions you call are the same regardless of which system you are connected to.

install.packages(c("DBI", "duckdb", "duckplyr", "dbplyr", "odbc"))

The four core DBI functions:

Function         What it does
dbConnect()      Open a connection to a database
dbGetQuery()     Run a SQL SELECT and return the results as a data frame
dbExecute()      Run SQL that modifies data (returns an affected-row count, not a result set)
dbDisconnect()   Close the connection
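Put together, the four functions cover a complete session. A minimal, self-contained round trip with an in-memory DuckDB database (the table name and values here are illustrative):

```r
library(DBI)

# Open a connection to an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb())

# dbExecute() for SQL that modifies data; it returns the number of rows affected
dbExecute(con, "CREATE TABLE counts (county VARCHAR, n INTEGER)")
dbExecute(con, "INSERT INTO counts VALUES ('Fairfax', 12), ('Henrico', 7)")

# dbGetQuery() for SELECTs; it returns a data frame
res <- dbGetQuery(con, "SELECT SUM(n) AS total FROM counts")
res$total  # 19

# Always close the connection
dbDisconnect(con)
```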

A connection always needs to be closed when you are done. Inside a function, the safest pattern is on.exit(), which registers the disconnect to run when the function exits — even if an error occurs first:

con <- dbConnect(duckdb::duckdb())
on.exit(dbDisconnect(con), add = TRUE)

# ... work with the database ...

11.2 DuckDB

DuckDB is an in-process analytical database designed for fast queries over large datasets. It runs inside your R session with no separate server, installs as a single package, and can read CSV and Parquet files directly without loading them into R first. It is a good fit for local analysis of data that is too large to work with comfortably as an R data frame, and its DBI interface is identical to remote databases, so code written against DuckDB transfers to SQL Server or PostgreSQL with only the connection call changed.

11.2.1 Setting up

Connect with dbConnect(). An in-memory database (which disappears when the connection closes) is the default:

con <- dbConnect(duckdb::duckdb())

To persist the database to disk — so it survives between sessions — pass a file path:

con <- dbConnect(duckdb::duckdb(dbdir = "surveillance.duckdb"))

11.2.2 Loading data

Use dbWriteTable() to load an R data frame into a DuckDB table. The examples in this chapter use a simulated weekly influenza surveillance dataset for Virginia counties, 2021–2023.

set.seed(42)
counties <- c(
  "Fairfax",
  "Arlington",
  "Loudoun",
  "Prince William",
  "Alexandria",
  "Chesterfield",
  "Henrico",
  "Richmond",
  "Roanoke",
  "Virginia Beach"
)
pop <- c(
  1150000,
  238000,
  420000,
  460000,
  155000,
  340000,
  330000,
  230000,
  100000,
  460000
)

flu <- expand.grid(
  week = 1:52,
  year = 2021:2023,
  county = counties,
  stringsAsFactors = FALSE
)
flu$cases <- pmax(0L, round(rnorm(nrow(flu), mean = 15, sd = 10)))
flu$population <- rep(pop, each = 156)

nrow(flu)
[1] 1560
dbWriteTable(con, "flu", flu, overwrite = TRUE)

To see which tables are in the database, use dbListTables():

dbListTables(con)
[1] "flu"
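Loading is rarely a one-time event: surveillance tables grow as new weeks arrive. Passing append = TRUE to dbWriteTable() adds rows to an existing table instead of replacing it. A self-contained sketch with illustrative rows:

```r
library(DBI)
con <- dbConnect(duckdb::duckdb())

# An existing table with two weeks of data
flu <- data.frame(week = 1:2, year = 2024L, county = "Fairfax",
                  cases = c(10L, 12L), population = 1150000)
dbWriteTable(con, "flu", flu, overwrite = TRUE)

# A new week arrives: append instead of overwrite
new_week <- data.frame(week = 3L, year = 2024L, county = "Fairfax",
                       cases = 18L, population = 1150000)
dbWriteTable(con, "flu", new_week, append = TRUE)

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM flu")$n
n_rows  # 3
dbDisconnect(con)
```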

11.2.3 Querying with SQL

dbGetQuery() runs a SQL SELECT and returns the result as a data frame:

dbGetQuery(
  con,
  "
  SELECT year, SUM(cases) AS total_cases
  FROM flu
  GROUP BY year
  ORDER BY year
"
)
  year total_cases
1 2021        7852
2 2022        7802
3 2023        7732

DuckDB can also read CSV and Parquet files directly using SQL — without loading them into R first:

# Read a CSV directly in a query
dbGetQuery(con, "SELECT * FROM read_csv('data/flu-2024.csv') LIMIT 5")

# Read a folder of Parquet files
dbGetQuery(con, "SELECT COUNT(*) FROM read_parquet('data/flu-*.parquet')")
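When the same file will be queried repeatedly, it can be worth materializing it as a table once so later queries do not re-parse the CSV. A self-contained sketch (it writes a small temporary CSV so the example runs on its own; the table name flu_new is illustrative):

```r
library(DBI)
con <- dbConnect(duckdb::duckdb())

# Write a small CSV to a temporary path so the sketch stands alone
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(week = 1:3, cases = c(5L, 8L, 2L)),
          csv_path, row.names = FALSE)

# Materialize the file as a table; later queries hit the table, not the file
dbExecute(con, sprintf(
  "CREATE OR REPLACE TABLE flu_new AS SELECT * FROM read_csv('%s')", csv_path
))

total <- dbGetQuery(con, "SELECT SUM(cases) AS total FROM flu_new")$total
total  # 15
dbDisconnect(con)
```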

11.3 dbplyr: dplyr Syntax on Any Database

Writing SQL by hand is workable for simple queries but awkward for anything that requires multiple joins, filters, and aggregations. The dbplyr package lets you write standard dplyr code against a database connection — it translates your code to SQL automatically.

11.3.1 Lazy tables

tbl() creates a reference to a database table. The result looks like a data frame but nothing has been pulled into R yet:

library(dplyr)
library(dbplyr)

flu_tbl <- tbl(con, "flu")
flu_tbl
# Source:   table<flu> [?? x 5]
# Database: DuckDB 1.4.4 [root@Darwin 24.6.0:R 4.5.2/:memory:]
    week  year county  cases population
   <int> <int> <chr>   <dbl>      <dbl>
 1     1  2021 Fairfax    29    1150000
 2     2  2021 Fairfax     9    1150000
 3     3  2021 Fairfax    19    1150000
 4     4  2021 Fairfax    21    1150000
 5     5  2021 Fairfax    19    1150000
 6     6  2021 Fairfax    14    1150000
 7     7  2021 Fairfax    30    1150000
 8     8  2021 Fairfax    14    1150000
 9     9  2021 Fairfax    35    1150000
10    10  2021 Fairfax    14    1150000
# ℹ more rows

11.3.2 Writing queries with dplyr

dplyr verbs work on database tables the same way they work on data frames. The key difference is that operations are lazy: they accumulate until you ask for the results with collect().

q <- flu_tbl |>
  filter(year == 2023) |>
  group_by(county) |>
  summarize(
    total_cases = sum(cases, na.rm = TRUE),
    rate_per_100k = round(
      sum(cases, na.rm = TRUE) / first(population) * 100000,
      1
    )
  ) |>
  arrange(desc(rate_per_100k))

q
# Source:     SQL [?? x 3]
# Database:   DuckDB 1.4.4 [root@Darwin 24.6.0:R 4.5.2/:memory:]
# Ordered by: desc(rate_per_100k)
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709 
 2 Alexandria             690          445.
 3 Richmond               851          370 
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167 
10 Fairfax                656           57 

Notice the output header says Source: SQL [?? x 3] — printing fetched only a preview, and the full query has not run. Call collect() to execute it and bring the complete result into R:

collect(q)

# A tibble: 10 × 3
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709 
 2 Alexandria             690          445.
 3 Richmond               851          370 
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167 
10 Fairfax                656           57 
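The same laziness applies to joins: a lookup table loaded into the database can be joined to the surveillance table entirely inside the database, with only the joined summary collected. A self-contained sketch using a hypothetical health-region lookup:

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "flu",
             data.frame(county = c("Fairfax", "Roanoke"),
                        cases  = c(20L, 35L)))

# Hypothetical lookup table mapping counties to health regions
dbWriteTable(con, "regions",
             data.frame(county = c("Fairfax", "Roanoke"),
                        region = c("Northern", "Southwest")))

# The join, grouping, and summing all run inside DuckDB;
# only the two summary rows cross the connection
result <- tbl(con, "flu") |>
  left_join(tbl(con, "regions"), by = "county") |>
  group_by(region) |>
  summarize(total = sum(cases, na.rm = TRUE)) |>
  collect()

dbDisconnect(con)
```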

11.3.3 Inspecting generated SQL

show_query() reveals the SQL that dbplyr would send to the database. This is useful for debugging, learning SQL, or handing a query off to a colleague who prefers SQL:

show_query(q)

<SQL>
SELECT
  county,
  SUM(cases) AS total_cases,
  ROUND_EVEN((SUM(cases) / FIRST(population)) * 100000.0, CAST(ROUND(1.0, 0) AS INTEGER)) AS rate_per_100k
FROM (
  SELECT flu.*
  FROM flu
  WHERE ("year" = 2023.0)
) q01
GROUP BY county
ORDER BY rate_per_100k DESC

The generated SQL is valid and efficient. For most queries, you never need to write SQL directly — but the ability to inspect it means you always know what is happening.

Tip

Pull as little data as possible. Filter, aggregate, and join on the database side before calling collect(). Pulling a million rows to R and then filtering in R is far slower than filtering in SQL and pulling only the rows you need.
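The difference is easy to see side by side. Both pipelines below give the same answer, but the first ships one summary row across the connection while the second ships every matching row before R does any work (a self-contained sketch with illustrative data):

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "flu",
             data.frame(year = rep(2021:2023, each = 100), cases = 1L))

# Good: the database filters and aggregates; R receives one row
good <- tbl(con, "flu") |>
  filter(year == 2023) |>
  summarize(total = sum(cases, na.rm = TRUE)) |>
  collect()

# Wasteful: every row crosses the connection before R filters
bad <- tbl(con, "flu") |>
  collect() |>
  filter(year == 2023) |>
  summarize(total = sum(cases))

good$total == bad$total  # TRUE — same answer, very different cost
dbDisconnect(con)
```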

11.4 duckplyr: DuckDB as a dplyr Backend

duckplyr takes a different approach. Instead of requiring an explicit connection, it makes DuckDB the execution engine for ordinary dplyr operations on data frames. You convert a data frame once with as_duckdb_tibble(), and subsequent dplyr operations run through DuckDB rather than R’s default in-memory engine.

library(duckplyr)
flu_duck <- as_duckdb_tibble(flu)

The result looks and behaves like a tibble, but dplyr verbs execute via DuckDB:

flu_duck |>
  filter(year == 2023) |>
  group_by(county) |>
  summarize(
    total_cases = sum(cases),
    rate_per_100k = round(sum(cases) / first(population) * 100000, 1)
  ) |>
  arrange(desc(rate_per_100k))
# A tibble: 10 × 3
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709 
 2 Alexandria             690          445.
 3 Richmond               851          370 
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167 
10 Fairfax                656           57 

11.4.1 How duckplyr differs from dbplyr

                dbplyr                    duckplyr
Connection      Explicit (dbConnect())    None needed
Data location   Database table            R data frame
Use case        Query a database          Speed up work on large data frames
Result          Lazy (needs collect())    Returned directly

duckplyr is most useful when you have a large data frame already in R and want faster aggregations without the overhead of a separate database connection. dbplyr is the right tool when data lives in a database and you want to query it without pulling it all into R.

Note

duckplyr falls back to standard dplyr automatically when it encounters a verb or expression it cannot execute through DuckDB. The output is always correct — the question is only whether the fast path was used. You can see fallback events with duckplyr::fallback_review().

11.5 Connecting to Institutional Databases

In practice, surveillance data often lives on a managed SQL Server, Oracle, or PostgreSQL server that IT administers. The odbc package provides R drivers for these systems. The connection syntax differs from DuckDB, but everything else — DBI functions, dbplyr queries — works identically.

11.5.1 SQL Server

SQL Server is the most common database at state and local health departments. Connecting requires Microsoft's ODBC Driver for SQL Server, which IT typically installs on managed machines:

library(odbc)

con <- dbConnect(
  odbc(),
  driver = "ODBC Driver 18 for SQL Server",
  server = "db.health.state.gov",
  database = "surveillance",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PASSWORD"),
  encrypt = "yes"
)

11.5.2 PostgreSQL

Common in cloud deployments and some state agencies:

con <- dbConnect(
  odbc(),
  driver = "PostgreSQL Unicode",
  server = "db.health.state.gov",
  database = "surveillance",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PASSWORD")
)

11.5.3 Storing credentials safely

Never put database credentials in code or commit them to version control. Use environment variables instead. Store them in a project-level .Renviron file (which your .gitignore should exclude):

# .Renviron  — do not commit this file
DB_USER=yourname
DB_PASSWORD=yourpassword

Retrieve them in code with Sys.getenv(), as shown in the connection examples above. An .Renviron file in the project root is loaded automatically when R starts in that directory. This keeps credentials out of scripts and repositories while making them available to anyone with appropriate access to the file.
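One wrinkle: Sys.getenv() returns an empty string, not an error, when a variable is unset, so a connection attempt with missing credentials fails later with a confusing driver error. A small guard makes the failure immediate and clear (a sketch; get_credential is a hypothetical helper, and the Sys.setenv call stands in for a real .Renviron entry):

```r
# Fail early, with a clear message, if a credential is not set.
# Sys.getenv() returns "" (not NA) for unset variables.
get_credential <- function(name) {
  value <- Sys.getenv(name)
  if (!nzchar(value)) {
    stop("Environment variable ", name,
         " is not set; check your .Renviron", call. = FALSE)
  }
  value
}

Sys.setenv(DB_USER = "yourname")  # for illustration only
get_credential("DB_USER")  # "yourname"
```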

Warning

Add .Renviron to your .gitignore before adding credentials to it. A credential committed to a repository is a credential that may be exposed.

11.5.4 Once connected, everything is the same

After the connection is open, all DBI and dbplyr patterns work identically regardless of which database system you are connected to:

# Works on SQL Server, PostgreSQL, DuckDB, etc.
cases_tbl <- tbl(con, "flu_cases")

cases_tbl |>
  filter(year == 2023, county == "Fairfax") |>
  arrange(week) |>
  collect()

11.6 Practical Guidance

Always close connections. Inside a function, call on.exit(dbDisconnect(con), add = TRUE) immediately after opening a connection so it closes even if an error interrupts the work; in a top-level script, make sure dbDisconnect(con) runs at the end.
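Because on.exit() takes effect inside a function, a common pattern is to wrap each unit of database work in a small function that opens the connection, registers the disconnect, and returns the result. A self-contained sketch (the table and query are illustrative):

```r
library(DBI)

# Wrap database work in a function so on.exit() guarantees cleanup
fetch_totals <- function() {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con), add = TRUE)  # runs even if a query errors

  dbExecute(con, "CREATE TABLE flu (year INTEGER, cases INTEGER)")
  dbExecute(con, "INSERT INTO flu VALUES (2023, 4), (2023, 6)")
  dbGetQuery(con, "SELECT year, SUM(cases) AS total FROM flu GROUP BY year")
}

fetch_totals()  # one row: year 2023, total 10
```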

Use dbplyr instead of writing SQL. Unless you need a SQL feature that dbplyr does not support, dplyr verbs are easier to read, easier to debug, and work across database backends without modification.

Filter on the database side. Every row you pull into R costs time and memory. Apply filter(), group_by(), and summarize() before collect() so you bring back only the result, not the raw data.

Check the data types after collect(). Databases and R sometimes disagree on types (dates, integers, NULLs vs. NA). Validate with str() or glimpse() after the first collect() and add type coercions to your setup block if needed.
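A typical setup block coerces types once, immediately after the first collect(), so downstream code can rely on them. A sketch (the data frame below is a stand-in for a collect() result, and the column names are illustrative):

```r
# Databases often hand back dates as character and counts as doubles;
# coerce once, right after collect(), and the rest of the script is safe
cases <- data.frame(report_date = c("2023-01-02", "2023-01-09"),
                    cases = c(4, 6))  # stand-in for a collect() result

cases$report_date <- as.Date(cases$report_date)
cases$cases <- as.integer(cases$cases)

str(cases)
```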