11 Databases from R

```r
install.packages(c("DBI", "duckdb", "duckplyr", "dbplyr", "odbc"))
```
Spreadsheets and CSV files work well for small datasets, but many public health agencies store their surveillance data in relational databases: SQL Server, Oracle, PostgreSQL, or similar systems. These databases are managed by IT, shared across teams, and updated continuously as new data arrives. The data never leaves the server; analysts connect, run a query, and pull back only the rows and columns they need.
This chapter covers how to work with databases from R. The examples use DuckDB, an in-process analytical database that requires no server setup and installs as an R package. DuckDB is practical for local analysis of large files, but the same patterns (DBI, dbplyr) apply to institutional databases like SQL Server and PostgreSQL — only the connection call changes. The chapter ends with connection patterns for those remote systems.
Getting access to an institutional database involves IT, data stewards, and sometimes a data use agreement. That process is covered in Chapter 15. This chapter covers what to do once access is in hand.
11.1 DBI: The Common Interface
The DBI package defines a consistent interface for interacting with databases from R. Every database backend (DuckDB, SQL Server, PostgreSQL, SQLite) provides its own driver package, but the functions you call are the same regardless of which system you are connected to.
The four core DBI functions:
| Function | What it does |
|---|---|
| dbConnect() | Open a connection to a database |
| dbGetQuery() | Run a SQL SELECT and return results as a data frame |
| dbExecute() | Run SQL that modifies data (no results returned) |
| dbDisconnect() | Close the connection |
A connection always needs to be closed when you are done. The safest pattern is on.exit(), which runs the disconnect even if an error occurs before the end of the block:
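A minimal sketch of that pattern, using a DuckDB connection as a stand-in (the function name run_weekly_report is illustrative):

```r
library(DBI)
library(duckdb)

run_weekly_report <- function() {
  con <- dbConnect(duckdb::duckdb())
  # Guarantee the connection closes, even if a later step errors
  on.exit(dbDisconnect(con), add = TRUE)

  dbGetQuery(con, "SELECT 1 AS ok")
}
```

Because the disconnect is registered with on.exit() immediately after the connect, any error in the body still triggers cleanup when the function exits.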
11.2 DuckDB
DuckDB is an in-process analytical database designed for fast queries over large datasets. It runs inside your R session with no separate server, installs as a single package, and can read CSV and Parquet files directly without loading them into R first. It is a good fit for local analysis of data that is too large to work with comfortably as an R data frame, and its DBI interface is identical to remote databases, so code written against DuckDB transfers to SQL Server or PostgreSQL with only the connection call changed.
11.2.1 Setting up
Connect with dbConnect(). An in-memory database (which disappears when the connection closes) is the default:
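A sketch of the default in-memory connection:

```r
library(DBI)
library(duckdb)

# In-memory database: contents disappear when the connection closes
con <- dbConnect(duckdb::duckdb())
```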
To persist the database to disk — so it survives between sessions — pass a file path:
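For example (the file name flu.duckdb is illustrative):

```r
# On-disk database: persists between R sessions
con <- dbConnect(duckdb::duckdb(), dbdir = "flu.duckdb")
```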
11.2.2 Loading data
Use dbWriteTable() to load an R data frame into a DuckDB table. The examples in this chapter use a simulated weekly influenza surveillance dataset for Virginia counties, 2021–2023.
```r
set.seed(42)

counties <- c(
  "Fairfax", "Arlington", "Loudoun", "Prince William", "Alexandria",
  "Chesterfield", "Henrico", "Richmond", "Roanoke", "Virginia Beach"
)

pop <- c(
  1150000, 238000, 420000, 460000, 155000,
  340000, 330000, 230000, 100000, 460000
)

flu <- expand.grid(
  week = 1:52,
  year = 2021:2023,
  county = counties,
  stringsAsFactors = FALSE
)
flu$cases <- pmax(0L, round(rnorm(nrow(flu), mean = 15, sd = 10)))
flu$population <- rep(pop, each = 156)

nrow(flu)
```

```
[1] 1560
```
```r
dbWriteTable(con, "flu", flu, overwrite = TRUE)
```

To see which tables are in the database:

```r
dbListTables(con)
```

```
[1] "flu"
```
11.2.3 Querying with SQL
dbGetQuery() runs a SQL SELECT and returns the result as a data frame:
```r
dbGetQuery(
  con,
  "
  SELECT year, SUM(cases) AS total_cases
  FROM flu
  GROUP BY year
  ORDER BY year
  "
)
```

```
  year total_cases
1 2021        7852
2 2022        7802
3 2023        7732
```
DuckDB can also read CSV and Parquet files directly using SQL — without loading them into R first:
```r
# Read a CSV directly in a query
dbGetQuery(con, "SELECT * FROM read_csv('data/flu-2024.csv') LIMIT 5")

# Read a folder of Parquet files
dbGetQuery(con, "SELECT COUNT(*) FROM read_parquet('data/flu-*.parquet')")
```

11.3 dbplyr: dplyr Syntax on Any Database
Writing SQL by hand is workable for simple queries but awkward for anything that requires multiple joins, filters, and aggregations. The dbplyr package lets you write standard dplyr code against a database connection — it translates your code to SQL automatically.
11.3.1 Lazy tables
tbl() creates a reference to a database table. The result looks like a data frame but nothing has been pulled into R yet:
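For example, referencing the flu table loaded earlier (the object name flu_db is illustrative):

```r
library(dplyr)

# A lazy reference: no rows are fetched until you ask for them
flu_db <- tbl(con, "flu")
flu_db
```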
```
# Source:   table<flu> [?? x 5]
# Database: DuckDB 1.4.4 [root@Darwin 24.6.0:R 4.5.2/:memory:]
    week  year county  cases population
   <int> <int> <chr>   <dbl>      <dbl>
 1     1  2021 Fairfax    29    1150000
 2     2  2021 Fairfax     9    1150000
 3     3  2021 Fairfax    19    1150000
 4     4  2021 Fairfax    21    1150000
 5     5  2021 Fairfax    19    1150000
 6     6  2021 Fairfax    14    1150000
 7     7  2021 Fairfax    30    1150000
 8     8  2021 Fairfax    14    1150000
 9     9  2021 Fairfax    35    1150000
10    10  2021 Fairfax    14    1150000
# ℹ more rows
```
11.3.2 Writing queries with dplyr
dplyr verbs work on database tables the same way they work on data frames. The key difference is that operations are lazy: they accumulate until you ask for the results with collect().
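The summary printed below was produced by a pipeline along these lines (a reconstruction from the generated SQL shown by show_query() later in this section; the object name q matches the collect() call):

```r
library(dplyr)

q <- tbl(con, "flu") |>
  filter(year == 2023) |>
  group_by(county) |>
  summarize(
    total_cases = sum(cases),
    rate_per_100k = round(sum(cases) / first(population) * 100000, 1)
  ) |>
  arrange(desc(rate_per_100k))

q  # still lazy: printing fetches only a preview
```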
```
# Source:     SQL [?? x 3]
# Database:   DuckDB 1.4.4 [root@Darwin 24.6.0:R 4.5.2/:memory:]
# Ordered by: desc(rate_per_100k)
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709
 2 Alexandria             690          445.
 3 Richmond               851          370
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167
10 Fairfax                656           57
```
Notice the output header says Source: SQL [?? x 3] — the query has not run yet. Call collect() to execute and bring results into R:
```r
collect(q)
```

```
# A tibble: 10 × 3
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709
 2 Alexandria             690          445.
 3 Richmond               851          370
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167
10 Fairfax                656           57
```
11.3.3 Inspecting generated SQL
show_query() reveals the SQL that dbplyr would send to the database. This is useful for debugging, learning SQL, or handing a query off to a colleague who prefers SQL:
```r
show_query(q)
```

```
<SQL>
SELECT
  county,
  SUM(cases) AS total_cases,
  ROUND_EVEN((SUM(cases) / FIRST(population)) * 100000.0, CAST(ROUND(1.0, 0) AS INTEGER)) AS rate_per_100k
FROM (
  SELECT flu.*
  FROM flu
  WHERE ("year" = 2023.0)
) q01
GROUP BY county
ORDER BY rate_per_100k DESC
```
The generated SQL is valid and efficient. For most queries, you never need to write SQL directly — but the ability to inspect it means you always know what is happening.
Pull as little data as possible. Filter, aggregate, and join on the database side before calling collect(). Pulling a million rows to R and then filtering in R is far slower than filtering in SQL and pulling only the rows you need.
11.4 duckplyr: DuckDB as a dplyr Backend
duckplyr takes a different approach. Instead of requiring an explicit connection, it makes DuckDB the execution engine for ordinary dplyr operations on data frames. You convert a data frame once with as_duckdb_tibble(), and subsequent dplyr operations run through DuckDB rather than R’s default in-memory engine.
```r
library(duckplyr)

flu_duck <- as_duckdb_tibble(flu)
```

The result looks and behaves like a tibble, but dplyr verbs execute via DuckDB:
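The table below comes from the same county summary used in the dbplyr section, run directly on flu_duck (a sketch; note that no collect() is needed, since duckplyr returns results directly):

```r
library(dplyr)

flu_duck |>
  filter(year == 2023) |>
  group_by(county) |>
  summarize(
    total_cases = sum(cases),
    rate_per_100k = round(sum(cases) / first(population) * 100000, 1)
  ) |>
  arrange(desc(rate_per_100k))
```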
```
# A tibble: 10 × 3
   county         total_cases rate_per_100k
   <chr>                <dbl>         <dbl>
 1 Roanoke                709          709
 2 Alexandria             690          445.
 3 Richmond               851          370
 4 Arlington              836          351.
 5 Chesterfield           887          261.
 6 Henrico                674          204.
 7 Loudoun                845          201.
 8 Prince William         816          177.
 9 Virginia Beach         768          167
10 Fairfax                656           57
```
11.4.1 How duckplyr differs from dbplyr
| | dbplyr | duckplyr |
|---|---|---|
| Connection | Explicit (dbConnect()) | None needed |
| Data location | Database table | R data frame |
| Use case | Query a database | Speed up work on large data frames |
| Result | Lazy (need collect()) | Returned directly |
duckplyr is most useful when you have a large data frame already in R and want faster aggregations without the overhead of a separate database connection. dbplyr is the right tool when data lives in a database and you want to query it without pulling it all into R.
duckplyr falls back to standard dplyr automatically when it encounters a verb or expression it cannot execute through DuckDB. The output is always correct — the question is only whether the fast path was used. You can see fallback events with duckplyr::fallback_review().
11.5 Connecting to Institutional Databases
In practice, surveillance data often lives on a managed SQL Server, Oracle, or PostgreSQL server that IT administers. The odbc package connects R to these systems through their ODBC drivers. The connection syntax differs from DuckDB, but everything else — DBI functions, dbplyr queries — works identically.
11.5.1 SQL Server
The most common database at state and local health departments. Requires the ODBC Driver for SQL Server, which IT typically installs on managed machines:
```r
library(odbc)

con <- dbConnect(
  odbc(),
  driver = "ODBC Driver 18 for SQL Server",
  server = "db.health.state.gov",
  database = "surveillance",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PASSWORD"),
  encrypt = "yes"
)
```

11.5.2 PostgreSQL
Common in cloud deployments and some state agencies:
```r
con <- dbConnect(
  odbc(),
  driver = "PostgreSQL Unicode",
  server = "db.health.state.gov",
  database = "surveillance",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PASSWORD")
)
```

11.5.3 Storing credentials safely
Never put database credentials in code or commit them to version control. Use environment variables instead. Store them in a project-level .Renviron file (which your .gitignore should exclude):
```
# .Renviron — do not commit this file
DB_USER=yourname
DB_PASSWORD=yourpassword
```
Retrieve them in code with Sys.getenv(), as shown in the connection examples above. An .Renviron file in the project root is loaded automatically when R starts in that directory. This keeps credentials out of scripts and repositories while making them available to anyone with appropriate access to the file.
Add .Renviron to your .gitignore before adding credentials to it. A credential committed to a repository is a credential that may be exposed.
11.5.4 Once connected, everything is the same
After the connection is open, all DBI and dbplyr patterns work identically regardless of which database system you are connected to:
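For example, once con points at an institutional server instead of DuckDB, the earlier patterns carry over unchanged (a sketch; the flu table and its columns are assumed to exist on that server):

```r
library(DBI)
library(dplyr)

# DBI: raw SQL, identical across backends
dbGetQuery(con, "SELECT COUNT(*) AS n FROM flu")

# dbplyr: the same verbs, translated to the server's SQL dialect
tbl(con, "flu") |>
  filter(year == 2023) |>
  count(county) |>
  collect()
```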
11.6 Practical Guidance
Always close connections. Use on.exit(dbDisconnect(con), add = TRUE) immediately after opening a connection so it closes even if an error interrupts the script.
Use dbplyr instead of writing SQL. Unless you need a SQL feature that dbplyr does not support, dplyr verbs are easier to read, easier to debug, and work across database backends without modification.
Filter on the database side. Every row you pull into R costs time and memory. Apply filter(), group_by(), and summarize() before collect() so you bring back only the result, not the raw data.
Check the data types after collect(). Databases and R sometimes disagree on types (dates, integers, NULLs vs. NA). Validate with str() or glimpse() after the first collect() and add type coercions to your setup block if needed.