Memory Management and Large Files
Overview
BLSloadR handles large datasets from the Bureau of Labor Statistics, some exceeding 500MB in compressed format. This vignette explains how the package manages memory efficiently and how users can avoid crashes when working with large files.
Memory Optimizations (Automatic)
As of version 0.3.0, BLSloadR includes several automatic optimizations:
1. Smart Data Type Detection
Instead of reading all columns as character strings (memory-intensive), the package now:
- Samples the first 10,000 rows of large files (>50MB)
- Detects numeric columns automatically
- Uses an optimized colClasses specification to reduce memory usage by 50-70%

You don't need to do anything; this happens automatically when you download data.
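To illustrate the idea (this is a simplified sketch, not the package's actual internals), the two-pass approach can be written in base R: sample the first rows, check which columns parse cleanly as numeric, then read the full file with an explicit colClasses. The helper name detect_col_classes is hypothetical.

```r
# Sketch of two-pass type detection (illustrative, not BLSloadR internals)
detect_col_classes <- function(path, sample_rows = 10000) {
  sampled <- utils::read.delim(path, nrows = sample_rows,
                               stringsAsFactors = FALSE)
  vapply(sampled, function(col) {
    # A column counts as numeric if every sampled value parses without NA
    if (!anyNA(suppressWarnings(as.numeric(col)))) "numeric" else "character"
  }, character(1))
}

# Demonstrate on a small tab-delimited file
tmp <- tempfile(fileext = ".txt")
write.table(
  data.frame(series_id = c("LNS14000000", "LNS12000000"),
             value     = c(3.7, 60.1)),
  tmp, sep = "\t", row.names = FALSE, quote = FALSE
)
classes <- detect_col_classes(tmp)
full <- utils::read.delim(tmp, colClasses = classes)  # typed read, less memory
```

Reading with explicit colClasses avoids R's character-first parsing for large columns of numbers, which is where the memory savings come from.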
2. Improved Subset Caching
Functions like get_cps_subset() now:
- Check for valid cached subsets before downloading the master file
- Avoid unnecessary 300-500MB downloads when cached results exist
- Save filtered results separately for faster subsequent access
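The check-cache-first flow can be sketched as follows. The function cached_subset() and its key scheme are illustrative assumptions, not the package's real implementation:

```r
# Illustrative cache-before-download flow (not BLSloadR's real internals)
cached_subset <- function(series_ids, cache_dir, fetch) {
  key  <- paste(sort(series_ids), collapse = "_")
  file <- file.path(cache_dir, paste0("subset_", key, ".rds"))
  if (file.exists(file)) {
    return(readRDS(file))        # cache hit: skip the 300-500MB download
  }
  result <- fetch(series_ids)    # cache miss: download and filter
  saveRDS(result, file)          # save the filtered subset for next time
  result
}

# Demonstrate with a stand-in for the real download
cache_dir <- tempfile(); dir.create(cache_dir)
downloads <- 0
fake_fetch <- function(ids) {
  downloads <<- downloads + 1
  data.frame(series_id = ids, value = seq_along(ids))
}
first  <- cached_subset("LNS14000000", cache_dir, fake_fetch)
second <- cached_subset("LNS14000000", cache_dir, fake_fetch)  # from disk
```

After the first call, the fetch function is never invoked again for the same series IDs.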
Reducing Memory Usage
Use Caching Strategically
Enable persistent caching to avoid re-downloading large files:
```r
# Set environment variable (in .Renviron)
# USE_BLS_CACHE=TRUE

# Or use the cache_dir parameter
get_cps_subset(
  series_ids = "LNS14000000",
  cache = TRUE,
  cache_dir = "C:/my_bls_cache"  # Persistent location
)
```
Request Only Needed Series
Instead of downloading the full dataset and filtering in R, use built-in filters:
```r
# GOOD: Download only specific series
unemployment <- get_cps_subset(
  series_ids = c("LNS14000000", "LNS12000000"),
  cache = TRUE
)

# LESS EFFICIENT: Download the full file, then filter
# (Though the package optimizes this with subset caching)
```
Environment Variables for Testing
Skip Memory-Intensive Tests
If you’re developing or running tests in a memory-constrained environment:
```r
# Set in .Renviron or before running tests
Sys.setenv(SKIP_MEMORY_TESTS = "TRUE")

# Then run tests
devtools::test()
```
This will skip tests that download the large ln.data.1.AllData file.
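Inside a testthat suite the idiomatic guard is testthat::skip_if(); the same logic can be sketched dependency-free (run_memory_test() is a hypothetical name used only for illustration):

```r
# Minimal sketch of an environment-variable test guard
# (testthat::skip_if() is the idiomatic equivalent inside a test suite)
run_memory_test <- function(test_fn) {
  if (identical(Sys.getenv("SKIP_MEMORY_TESTS"), "TRUE")) {
    message("Skipping memory-intensive test")
    return(invisible(NULL))
  }
  test_fn()
}

Sys.setenv(SKIP_MEMORY_TESTS = "TRUE")
skipped <- run_memory_test(function() "would download ln.data.1.AllData")

Sys.setenv(SKIP_MEMORY_TESTS = "FALSE")
ran <- run_memory_test(function() "small test ran")
```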
Control Cache Location
```r
# Set a custom cache directory
Sys.setenv(BLS_CACHE_DIR = "D:/BLS_data_cache")

# Enable caching globally
Sys.setenv(USE_BLS_CACHE = "TRUE")
```
Troubleshooting Memory Issues
Symptoms
- R crashes during devtools::test()
- R crashes when calling get_cps_subset() with characteristics
- "Cannot allocate vector of size…" errors
- Positron/RStudio becomes unresponsive
Solutions
1. Close other applications to free RAM before downloading large files.
2. Use caching to avoid re-downloading during repeated operations:
```r
Sys.setenv(USE_BLS_CACHE = "TRUE")
```
3. Use specific series IDs instead of characteristic filters when possible:
```r
# Discover series first
explore_cps_series(
  search = "unemployment",
  characteristics = list(sexs_code = "1")
)

# Then download specific series
get_cps_subset(series_ids = c("LNS13000001", "LNS14000001"))
```
4. Skip memory-intensive tests during development:
```r
Sys.setenv(SKIP_MEMORY_TESTS = "TRUE")
```
5. Increase R's memory limit (Windows only; note that memory.limit() is no longer supported from R 4.2 onward, where Windows memory is managed automatically):
```r
# Set to 8GB (if you have sufficient RAM)
memory.limit(size = 8000)
```
6. Use 64-bit R (not 32-bit) to access more memory.
Understanding File Sizes
Common BLS dataset sizes:
| Dataset | Typical Size | In-Memory (Character) | In-Memory (Optimized) |
|---|---|---|---|
| CES AllData | ~500 MB | ~2-3 GB | ~800 MB - 1.2 GB |
| CPS (LN) AllData | ~400 MB | ~1.5-2 GB | ~600-800 MB |
| LAUS County | ~350 MB | ~1.2-1.8 GB | ~500-700 MB |
| CES Single State | ~10-30 MB | ~50-150 MB | ~20-60 MB |
Note: “In-Memory (Optimized)” reflects the improvements in BLSloadR 0.3.0+.
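The gap between the "Character" and "Optimized" columns can be seen directly with base R's object.size(); the numbers below are indicative, not a benchmark of any particular BLS file:

```r
# Why typed columns matter: the same one million values stored as strings
# versus as a numeric vector
n <- 1e6
values     <- runif(n)
as_numeric <- object.size(values)                # 8 bytes per value
as_strings <- object.size(as.character(values))  # several times larger
print(as_numeric, units = "auto")
print(as_strings, units = "auto")
```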
Best Practices
- Start small: Test your code with specific states or series before downloading full datasets
- Use persistent caching: Set up a permanent cache directory to avoid re-downloading
- Monitor memory: Use pryr::mem_used() or Task Manager to track memory usage
- Close and restart R: Restart between large operations to ensure a clean memory state
- Leverage subset caching: Functions like get_cps_subset() cache filtered results, making subsequent queries fast
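If you prefer not to install pryr, memory can be spot-checked with base R alone (sizes will vary by system):

```r
# Spot-check memory with base R (no extra packages needed)
big <- data.frame(value     = runif(1e5),
                  series_id = sample(LETTERS, 1e5, replace = TRUE))
size_mb <- as.numeric(object.size(big)) / 1024^2
cat(sprintf("big occupies %.1f MB\n", size_mb))

rm(big)          # drop the reference between large operations
invisible(gc())  # run garbage collection to reclaim the freed memory
```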
Example Workflow
```r
# 1. Set up persistent caching (one time)
Sys.setenv(USE_BLS_CACHE = "TRUE")
Sys.setenv(BLS_CACHE_DIR = "C:/BLS_cache")

# 2. Explore available series
cps_explore <- explore_cps_series(search = "unemployment rate")

# 3. Download specific series (uses cache)
unemployment <- get_cps_subset(
  series_ids = c("LNS14000000", "LNS14000001", "LNS14000002"),
  cache = TRUE
)

# 4. Subsequent calls use the cached subset (very fast)
unemployment_refresh <- get_cps_subset(
  series_ids = c("LNS14000000", "LNS14000001", "LNS14000002"),
  cache = TRUE
)
```
Getting Help
If you continue to experience memory issues:
- Check your system’s available RAM
- Review the file size warnings in the console
- Consider using more targeted queries
- Open an issue at https://github.com/schmidtDETR/BLSloadR/issues with:
- Your system RAM
- The specific function call that’s failing
- Any error messages
The BLSloadR package is designed to handle large files efficiently, but memory use is ultimately bounded by your system's available RAM.