9 min read
Data
ETL
Case study

9 million rows, 17 years, one ETL: how I cleaned the H-1B dataset

17 years of US Department of Labor H-1B filings. About 9 million rows. 260 columns in the worst year. 4 different file eras that each renamed everything. SOC codes turned into dates by Excel. Wages encoded 3 different ways. Cities typed in as addresses. This is the battle log. Each rule the pipeline knows today is a scar from a specific fight.

By Mathieu Régis
Freelance full-stack developer. Builds and ships websites that rank on Google and convert visitors.

How do you load a 260-column Excel file without crashing?

The first target was FY2024 Q4. 29 clean columns, sane names, Excel behaving. The pipeline read it in 2 minutes. Confidence was high. Then FY2019 arrived. 260 columns. Names ending in _1 through _10, one for each worksite. The same wage stored in WAGE_RATE_OF_PAY_FROM_1 and 9 other places. The naive loader hit 3GB of RAM and crashed. First lesson: read only what you need. Column-scoped reads cut memory by 8x and runtime by 5x.
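Here is a minimal sketch of a column-scoped read, assuming pandas with an Excel engine like openpyxl; the `WANTED` list and `load_columns` helper are illustrative names, not the pipeline's actual code, and the real pipeline also matches each era's historical aliases at this step.

```python
import pandas as pd

# Fields the pipeline actually needs; the other 200+ columns never leave disk.
WANTED = ["CASE_NUMBER", "EMPLOYER_NAME", "SOC_CODE", "SOC_TITLE",
          "WAGE_RATE_OF_PAY_FROM", "WAGE_UNIT_OF_PAY", "WORKSITE_STATE"]

def load_columns(path: str) -> pd.DataFrame:
    # Read only the header row to see what this era's file calls its columns.
    header = pd.read_excel(path, nrows=0).columns
    keep = [c for c in header if c in WANTED]
    # usecols makes pandas skip every unwanted column during parsing;
    # dtype=str stops pandas from re-guessing types on top of Excel's damage.
    return pd.read_excel(path, usecols=keep, dtype=str)
```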

How do you handle column names that change every 4 years?

FY2008 to FY2009 calls the primary key CASE_NO. FY2010 to FY2014 renames it to LCA_CASE_NUMBER with an LCA_CASE_ prefix on every field. FY2015 to FY2018 drops the prefix. FY2019 adds the _1 suffix. FY2020 settles on CASE_NUMBER. Every field had this history. EMPLOYER_NAME was once NAME, then LCA_CASE_EMPLOYER_NAME. SOC_TITLE was OCCUPATIONAL_TITLE, then LCA_CASE_SOC_NAME, then SOC_NAME.

The answer was a COLUMN_ALIASES dictionary with about 50 entries. Every historical name gets rewritten to the canonical FY2020+ form before any downstream logic runs. From that point on, the pipeline only speaks one language. Adding a new era becomes adding a few lines to the dictionary.
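A sketch of that alias map, with a few of the renames named above; the real dictionary has about 50 entries, and the `canonicalize` wrapper is hypothetical.

```python
COLUMN_ALIASES = {
    "CASE_NO": "CASE_NUMBER",                   # FY2008-FY2009
    "LCA_CASE_NUMBER": "CASE_NUMBER",           # FY2010-FY2014
    "NAME": "EMPLOYER_NAME",                    # FY2008-FY2009
    "LCA_CASE_EMPLOYER_NAME": "EMPLOYER_NAME",  # FY2010-FY2014
    "OCCUPATIONAL_TITLE": "SOC_TITLE",          # FY2008-FY2009
    "LCA_CASE_SOC_NAME": "SOC_TITLE",           # FY2010-FY2014
    "SOC_NAME": "SOC_TITLE",                    # FY2015-FY2018
}

def canonicalize(df):
    # Strip the FY2019 worksite suffix (the pipeline keeps only the primary
    # worksite's _1 fields), then rewrite every historical name to the
    # canonical FY2020+ form before any downstream logic runs.
    df.columns = [c.removesuffix("_1") for c in df.columns]
    return df.rename(columns=COLUMN_ALIASES)
```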

Why does Excel destroy SOC occupation codes?

SOC codes like 15-1132 describe the job (software developer in this case). Excel sees 15-1132 and thinks November 15, 1132 AD. It rewrites the cell to that date, stores a number internally, and the original code is gone. Sometimes Excel goes further and converts 15-1132 to the string Jan-15. Sometimes to Feb-32. Sometimes to 1132-01-15. All of those are irrecoverable.

The pipeline now detects 4 Excel damage patterns. Month-name SOC codes get nullified. ISO date SOC codes get nullified. Old DOT codes from pre-2000 filings get nullified (they do not map to modern SOC 1-to-1). 6-digit SOC codes without a dash (151132) get the dash reinserted. About 2 percent of SOC codes across 17 years hit one of these paths. Every rule is a fight that shipped.
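A sketch of those four damage paths, assuming the codes arrive as strings; the exact regexes here are my reconstruction, not the pipeline's.

```python
import re

MONTH = r"(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)"
VALID_SOC = re.compile(r"^\d{2}-\d{4}$")

def clean_soc(raw):
    s = str(raw).strip().upper()
    # Month-name damage ("JAN-15", "FEB-32"): irrecoverable, nullify.
    if re.fullmatch(rf"{MONTH}-\d{{1,2}}|\d{{1,2}}-{MONTH}", s):
        return None
    # ISO-date damage ("1132-01-15"): irrecoverable, nullify.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        return None
    # Six digits without the dash ("151132"): reinsert it.
    if re.fullmatch(r"\d{6}", s):
        s = f"{s[:2]}-{s[2:]}"
    # Anything else that is not a modern SOC code (old DOT codes included)
    # gets nullified rather than guessed at.
    return s if VALID_SOC.match(s) else None
```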

How do you annualize 5 different wage unit formats?

Filers report wages in 5 units: Year, Month, Bi-Weekly, Week, or Hour. To compare salaries, everything has to become annual. Year x1. Month x12. Bi-Weekly x26. Week x52. Hour x2080 (40 hours x 52 weeks). Vectorized pandas does the math in milliseconds per file.
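In pandas that is one map plus one multiplication; the column names follow the canonical FY2020+ schema and are illustrative.

```python
import pandas as pd

# Multiplier from each canonical wage unit to an annual figure.
ANNUAL_MULTIPLIER = {"Year": 1, "Month": 12, "Bi-Weekly": 26,
                     "Week": 52, "Hour": 2080}

def annualize(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized: one lookup per row, one array multiplication, no loops.
    mult = df["WAGE_UNIT_OF_PAY"].map(ANNUAL_MULTIPLIER)
    wage = pd.to_numeric(df["WAGE_RATE_OF_PAY_FROM"], errors="coerce")
    df["salary_annual"] = wage * mult
    return df
```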

Older files abbreviate units differently. FY2008 uses lowercase yr, hr, mth, wk, bi. FY2009 flips to uppercase YR, HR, MTH, WK, BI. Miss the mapping and 1.2 million older rows produce NaN salaries because the multiplier lookup fails silently. The fix is a WAGE_UNIT_NORMALIZE dictionary that runs before any math touches the wage. FY2015 to FY2018 adds a new monster: wages stored as a single string like 66000 - 70000. The pipeline splits the string, coerces both halves to numeric, and falls back to the prevailing wage unit when the wage unit itself is missing.
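A sketch of the normalization table and the range splitter; the abbreviation list comes straight from the eras above, while `split_wage_range` is a hypothetical helper.

```python
import pandas as pd

# Runs before any math: every historical abbreviation becomes canonical.
WAGE_UNIT_NORMALIZE = {
    "yr": "Year", "YR": "Year", "mth": "Month", "MTH": "Month",
    "bi": "Bi-Weekly", "BI": "Bi-Weekly", "wk": "Week", "WK": "Week",
    "hr": "Hour", "HR": "Hour",
}

def split_wage_range(raw: str):
    # FY2015-FY2018 stores the wage as one string like "66000 - 70000".
    lo, _, hi = str(raw).partition("-")
    lo = pd.to_numeric(lo.replace(",", "").strip(), errors="coerce")
    hi = pd.to_numeric(hi.replace(",", "").strip(), errors="coerce")
    # A single value has no dash, so the right half is empty: reuse the left.
    return lo, (hi if pd.notna(hi) else lo)
```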

When do you nullify a dirty value versus drop the row?

A wage of $0 per year is not a legitimate salary. It means the filer skipped the field. Nullify. A prevailing wage of $1,000 per hour is not legitimate either. Someone typed an annual salary in the hourly field. Nullify. A worksite city of SUITE 100 or 123 MAIN ST is not a city. Nullify. A US state sitting in the country field is a different case: GEORGIA appeared 282 times as the country, always on Atlanta and Alpharetta filings. That one gets fixed, not nullified: the state becomes GA and UNITED STATES OF AMERICA moves into the country field.

The rule that took the longest to learn: do not drop the row. Nullify the bad field. About 1 row in 300 has at least one junk value. Dropping whole rows would have lost millions of otherwise useful records. Only rows missing the primary key (case_number) get dropped entirely. Everything else survives with nulls where needed.
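A condensed sketch of those rules; the column names and the placement of the $1,000/hour threshold are illustrative, taken from the examples above.

```python
import numpy as np

def scrub(df):
    # $0 per year is a skipped field, not a salary: nullify.
    df.loc[df["salary_min"] == 0, "salary_min"] = np.nan
    # An "hourly" prevailing wage over $1,000 is an annual figure typed
    # into the wrong field: the value is unusable, nullify.
    hourly = df["pw_unit"] == "Hour"
    df.loc[hourly & (df["prevailing_wage"] > 1000), "prevailing_wage"] = np.nan
    # A "city" that starts with a street number or SUITE is an address.
    junk = df["worksite_city"].str.match(r"\d+ |SUITE\b", na=False)
    df.loc[junk, "worksite_city"] = None
    # The only hard drop: rows that cannot be joined on the primary key.
    return df.dropna(subset=["case_number"])
```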

How do you verify an ETL pipeline output independently?

The ETL passes its own tests. That proves nothing. So a separate QA script was built, sharing zero logic with the ETL. It opens the final SQLite database as a black box and verifies every row against reference values: all 50 US states plus territories, valid case statuses (Certified, Certified - Withdrawn, Denied, Withdrawn), plausible salary ranges, SOC code format, date bounds. The ETL and QA cannot lie to each other because they never speak.
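One of the black-box checks might look like this, assuming a `filings` table with a `case_status` column; those names are illustrative, and the QA script imports nothing from the ETL.

```python
import sqlite3

VALID_STATUSES = {"Certified", "Certified - Withdrawn", "Denied", "Withdrawn"}

def check_statuses(db_path: str) -> None:
    # Open the finished database cold: no ETL imports, no shared constants.
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT DISTINCT case_status FROM filings").fetchall()
    con.close()
    unexpected = {status for (status,) in rows} - VALID_STATUSES
    assert not unexpected, f"unknown case statuses: {unexpected}"
```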

Then came verify_computed.py. Every derived column (salary_min, salary_max, prevailing_wage_annual, employer_name_clean) gets recomputed independently from the raw fields and compared to what the ETL wrote. A mismatch means one of the two implementations has a bug. Finally, compare_dbs.py runs the full ETL twice from scratch and diffs every row: same inputs, same outputs, zero drift, or the ETL is non-deterministic. That trio (QA, verify_computed, compare_dbs) is the quality contract.
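The recompute check for one derived column might look like this sketch; the raw column names are assumptions, and the multiplier table is deliberately retyped rather than imported from the ETL.

```python
import pandas as pd

# Retyped on purpose: sharing the ETL's table would let a bug hide in both.
MULT = {"Year": 1, "Month": 12, "Bi-Weekly": 26, "Week": 52, "Hour": 2080}

def verify_salary_min(df: pd.DataFrame) -> None:
    # Recompute the derived column from raw fields, sharing zero logic.
    expected = (pd.to_numeric(df["wage_from"], errors="coerce")
                * df["wage_unit"].map(MULT))
    # Compare only where the recomputation produced a value.
    bad = expected.notna() & (expected != df["salary_min"])
    assert not bad.any(), f"{bad.sum()} rows disagree on salary_min"
```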

What does a clean 9 million row SQLite database look like?

The output is one SQLite file. 9 million rows. 37 columns. 7 indexes covering the common query patterns (employer search, SOC code, state, fiscal year, salary range, status, job title). An FTS5 virtual table gives sub-millisecond search on employer name, job title, and city. Total size around 4GB. The ETL runs in 30 to 60 minutes on a laptop for the full history, or 2 minutes for a quarterly update. The next arc is already starting: FY2025 and FY2026 Q1 data is landing, and the column audit resets.
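The search layer is plain SQLite DDL driven from Python; the file, table, and column names here are illustrative, and the FTS5 table is built as an external-content index so the text is not stored twice.

```python
import sqlite3

con = sqlite3.connect("h1b.db")
# Two of the seven indexes, as illustration: one per common query pattern.
con.execute("CREATE INDEX IF NOT EXISTS idx_employer ON filings(employer_name)")
con.execute("CREATE INDEX IF NOT EXISTS idx_year_state "
            "ON filings(fiscal_year, worksite_state)")
# External-content FTS5 table: full-text search over three fields.
con.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS filings_fts USING fts5(
    employer_name, job_title, worksite_city,
    content='filings', content_rowid='rowid'
)""")
# Populate the search index from the content table in one pass.
con.execute("INSERT INTO filings_fts(filings_fts) VALUES ('rebuild')")
con.commit()
```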

Frequently asked questions

Why SQLite and not PostgreSQL?

Read-heavy workload, single writer, no concurrent write contention. SQLite handles 9 million rows comfortably with proper indexes. No server process, no connection pool, no network hop. The whole database is one file you can ship, version, or swap in seconds. For this specific shape of workload it beats Postgres on latency and costs nothing to host.

How long does the full ETL take?

30 to 60 minutes on a 2024 MacBook Pro for 17 years of data. The largest files (FY2019 with 260 columns, FY2024 with full quarters) dominate the runtime. Quarterly updates add about 300k rows in under 2 minutes because they only touch one file. The real optimization levers are column-scoped Excel reads, vectorized pandas operations, and batched SQLite inserts.
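The batched-insert lever is the standard sqlite3 pattern: one transaction, executemany in chunks. A sketch, with a hypothetical three-column table:

```python
import sqlite3

def insert_batched(con: sqlite3.Connection, rows: list[tuple],
                   batch: int = 50_000) -> None:
    # One transaction around everything: per-row commits are the slow path.
    with con:  # commits on success, rolls back on error
        for i in range(0, len(rows), batch):
            con.executemany(
                "INSERT INTO filings (case_number, employer_name, salary_min) "
                "VALUES (?, ?, ?)",
                rows[i:i + batch],
            )
```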

What is the hardest part of this kind of pipeline?

Knowing when to nullify versus when to fix versus when to drop. Dropping a row is always the safest choice for correctness, but it silently throws away otherwise usable records. Fixing is risky because every fix embeds a judgment. Nullifying preserves the row while marking the field as unusable. The right default is almost always nullify, with drops reserved for rows that cannot be joined by the primary key.

Can this approach be reused for other government data?

Yes. USPTO patent data, SEC EDGAR filings, Census ACS extracts, EPA emissions records all have the same shape: multi-era schema drift, partial column renames, placeholder values, Excel damage. The ingredients are always a column alias dictionary, unit normalization tables, pattern-based junk detection, and a separate QA step that validates the output as a black box.

Need help with your website?

I build fast, SEO-ready websites for businesses serious about their online presence.