The Federal Election Commission publishes data on a rolling schedule: itemized contributions, committee filings, independent expenditures, candidate summaries, and a dozen peripheral tables. It’s all public. It’s all available for bulk download. Building a usable donor-intelligence product on top of it is not that hard, if you do it carefully. Done carelessly, it produces wrong numbers with a lot of confidence.
Here’s how our pipeline works, and why it rebuilds from scratch every week instead of patching.
The raw inputs
The FEC publishes roughly these datasets we care about:
- Itemized individual contributions (Schedule A of Forms 3, 3X, and 3P) — the big one. Every federal contribution aggregating over $200, and most smaller ones, from 1975 onward.
- Committee master — every registered federal committee, its type, treasurer, address, status.
- Candidate master — every registered federal candidate, office sought, party, status.
- Committee-to-candidate linkages — which committees support which candidates.
- Independent expenditures — spending for/against candidates that doesn’t go through the candidate’s committee.
- Transfers between committees — money movement.
- Memo entries — notational filings that clarify other records.
The full itemized contribution table is about 400 million records at this point, depending on cycle aggregation. That sounds big; it isn’t really. A modern columnar database processes it in single-digit seconds for almost any useful query. The complexity is not storage. The complexity is that the data, as it arrives, is dirty.
Why the data is dirty
The FEC requires committees to file contribution data accurately, but “accurately” means “the data the committee captured from the donor.” Which is typically a handwritten check stub or a web form. Which means:
- Donor names are spelled inconsistently across contributions. “Bob Smith” and “Robert Smith” and “Robt. Smith” and “Smith, Robert J.” may all be the same person.
- Addresses drift. A donor gives from their summer home in July and their primary residence in October. Neither is “wrong”; they’re both real. But naive deduplication treats them as two people.
- Employers and occupations are self-reported free text. “Goldman Sachs” appears as “Goldman,” “Goldman Sachs Group,” “GS,” “Goldman Sachs & Co,” and a dozen other variations.
- PAC and committee names change mid-cycle as organizations rebrand.
- Date fields occasionally have impossible values (2202 instead of 2022) that passed schema validation but failed reality.
- Amendments supersede earlier filings. A committee files, then files an amendment, then files an amendment to the amendment. The pipeline needs to know which record is current.
None of this is fatal. All of it has to be handled, and handled the same way every time, or downstream donor models lose their ground truth.
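To make the name problem concrete, here is a minimal sketch of the kind of normalization a resolver starts from. The alias table and helper name are illustrative, not our production code:

```python
import re

# Illustrative nickname/abbreviation map; the real table is much larger.
NAME_ALIASES = {"robt": "robert", "bob": "robert", "wm": "william"}

def normalize_donor_name(raw: str) -> tuple[str, ...]:
    """Reduce a reported donor name to comparable lowercase tokens."""
    name = raw.lower()
    # "Smith, Robert J." -> "robert j. smith"
    if "," in name:
        last, _, rest = name.partition(",")
        name = f"{rest} {last}"
    tokens = re.findall(r"[a-z]+", name)        # drop punctuation
    tokens = [NAME_ALIASES.get(t, t) for t in tokens]
    tokens = [t for t in tokens if len(t) > 1]  # drop single-letter initials
    return tuple(sorted(tokens))

# All four variants collapse to the same key:
variants = ["Bob Smith", "Robert Smith", "Robt. Smith", "Smith, Robert J."]
keys = {normalize_donor_name(v) for v in variants}
```

Normalization alone doesn't resolve entities — two different Robert Smiths still collide — which is why the pipeline layers probabilistic linkage on top in step 4.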
The pipeline, step by step
Our FEC ingest runs weekly, on Sunday night into Monday morning, and operates on the full historical dataset from 1985 onward. Not a delta. The full thing.
Step 1: Fetch. We pull the latest bulk files directly from fec.gov. We also pull the daily filings feed, which covers the most recent seven days at the highest granularity. Full files for the period from 1985 through the most recent completed quarter; the daily feed for the current quarter.
Step 2: Parse and validate. Every record goes through schema validation (correct column count, parseable dates, valid committee IDs). Records that fail validation go to a quarantine table with the specific failure reason. We typically see ~0.3% of records in quarantine — most are date-field glitches from older filings.
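A sketch of what that validation gate looks like. The column width, field positions, and date format here are illustrative assumptions about the bulk-file layout, not a spec:

```python
import datetime

EXPECTED_COLUMNS = 21  # assumed width for the itemized-contribution file

def validate_row(fields: list[str]) -> str | None:
    """Return None if the row passes, else the quarantine reason."""
    if len(fields) != EXPECTED_COLUMNS:
        return "bad_column_count"
    cmte_id = fields[0]
    # Committee IDs look like "C00123456": a "C" plus eight digits.
    if not (len(cmte_id) == 9 and cmte_id[0] == "C" and cmte_id[1:].isdigit()):
        return "invalid_committee_id"
    try:
        dt = datetime.datetime.strptime(fields[13], "%m%d%Y")
    except ValueError:
        return "unparseable_date"
    if not (1975 <= dt.year <= datetime.date.today().year + 1):
        return "implausible_date"  # catches 2202-style typos
    return None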
Step 3: Amendment resolution. Each committee filing has an ID, a filing date, and a sequence number. We walk the sequence and keep only the current version of each reported transaction. Superseded versions are preserved in the audit table, not the working table.
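The core of the walk is a keep-the-latest reduction, sketched here with invented field names:

```python
def resolve_amendments(filings):
    """Keep only the highest-sequence version of each reported transaction.

    `filings` is an iterable of dicts with (illustrative) keys:
    transaction_id, sequence, plus the record payload.
    """
    current = {}
    for f in filings:
        tid = f["transaction_id"]
        if tid not in current or f["sequence"] > current[tid]["sequence"]:
            current[tid] = f
    return list(current.values())
```

The superseded versions this drops from the working set are exactly what lands in the audit table.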
Step 4: Entity resolution. This is the work.
For donor records, we use probabilistic record linkage: a feature vector for each record (normalized name, address, occupation, employer, ZIP, cycle), and a learned similarity function that produces a match probability between any two records. Records above a high threshold cluster into an entity. Records above a lower threshold are flagged for human review.
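In schematic form — the similarity function below is a toy character-level stand-in for our learned model, and the thresholds and field weights are illustrative:

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.92   # illustrative; the real cutoffs are tuned
REVIEW_THRESHOLD = 0.75

FIELD_WEIGHTS = {"name": 0.5, "zip": 0.2, "employer": 0.2, "occupation": 0.1}

def similarity(a: dict, b: dict) -> float:
    """Weighted string similarity across the comparison fields."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        # Note: difflib scores two empty strings as identical (1.0),
        # so missing fields need care in a real resolver.
        ratio = SequenceMatcher(None, a.get(field, ""), b.get(field, "")).ratio()
        score += weight * ratio
    return score

def classify_pair(a: dict, b: dict) -> str:
    s = similarity(a, b)
    if s >= MATCH_THRESHOLD:
        return "same_entity"
    if s >= REVIEW_THRESHOLD:
        return "human_review"
    return "distinct"
```

The two-threshold design is the point: the gap between them is the band where automation stops and human review starts.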
For committees, resolution is mostly exact (committee IDs are stable), but we also carry a committee-alias table for known rebrands and successor organizations.
For employers, we use a normalized-employer dictionary that is about 60% automated and 40% hand-maintained. We care about this because employer-level giving patterns are real signal.
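The employer dictionary is, at heart, a lookup keyed on a lightly normalized string. A tiny, illustrative slice of it might look like:

```python
import re

# A slice of the normalized-employer dictionary; the real one mixes
# automated clustering with hand-maintained entries.
EMPLOYER_CANON = {
    "goldman": "goldman sachs",
    "goldman sachs group": "goldman sachs",
    "goldman sachs & co": "goldman sachs",
    "gs": "goldman sachs",
}

def normalize_employer(raw: str) -> str:
    key = re.sub(r"[^a-z0-9& ]", "", raw.lower()).strip()
    key = re.sub(r"\s+", " ", key)
    return EMPLOYER_CANON.get(key, key)
```

Unknown strings pass through in normalized form, which is what the automated 60% works on; the hand-maintained 40% is entries like these.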
Step 5: Feature derivation. For each resolved donor entity we compute a set of derived features: total giving history, count of cycles active, median gift, max gift, partisan mix, geographic centroid, industry breakdown. These feed the donor-propensity model.
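A sketch of the derivation for a handful of those features — field names are invented, and the single party-share number stands in for the full partisan mix:

```python
from statistics import median

def derive_features(contributions):
    """Compute per-entity features from a resolved donor's contributions.

    Each contribution is a dict with (illustrative) keys:
    amount, cycle, party.
    """
    amounts = [c["amount"] for c in contributions]
    dem = sum(c["amount"] for c in contributions if c["party"] == "DEM")
    return {
        "total_giving": sum(amounts),
        "cycles_active": len({c["cycle"] for c in contributions}),
        "median_gift": median(amounts),
        "max_gift": max(amounts),
        "dem_share": dem / sum(amounts),  # one slice of the partisan mix
    }
```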
Step 6: Publication. The cleaned, resolved, feature-enriched tables replace the previous week’s tables atomically. Queries against the old tables continue to work until the transaction commits; the switchover is invisible to users.
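Our switchover happens inside the database, but the idea is the same as an atomic rename — sketched here at the filesystem level:

```python
import os
import tempfile

def publish(build_dir: str, live_link: str) -> None:
    """Atomically point `live_link` at the freshly built tables.

    Readers holding the old path keep working; new readers see the
    new build the instant the rename lands. (A sketch of the idea,
    not our actual database switchover.)
    """
    tmp = live_link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(build_dir, tmp)
    os.replace(tmp, live_link)  # rename(2) is atomic on POSIX
```

The property that matters is that there is never a moment when `live_link` points at nothing or at a half-built table.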
Why weekly, and why full rebuild
Two questions people ask:
Why not daily? The FEC daily feed is real but incomplete. Amendments, corrections, and late-filed contributions lag. Running the pipeline daily would force us to either republish inconsistent data every 24 hours or build a complex delta-reconciliation layer. Weekly is the natural cadence of FEC data and matches how campaigns actually use it (nobody needs a donor list that updates at 3am Tuesday).
Why full rebuild instead of incremental? Entity resolution is path-dependent. If a new record arrives that would have caused an older cluster to split or merge, incremental pipelines produce slightly different answers over time than a from-scratch rebuild. The difference is small in absolute terms but consequential for model training — you want the same input to produce the same output, always.
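A toy illustration of that path dependence, assuming a pairwise `linked` predicate: record C matches both A and B, but A and B don't match each other. A full rebuild merges all three into one cluster; an incremental pass that only attaches new records to existing clusters keeps A and B apart forever.

```python
def rebuild_clusters(records, linked):
    """Connected components over the match graph — order-independent."""
    clusters = []
    for r in records:
        hits = [c for c in clusters if any(linked(r, m) for m in c)]
        merged = {r}
        for c in hits:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

def incremental_add(clusters, record, linked):
    """Attach the new record to the first matching cluster; never re-merge."""
    for c in clusters:
        if any(linked(record, m) for m in c):
            c.add(record)
            return
    clusters.append({record})

links = {frozenset(p) for p in [("A", "C"), ("B", "C")]}
linked = lambda x, y: frozenset((x, y)) in links

full = rebuild_clusters(["A", "B", "C"], linked)  # one cluster: {A, B, C}
incr = rebuild_clusters(["A", "B"], linked)
incremental_add(incr, "C", linked)                # still two clusters
```

Same records, same match function, different answers — which is exactly the inconsistency the weekly rebuild avoids.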
Full rebuild is cheap enough to do weekly (runs in under two hours on our current infrastructure). The cost of that compute is a rounding error compared to the value of consistency.
State data is different
State-level donor data, which we’ve started rolling out with Nevada, is structurally different. State publishers vary enormously in:
- Format: Some publish clean CSVs. Some publish PDFs. A few publish data only through search interfaces that require scraping.
- Cadence: Some update weekly. Some update on arbitrary schedules tied to filing deadlines.
- Schema: Each state has its own field list, its own conventions, its own committee taxonomy.
- Quality: Per-state data-cleaning effort varies by a factor of ten.
We handle each state with a state-specific adapter that maps into our unified schema. The adapter handles ingestion, validation, and transformation. Downstream of the adapter, everything is the same — the same entity resolution, the same feature derivation, the same publication step.
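Structurally, an adapter is just a small contract. This sketch uses invented field names and a stand-in raw row, not Nevada's real schema:

```python
from abc import ABC, abstractmethod
from typing import Iterable

# Hypothetical unified record; the real schema has many more fields.
UnifiedRecord = dict

class StateAdapter(ABC):
    """Per-state ingest: fetch, validate, and map into the unified schema.

    Everything downstream (entity resolution, feature derivation,
    publication) only ever sees UnifiedRecord rows.
    """

    state: str

    @abstractmethod
    def fetch_raw(self) -> Iterable[dict]: ...

    @abstractmethod
    def to_unified(self, raw: dict) -> UnifiedRecord: ...

    def run(self) -> list[UnifiedRecord]:
        return [self.to_unified(r) for r in self.fetch_raw()]

class NevadaAdapter(StateAdapter):
    state = "NV"

    def fetch_raw(self):
        # Stand-in for the real download; the raw field names are invented.
        yield {"ContributorName": "SMITH, ROBERT", "Amount": "250.00"}

    def to_unified(self, raw):
        return {
            "state": self.state,
            "donor_name": raw["ContributorName"].title(),
            "amount_cents": int(float(raw["Amount"]) * 100),
        }
```

Adding a state means writing one adapter subclass; nothing downstream changes.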
Rolling out states one at a time is slow and deliberate on purpose. A state we’ve released is a state where the data is trustworthy for donor modeling, not a state where we’ve done a rough CSV import and hoped for the best.
What this means for users
If you query FEC data in Civitas on Tuesday morning, you’re looking at data that was refreshed Sunday night. The freshest individual contributions in the system are probably from the previous Friday (there’s a lag in FEC’s own reporting pipeline of a few days).
For state data, Nevada is on a similar Sunday-refresh schedule. Other states will be documented individually as they release.
This is all public methodology and will stay that way. If you want to dig into specific entity-resolution calls, the audit trail is queryable from any Max or Enterprise workspace. If you find a resolution you think is wrong, tell us — we fix specific cases and retrain the resolver on a rolling basis.
Boring pipelines are the unsexy underside of every data product. They also determine whether the numbers you’re looking at are real. Ours is boring on purpose.