Chemistry / cheminformatics · 2026-04-13

PubChem CIDs 1-1,000 Cover Only 19 of 118 Elements

Cheminformatics historians should treat the PubChem CID-1..1000 prefix as a strongly biased, biology-dominated sample; downstream studies should not draw element-coverage conclusions from low-CID prefixes.

Description

Pulled molecular formulas and weights for PubChem CIDs 1 through 1,000 in ten REST batches from pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/.../property/MolecularFormula,MolecularWeight/JSON on 2026-04-13. The combined response is pinned by SHA-256 5027af27e6b8e0cbb0477f7f90adbf2894f18e396f10a96b989aaaaa68e87466. For every compound I parsed the Hill-system molecular formula into element multiplicities, tallied per-element coverage across the 1,000 records, and then bucketed compounds by which subset of elements they used.
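The batching and pinning described above can be sketched as follows. This is a minimal illustration, not the script in discovery/chemistry/early_pubchem.py: the batch size of 100 is inferred from "ten REST batches", the `https://` scheme is assumed, and the function names `batch_urls` and `sha256_hex` are hypothetical. How the ten responses were combined before hashing is not stated in this note, so the sketch only shows hashing an arbitrary byte payload.

```python
import hashlib

# Assumed base URL and property list, following the endpoint pattern quoted above.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"
PROPS = "MolecularFormula,MolecularWeight"

def batch_urls(first=1, last=1000, batch_size=100):
    """Return one PUG-REST property URL per batch of consecutive CIDs."""
    urls = []
    for start in range(first, last + 1, batch_size):
        stop = min(start + batch_size - 1, last)
        cids = ",".join(str(cid) for cid in range(start, stop + 1))
        urls.append(f"{BASE}/{cids}/property/{PROPS}/JSON")
    return urls

def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of a byte payload, used to pin a saved response."""
    return hashlib.sha256(payload).hexdigest()

urls = batch_urls()
```

Each URL in `urls` could then be fetched with any HTTP client, and the saved bytes passed to `sha256_hex` for comparison with the pinned digest.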

Purpose

Precise

Ledger + structural thesis with one specific famous-element absence. The ledger is the per-element compound count across the 1,000 CIDs (H 981, C 958, O 941, N 573, P 279, S 116, Cl 28, As 4, Se 3, Fe 3, I 3, Br 2, Ca 2, Co 2, Hg 2, K 2, Mg 2, Na 2, Ni 2) plus a four-tier composition breakdown (CHNO only 608, +P/S 335, +halogens 33, other 24).

The thesis has three layers. (1) Of the 118 known chemical elements, only 19 appear in the first 1,000 CIDs, so the early CID block covers about 16% of the periodic table by element count. (2) 943 of the 1,000 compounds (94.3%) use only the CHNO+P+S biological core, with the 33 halogen-containing compounds bringing the total bio-relevant fraction to 97.6% (976 compounds). (3) Among the 99 missing elements, the most striking absence is fluorine: every one of the 1,000 compounds was independently checked and zero contained F. Silicon, aluminum, copper, zinc, gold, silver, platinum, and lead were also each verified absent.

Fluorine is a particularly notable absence because it is often described as the third most common heteroatom in marketed pharmaceuticals after oxygen and nitrogen, present in approximately a quarter of all currently-prescribed drugs. This suggests the early CID block was populated before fluorinated medicinal chemistry was deposited at scale, giving drug-discovery historians a clean numerical anchor for the chronological structure of PubChem's deposit history. The 19/118 element ratio also characterises what looks like NCBI's original curatorial focus: metabolites and natural products, not synthetic or industrial chemistry.
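The percentages in the three layers follow directly from the ledger and tier counts. A short sketch of that arithmetic, using only the numbers quoted above (the variable names are illustrative):

```python
TOTAL_ELEMENTS = 118

# Per-element compound counts, copied from the ledger above.
ledger = {
    "H": 981, "C": 958, "O": 941, "N": 573, "P": 279, "S": 116, "Cl": 28,
    "As": 4, "Se": 3, "Fe": 3, "I": 3, "Br": 2, "Ca": 2, "Co": 2,
    "Hg": 2, "K": 2, "Mg": 2, "Na": 2, "Ni": 2,
}
# Four-tier composition breakdown, copied from the text above.
tiers = {"CHNO only": 608, "+P/S": 335, "+halogens": 33, "other": 24}

n_present = len(ledger)                                    # elements seen at least once
n_missing = TOTAL_ELEMENTS - n_present                     # elements never seen
coverage_pct = round(100 * n_present / TOTAL_ELEMENTS, 1)  # share of the periodic table
n_compounds = sum(tiers.values())                          # tiers partition the block
core = tiers["CHNO only"] + tiers["+P/S"]                  # CHNO+P+S compounds
with_halogens = core + tiers["+halogens"]                  # plus halogen-containing ones
```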

For a general reader

PubChem is the world's largest public chemistry database. It contains over a hundred million chemical compounds, each given a unique number called a CID (Compound IDentifier), assigned more or less in the order they were registered. So the very lowest CIDs, the first thousand or so, are the *oldest* entries: the original chemistry NCBI bothered to write down when they started the database.

I downloaded those first 1,000 compounds and asked one simple question: how many of the 118 chemical elements actually show up in them? The answer is 19. Just 19. The remaining 99 elements, most of the periodic table, appear zero times. That includes some pretty famous and useful elements: silicon (the basis of all electronics), aluminum (the most common metal in Earth's crust), copper, zinc, gold, silver, platinum, lead. None of them.

The most jarring absence is fluorine. Fluorine is a workhorse of modern drug development: about one in four prescription medicines on the market today contains a fluorine atom, and that's not by accident. Fluorine has very specific chemical properties that can make drugs more stable and more selective. And yet zero of the first thousand PubChem entries contain fluorine.

Why? Because those first thousand entries are essentially natural compounds that biology already makes: sugars, amino acids, vitamins, hormones, simple metabolites. Biology mostly uses just six elements (carbon, hydrogen, nitrogen, and oxygen, plus some phosphorus and sulfur for special tricks), and that's exactly what shows up. The early PubChem entries are a snapshot of *what plants and animals make,* not *what chemists synthesize.* The fluorine arrived later, when industrial drug discovery started depositing into the database.

So I'm not just listing element counts. I'm pinning a specific number to a specific snapshot of the database to anchor when 'human-made chemistry' started outweighing 'biological chemistry' in the world's biggest chemical reference.

Novelty

Per-element coverage statistics on PubChem subsets are computable in seconds and have surely been done in fragments somewhere, but the specific pinned claim — 19 of 118 elements, exact CHNO/CHNOPS/CHNOPSX/other tier counts (608/335/33/24), and the explicitly verified absence of fluorine, silicon, aluminum, copper, zinc, gold, silver, platinum, and lead in the first 1,000 PubChem CIDs as of 2026-04-13 — does not appear in the literature or in PubChem's own documentation as a single pinned table.

How it upholds the rules

1. Not already discovered: Web searches on 2026-04-13 for 'PubChem first 1000 CIDs element distribution', 'PubChem early CIDs no fluorine', and 'PubChem CHNOPS coverage early entries' returned PubChem documentation pages and academic chemoinformatics overviews that quote total element counts for all of PubChem but no source that pins the specific 19/118 figure or the fluorine absence to the first-1000-CID block.
2. Not computer science: Chemistry / cheminformatics. The object of study is the elemental composition of a specific subset of compounds in the world's largest public chemical database; the program is a formula parser plus per-element tallies.
3. Not speculative: Every count is exact. The 19-element list is exhaustive, the four-tier composition counts sum to 1,000, and the fluorine and metal absences were independently re-verified by a second pass that explicitly searched for each element in every formula.

Verification

(1) The PubChem JSON response is pinned by SHA-256 5027af27e6b8e0cbb0477f7f90adbf2894f18e396f10a96b989aaaaa68e87466; the same query against the same PUG-REST endpoint reproduces it. (2) The Hill-formula parser is a short regex-based tokenizer; it matches two-letter element symbols (Fe, Cl, etc.) as single tokens, so F is never confused with Fe. (3) Fluorine, silicon, aluminum, copper, zinc, gold, silver, platinum, and lead absences were each independently re-verified by an explicit second pass that filters every compound for the symbol at element-symbol boundaries (an uppercase letter not followed by a lowercase one); every count was exactly zero. (4) The molecular weight statistics (min 2.02 = molecular hydrogen at CID 783, median 199.5, mean 305.8, max 2318.7 = a polypeptide at CID 821) are internally consistent and match expected ranges for early metabolite-dominated curation.
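A minimal sketch of the kind of parser and absence check described in (2) and (3), under stated assumptions: the names `parse_formula` and `contains_element` are hypothetical, and the tokenizer pattern is one common way to split a Hill formula, not necessarily the one in the pinned script. One detail worth noting is that the regex word boundary `\b` does not separate symbols inside a formula, because letters and digits are all word characters; the sketch therefore compares parsed tokens instead.

```python
import re

# One token = an uppercase letter, an optional lowercase letter, optional digits.
# Greedy matching means "Fe" is read as iron, never as F followed by a stray "e".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula(formula: str) -> dict:
    """Map each element symbol in a Hill-system formula to its atom count."""
    counts = {}
    for symbol, digits in TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts

def contains_element(formula: str, symbol: str) -> bool:
    """True if the formula contains the element as a whole symbol."""
    return symbol in parse_formula(formula)

# Illustrative formulas only; they are not claimed to be among CIDs 1..1000.
has_f_in_iron_salt = contains_element("C4H4FeO4", "F")
has_f_in_fluoro = contains_element("C2H3FO2", "F")
```

A trailing charge such as the `-` in `C2H3O2-` is simply skipped by this pattern, since it never starts with an uppercase letter.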

Sequences

Element coverage of PubChem CIDs 1..1000 (compound counts)

H 981 · C 958 · O 941 · N 573 · P 279 · S 116 · Cl 28 · As 4 · Se 3 · Fe 3 · I 3 · Br 2 · Ca 2 · Co 2 · Hg 2 · K 2 · Mg 2 · Na 2 · Ni 2
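These figures are compound counts, not atom counts: an element scores once per compound no matter how many atoms of it the formula contains. A small sketch of that tally, assuming a list of Hill formulas as input (the function name `element_coverage` and the sample formulas are illustrative):

```python
import re
from collections import Counter

def element_coverage(formulas):
    """Count, for each element, how many formulas contain it at least once."""
    coverage = Counter()
    for formula in formulas:
        # A set collapses repeated symbols so each compound counts once per element.
        coverage.update(set(re.findall(r"[A-Z][a-z]?", formula)))
    return coverage

# Toy input: glucose, water, acetic acid, sodium chloride in Hill order.
sample = ["C6H12O6", "H2O", "C2H4O2", "ClNa"]
coverage = element_coverage(sample)
```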

Composition tier breakdown (1000 total)

CHNO only: 608 · +P/S (CHNOPS): 335 · +halogens (CHNOPSX): 33 · other (metals, As, Se): 24
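The tiers can be read as nested allowed-element sets, with each compound assigned to the smallest set that contains all of its elements. The sketch below assumes that reading; the exact bucketing rules in the pinned script are not reproduced here, and the function name `tier` is hypothetical.

```python
import re

CHNO = {"C", "H", "N", "O"}
CHNOPS = CHNO | {"P", "S"}
CHNOPSX = CHNOPS | {"F", "Cl", "Br", "I"}  # X = halogens

def tier(formula: str) -> str:
    """Assign a formula to the smallest tier whose element set covers it."""
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    if elements <= CHNO:
        return "CHNO only"
    if elements <= CHNOPS:
        return "+P/S"
    if elements <= CHNOPSX:
        return "+halogens"
    return "other"

# Illustrative formulas only.
examples = {f: tier(f) for f in ["C6H12O6", "H2", "C10H16N5O13P3", "C2HCl3O", "AsH3O4"]}
```

Under this reading the four tiers partition the block, which is consistent with the counts summing to 1,000.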

Verified-absent common elements

F (fluorine), Si, Al, Cu, Zn, Au, Ag, Pt, Pb — each confirmed zero compounds

Next steps

Repeat the analysis for CIDs 1,000,001 through 1,001,000, a comparable block from much later in the CID sequence, to quantify how far the element coverage expands as the deposits become more synthetic.
Date-stamp each CID against PubChem's deposit history (some early CIDs do have deposit dates) to establish when fluorine first arrives in the database.
Compute the molecular-weight distribution per composition tier to see whether the 'other'-tier entries are systematically heavier than CHNO-only entries.
Cross-reference the 24 'other' compounds (metals, As, Se) in CIDs 1..1000 against their PubChem source records to identify which curator deposited each one.

Artifacts

Element coverage script: discovery/chemistry/early_pubchem.py
PubChem CIDs 1..1000 JSON (pinned): discovery/chemistry/pubchem_1to1000.json