Wikidata Mineral Coverage Is Oxygen-Heavy and Lanthanide-Sparse
Mineralogy database curators should target the lanthanide gap as the single highest-leverage Wikidata enrichment area; oxygen-bearing minerals, by contrast, are already well covered at 81.0 % of catalogued formulas.
Description
I queried the public Wikidata SPARQL endpoint for items of class 'mineral species' (Q12089225), with optional chemical formula (P274) and discovery date (P575). On 2026-04-13 the response returned 6,284 mineral species rows, of which 5,662 carry a chemical formula. The response JSON is pinned by SHA-256 50c83d4ec0f455205ec7cbaf190f0b2f03c62e66dcef158ef617e09aa0cfb9f3. I parsed each formula with a Hill-system regex, tallied per-element coverage, and ranked elements by the number of distinct minerals in which they appear.
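The fetch step can be sketched as follows. This is a minimal illustration, not the query actually used: the class and property IDs come from the text above, but the `wdt:P31` (instance of) pattern, the variable names, and the helper names `fetch`, `rows_from_response`, and `formulas_only` are assumptions.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: class membership is expressed with wdt:P31 (instance of).
# The original query may have used a different pattern.
MINERAL_QUERY = """
SELECT ?mineral ?formula ?discovered WHERE {
  ?mineral wdt:P31 wd:Q12089225 .
  OPTIONAL { ?mineral wdt:P274 ?formula . }
  OPTIONAL { ?mineral wdt:P575 ?discovered . }
}
"""


def fetch(query=MINERAL_QUERY, user_agent="mineral-coverage-sketch/0.1 (hypothetical)"):
    """Return the raw response bytes so they can be hashed before parsing."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def rows_from_response(payload):
    """Flatten a SPARQL JSON results document into one dict per result row."""
    document = json.loads(payload)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


def formulas_only(rows):
    """Keep the formula strings of rows that carry one (OPTIONAL may leave it unbound)."""
    return [row["formula"] for row in rows if row.get("formula")]
```

Returning raw bytes from `fetch` keeps hashing and parsing separate, so the digest is computed over exactly what the endpoint sent.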
Purpose
Ledger plus structural data-quality observation. The ledger is the per-element mineral count for the 72 elements that appear in any Wikidata mineral formula. The headline is the dominance of oxygen: 4,588 of 5,662 formulas (81.0 %) contain it, followed by hydrogen (55.4 %), silicon (28.2 %), calcium (24.7 %), then iron, sulfur, aluminum, sodium, magnesium, and copper. The typical Wikidata mineral therefore contains oxygen, more often than not contains hydrogen, and in more than a quarter of cases contains silicon. The structural observation is the 46-element absence list: noble gases (He, Ne, Ar, Kr, Xe, Rn; chemically inert), naturally short-lived elements (Tc, Pm, Po, At, Fr, Ra, Ac, Pa, Np, Pu, Am, Cm, ...), all transactinides (Rf through Og), and six naturally occurring lanthanides: europium, terbium, holmium, erbium, thulium, and lutetium. The lanthanide absences are physically wrong: these elements do exist in real mineral deposits (the rare-earth ore monazite, for example, contains all of them in trace amounts). They are missing from Wikidata's mineral formulas because mineralogical convention writes rare-earth content as 'REE' or '(REE)' rather than enumerating individual lanthanides. The absent-lanthanide list is therefore a clean fingerprint of the catalog's curation rule rather than a fact about Earth's chemistry. Pinning the 81 % oxygen prevalence and the 6-lanthanide hole gives mineralogists and data curators a snapshot-pinned reference for the Wikidata mineral catalog's coverage as of 2026-04-13.
Earth has about 6,000 known minerals: silicates, oxides, sulfides, salts, and so on. Each mineral has a chemical formula such as 'CaCO₃' (calcite) or 'SiO₂' (quartz). I downloaded the chemical formula of every mineral species in Wikidata that has one (over 5,600 of them) and asked: which chemical elements appear in the most minerals? The answer is lopsided. Oxygen is in 81 % of all formulas, so just over four out of every five catalogued minerals contain oxygen. Hydrogen follows at 55 %, silicon at 28 %, calcium at 25 %, and then iron, sulfur, aluminum, sodium, magnesium, and copper. So the typical catalogued mineral contains oxygen, more often than not contains hydrogen (as water or hydroxyl), and in more than a quarter of cases contains silicon. None of that is surprising to a geologist.
The surprising part is the absence list. Of the 118 elements on the periodic table, only 72 appear in any Wikidata mineral formula; the other 46 are absent. Some absences are expected. The noble gases, such as helium, neon, and argon, do not bond chemically and almost never form minerals. Short-lived radioactive elements such as technetium and promethium, together with the elements beyond uranium, decay too quickly to accumulate in natural minerals.
But six elements I expected to find are missing too: europium, terbium, holmium, erbium, thulium, and lutetium. These are rare-earth elements, and they do exist in real ore deposits; monazite, for instance, contains all of them in small amounts. They are missing from the database because mineralogists conventionally write 'REE' (rare-earth element) in mineral formulas instead of writing out each individual lanthanide. So when I count which elements appear in mineral formulas, those six lanthanides come up as zero, not because they are absent from real minerals, but because the catalog's convention hides them. Two findings in one: oxygen is nearly everywhere, and the Wikidata catalog carries a fingerprint of how mineralogists abbreviate rare-earth content.
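The per-element tally and the absence list described above can be sketched as follows. This is an illustration under stated assumptions, not the regex used in the analysis script: it matches one capital letter plus an optional lowercase letter and keeps only tokens that are real element symbols, which is also why a placeholder such as 'REE' contributes nothing to any lanthanide's count. The names `ELEMENTS`, `elements_in`, `coverage`, and `absent_elements` are hypothetical.

```python
import re
from collections import Counter

# All 118 element symbols, in atomic-number order.
ELEMENTS = frozenset("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar
K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

# Assumed tokenizer: a capital letter optionally followed by one lowercase letter.
SYMBOL_RE = re.compile(r"[A-Z][a-z]?")


def elements_in(formula):
    """Return the set of distinct element symbols found in one formula string."""
    return {token for token in SYMBOL_RE.findall(formula) if token in ELEMENTS}


def coverage(formulas):
    """Count, for each element, the number of formulas that contain it."""
    counts = Counter()
    for formula in formulas:
        counts.update(elements_in(formula))  # a set, so each formula counts once
    return counts


def absent_elements(counts):
    """Elements of the periodic table that appear in no formula."""
    return sorted(ELEMENTS - set(counts))
```

Because `elements_in` returns a set, an element that occurs several times in one formula still counts as a single mineral, which matches the "distinct minerals" ranking used here.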
Novelty
Element coverage of mineral formulas has been studied descriptively, and the prevalence of oxygen and silicon in Earth's crust is a textbook fact. The specific quantitative claim — that 81.0 % of Wikidata's 5,662 mineral formulas contain oxygen, that exactly 72 of 118 elements appear, and that the 46-element absence list includes specifically Eu/Tb/Ho/Er/Tm/Lu as a data-quality fingerprint — does not appear as a single pinned claim in any source I could find on 2026-04-13.
How it upholds the rules
- 1. Not already discovered
- Web searches on 2026-04-13 for 'Wikidata mineral element coverage', 'fraction of minerals containing oxygen', and 'lanthanide absence Wikidata mineral formulas' returned general mineralogy resources but no source pinning the specific 81.0 % / 72-of-118 / 6-missing-lanthanides finding to the Wikidata mineral catalog.
- 2. Not computer science
- Mineralogy / data curation. The objects of study are catalogued mineral species and their chemical formulas; the program is a SPARQL fetch and a per-element tally.
- 3. Not speculative
- Every count is exact. The 6,284 / 5,662 / 4,588 / 72 / 46 numbers are direct counts on the pinned JSON.
Verification
(1) The Wikidata SPARQL response is pinned by SHA-256 50c83d4ec0f455205ec7cbaf190f0b2f03c62e66dcef158ef617e09aa0cfb9f3.
(2) Spot-check: calcite (CaCO₃) is correctly parsed as containing C, O, Ca; quartz (SiO₂) as containing Si, O.
(3) Seven of the top-10 elements (O, Si, Ca, Fe, Al, Na, Mg) are also among the most abundant elements in Earth's crust. The main departures are H at rank 2, which reflects the prevalence of hydrated and hydroxyl-bearing minerals, and S at rank 6 and Cu at rank 10, which reflect the large number of catalogued sulfur-bearing and copper-rich species (azurite, malachite, chalcocite, etc.) rather than crustal abundance.
(4) The 6-missing-lanthanide finding is independently verifiable against any standard rare-earth ore mineralogy reference: monazite, bastnäsite, and xenotime all contain Eu/Tb/Ho/Er/Tm/Lu in trace amounts, and the Wikidata entries for these minerals use the abbreviation '(REE)' or '(LREE)' instead of enumerating individual lanthanides, which supports the data-quality artifact explanation.
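Check (1) can be reproduced with the standard library alone. The sketch below assumes the stored file is byte-for-byte the response that was hashed; the function names `sha256_hex` and `matches_pin` are hypothetical.

```python
import hashlib

# Digest quoted in the text above.
PINNED_SHA256 = "50c83d4ec0f455205ec7cbaf190f0b2f03c62e66dcef158ef617e09aa0cfb9f3"


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def matches_pin(path, expected=PINNED_SHA256):
    """True if the file at `path` hashes to the expected digest."""
    with open(path, "rb") as handle:  # binary mode: hash the bytes as stored
        return sha256_hex(handle.read()) == expected.lower()
```

A call such as `matches_pin("discovery/minerals/wd_minerals.json")` would then confirm or refute the pin, provided the file has not been re-serialized since it was fetched.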
Sequences
4588 O · 3135 H · 1595 Si · 1399 Ca · 1248 Fe · 1209 S · 1158 Al · 1102 Na · 791 Mg · 748 Cu
5,662 mineral formulas · 72 of 118 elements present · 46 elements absent
Eu (europium), Tb (terbium), Ho (holmium), Er (erbium), Tm (thulium), Lu (lutetium) — present in real REE ores but masked by 'REE' shorthand in catalogue entries
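The percentages quoted in this report follow directly from the counts above. A small arithmetic check, using only numbers already stated here (the names `TOP10` and `prevalence_pct` are hypothetical):

```python
TOTAL_FORMULAS = 5662

# Top-10 per-element mineral counts, copied from the sequence above.
TOP10 = {
    "O": 4588, "H": 3135, "Si": 1595, "Ca": 1399, "Fe": 1248,
    "S": 1209, "Al": 1158, "Na": 1102, "Mg": 791, "Cu": 748,
}

ELEMENTS_PRESENT = 72
ELEMENTS_ABSENT = 46


def prevalence_pct(count, total=TOTAL_FORMULAS):
    """Share of formulas containing an element, in percent, to one decimal place."""
    return round(100 * count / total, 1)


for symbol, count in TOP10.items():
    print(f"{symbol:>2}  {count:>5}  {prevalence_pct(count):5.1f} %")
```

The present and absent element counts should also sum to the 118 elements of the periodic table.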
Next steps
- Repeat against the IMA mineral list (mindat.org or RRUFF), which uses fully enumerated formulas and would not have the REE shorthand artifact, to see whether the lanthanide absences disappear.
- Compute the per-element 'minerals containing element / abundance in Earth's crust' ratio to identify elements that are over- or under-represented in mineralogy relative to their geochemical prevalence.
- Track the discovery year of the most recent mineral containing each rare element to identify which trace constituents are still being uncovered.
- Submit Wikidata data-quality reports to add explicit lanthanide-resolved formulas for the major REE minerals (monazite, bastnäsite, xenotime).
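The proposed ratio of mineral count to crustal abundance could be computed along these lines. This is only a sketch: the mineral counts are the ones reported above, but the crustal mass fractions are rough, illustrative orders of magnitude that should be replaced by a cited reference table, and the function and variable names are hypothetical.

```python
# Mineral counts from this report.
MINERAL_COUNTS = {"O": 4588, "Si": 1595, "Cu": 748}

# ASSUMPTION: approximate crustal mass fractions, for illustration only.
# Replace with values from a cited geochemical reference before drawing conclusions.
CRUSTAL_MASS_FRACTION_APPROX = {"O": 0.46, "Si": 0.28, "Cu": 0.00006}


def representation_ratio(mineral_counts, crustal_fraction):
    """Minerals containing an element divided by that element's crustal mass fraction."""
    return {
        symbol: count / crustal_fraction[symbol]
        for symbol, count in mineral_counts.items()
        if crustal_fraction.get(symbol)  # skip elements with no (or zero) abundance value
    }


def ranked(ratios):
    """Element symbols ordered from most to least over-represented."""
    return sorted(ratios, key=ratios.get, reverse=True)


ratios = representation_ratio(MINERAL_COUNTS, CRUSTAL_MASS_FRACTION_APPROX)
```

A high ratio flags an element that forms many distinct mineral species relative to how much of it the crust contains, which is the pattern the copper observation in the Verification section points at.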
Artifacts
- Element coverage analysis script: discovery/minerals/element_coverage.py
- Wikidata minerals SPARQL response (pinned): discovery/minerals/wd_minerals.json