The Watch

Datamining.

Structured extraction at scale. Rosters, timelines, and datasets built to a declared schema — feeding the other engines or standing alone.

TYPICAL CADENCE
One-shot; refresh on request
TYPICAL SOURCE COUNT
Variable; schema-driven
TYPICAL OUTPUT LENGTH
Structured data, not prose
TYPICAL CONFIDENCE REGISTER
Coverage-weighted; per-field reliability

// QUESTION CLASS

Produce the list, the roster, the dataset.

Datamining is how The Watch turns unstructured source material into structured data. When the question is "who are all the people in category X?" or "what were all the events that happened in period Y?" or "give me the dataset of Z with these fields," Datamining is the engine that produces it. This is not a report product in the narrative sense; the output is a table, a roster, a timeline, a graph file. It is consumed programmatically as often as it is read. Datamining also feeds the other engines. Foundational baselines rely on it to extract structural facts. Anticipatory forecasts rely on it for indicator data. Actionable targeting relies on it for node inventories. When the output of another engine lists specific entities, specific dates, or specific numbers, Datamining produced that inventory.

"Produce a roster of every individual named in the public record as holding [type of role] in [entity] over the past ten years. Include tenure, predecessor, and cited source per row."

"Extract every documented [type of event] in [region] between [date] and [date]. Fields: date, location, actors, source."

"Build a structured dataset of [subject class] with the following schema: [fields]. Source coverage target: 95% of named entities in the universe."

"Compile a timeline of [individual]'s public statements on [topic] with full quotation and attribution."

// ANATOMY

A Datamining output is a dataset with a cover document. The dataset itself is the deliverable: a table, a graph, a timeline, in whatever format the customer consumes. The cover document explains how it was built: what the schema was, what queries ran, what sources were scanned, what was rejected by quality gates, and what the analyst's confidence is in each field of the schema. A reader who wants to use the dataset for downstream analysis needs to know what they can trust in it; the cover document is that instrument panel.

The full product contains, in order (sketched in code after the list):

§ 01
Schema
The field definitions, declared before the run
§ 02
Query Pack
The specific queries that drove extraction
§ 03
Source Corpus
What was scanned, classified by source type
§ 04
Coverage Assessment
Percentage of known universe captured; known gaps
§ 05
Structured Output
The actual data, rendered as table / graph / timeline
§ 06
Per-Field Reliability
Confidence rating per column of the schema
§ 07
Quality Gates
What was rejected and why
§ 08
Refresh Plan
How this dataset stays current
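
One way to picture the bundle is as code. The sketch below is illustrative only; every class and field name is an assumption made for the sketch, not a prescribed Watch format.

# Illustrative sketch of the product bundle. All names here are
# assumptions for illustration, not a prescribed format.
from dataclasses import dataclass
from enum import Enum


class Reliability(Enum):
    HIGH = "HIGH"
    MODERATE = "MODERATE"
    LOW = "LOW"


@dataclass
class CoverDocument:
    schema: dict[str, str]          # § 01: field name -> definition, declared pre-run
    query_pack: list[str]           # § 02: the exact queries that drove extraction
    source_corpus: dict[str, int]   # § 03: source type -> count scanned
    coverage: float                 # § 04: fraction of the known universe captured
    known_gaps: list[str]           # § 04: gaps named, not hidden
    per_field_reliability: dict[str, Reliability]  # § 06: confidence per column
    quality_gate_log: list[str]     # § 07: what was rejected or merged, and why
    refresh_plan: str               # § 08: how the dataset stays current


@dataclass
class DataminingProduct:
    cover: CoverDocument
    rows: list[dict[str, str]]      # § 05: the structured output itself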

// STRUCTURED OUTPUT

Data, with a source on every cell.

// VISUAL · STRUCTURED ROSTER TABLE

// WORKED EXAMPLE

From question to dataset.

// THE KIQ

"Produce a roster of every individual publicly documented as holding [specific named role] in [specific entity] between [year] and [year]. Schema: Name, Tenure Start, Tenure End, Predecessor, Successor, Source. Coverage target: 100% of named individuals."

// PIPELINE TRACE

// INTAKE
PIOT complete. Schema declared: 6 fields
Universe definition: [role] in [entity], [year range]
Coverage target: 100%

// QUERY PACK BUILDER
Queries constructed: 14
  · Direct searches (role + entity + year): 4
  · Adjacent searches (predecessor/successor chains): 4
  · Historical references in secondary literature: 3
  · Cross-references to allied roles in partner entities: 3

// RUN
Sources scanned: 2,847
Sources incorporated: 283
Entities extracted: 41 candidate individuals
Quality gates applied: 6 rejected (insufficient sourcing), 3 merged (variants of same individual)

// OUTPUT
Final roster: 32 individuals, 6 fields per row, 192 cell-level citations.
Coverage: 97% (one known gap: [specific year], where the public record is thin).
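
The quality-gate step in the trace above, sketched as logic. The two-source threshold and the name-folding rule are assumptions chosen to mirror the trace, not fixed Watch policy.

# Illustrative quality-gate pass over extracted candidates.
def apply_quality_gates(candidates: list[dict]) -> tuple[list[dict], list[str]]:
    log: list[str] = []
    kept: dict[str, dict] = {}

    for cand in candidates:
        # Gate: insufficient sourcing -> reject, and say so in the log.
        if len(cand["sources"]) < 2:
            log.append(f"REJECTED {cand['name']}: insufficient sourcing")
            continue

        # Gate: variants of the same individual -> merge, pooling sources.
        key = normalize_name(cand["name"])
        if key in kept:
            kept[key]["sources"].extend(cand["sources"])
            log.append(f"MERGED {cand['name']} into {kept[key]['name']}")
        else:
            kept[key] = cand

    return list(kept.values()), log


def normalize_name(name: str) -> str:
    # Crude variant folding for the sketch: case and punctuation only.
    return "".join(ch for ch in name.lower() if ch.isalnum())
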
// SAMPLE OUTPUT EXCERPT

The roster contains 32 individuals across a 40-year window, reflecting 97% coverage of the target universe. The single known gap is the interval [specific year — specific year], during which the public record is thin due to [reason]. We flag the two rows adjacent to this gap (entries 11 and 12) as LOW-confidence on the Predecessor and Successor fields respectively. Every other row carries HIGH confidence across all six fields. Per-field reliability: Name (HIGH across all rows), Tenure dates (HIGH for 28 rows, MODERATE for 4), Predecessor/Successor (HIGH for 30, LOW for 2, as noted)…

// TRADECRAFT

Structured data inherits the problems of its sources. The cover document is how we show our work.

Schema first.

The fields are declared before the extraction runs. A Datamining product is only as rigorous as its schema — an ill-defined field produces unusable data. The Intake step pressure-tests the schema against the source universe before the run: do the sources actually support these field definitions, at this coverage target, within this time bound? If not, the schema is revised before anything runs.

Structured Extraction · Schema-first methodology
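
The Intake pressure-test, as a sketch. The sampling approach and the use of the coverage target as the pass threshold are assumptions; the point is only that the check runs before extraction does.

# Illustrative pre-run schema check: sample the source universe and ask
# whether each declared field is actually recoverable at the target rate.
def pressure_test(schema: list[str],
                  sample_docs: list[dict],
                  coverage_target: float) -> list[str]:
    weak_fields = []
    for field_name in schema:
        hit_rate = sum(field_name in doc for doc in sample_docs) / len(sample_docs)
        if hit_rate < coverage_target:
            weak_fields.append(field_name)
    return weak_fields  # revise these before anything runs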

Per-cell citation.

Every cell of the output table carries the source it came from. Not per-row, not per-entity — per-cell. This is how a downstream analyst knows they can use the data: they can verify any individual datum against the specific source that produced it. Structured data without cell-level sourcing is not auditable data, and we don't produce it.

ICD 206 · Cell-level sourcing
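
Concretely, per-cell sourcing means a cell is never a bare value. A minimal sketch, with illustrative names; the bracketed placeholders stand in for real data.

from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    source: str  # the specific citation this datum can be verified against


# A row is a mapping of schema field -> sourced cell, e.g.:
row = {
    "name": Cell("[individual]", "[source A]"),
    "tenure_start": Cell("[year]", "[source B]"),
}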

Known gaps are part of the output.

A Datamining product names the gaps in its own coverage. Rows that could not be filled confidently are flagged, not omitted. A "complete-looking" dataset that quietly dropped the 3% of entities with thin sourcing is a less honest dataset than one that includes those rows with LOW-confidence flags. The customer can filter; they cannot unfilter what was never shown.

ICD 203 · Standard 2: Properly expresses uncertainties
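
What "the customer can filter" means in practice, sketched under the assumption that each row carries a per-field confidence map.

def high_confidence_only(rows: list[dict]) -> list[dict]:
    # Keep only rows rated HIGH on every field. The LOW rows are still in
    # the shipped dataset; the filter can always be loosened or dropped.
    return [r for r in rows
            if all(c == "HIGH" for c in r["confidence"].values())]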

Feeds all other engines. Often invoked as the first step before a Foundational or Anticipatory run.

// AUTHORED BY

Jesse R. Wilson
FORMER DIA · 20 YEARS · STRATEGIC INTELLIGENCE

// OTHER ENGINES

See this engine run a real question.