V · Verified Data deep dive

Meaning before metrics; quality before action

Define what enterprise facts mean, organize the vocabulary, model the domain, and prove that decision-critical data is fit for governed agent use.

The problem V solves

Matching columns is not shared understanding

An agent may read technically valid data and still act incorrectly because teams disagree about meaning: whether “delivery date” means promised, planned, attempted, or completed; whether “customer” means payer, receiver, or legal entity; or whether revenue is gross, net, booked, or recognized.

Semantic risk

The same label carries different meaning, units, time basis, granularity, population, or calculation across systems.

Quality risk

The meaning is agreed, but values are incomplete, stale, invalid, inconsistent, duplicated, or unsupported by provenance.

Decision risk

An agent combines facts that should not be compared, trusts a low-quality source, or applies a rule outside its intended context.

Assurance objective

Every decision-critical fact has governed meaning, authoritative source, quality expectations, lineage, owner, permitted use, and enforced failure behavior.

Plain-language definitions

From words to machine-actionable meaning

These concepts overlap, but they solve different problems.

Business term / definition

An approved description of one concept in business language, including owner, scope, examples, synonyms, exclusions, and rules. Example: On-time delivery means delivered at or before the committed timestamp, adjusted only by approved customer exceptions.
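A definition like this becomes testable once it is written as a rule. The sketch below is one possible reading, assuming that an approved customer exception replaces the commitment with a later approved timestamp. The function and parameter names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def is_on_time(
    delivered_at: datetime,
    committed_at: datetime,
    approved_exception_until: Optional[datetime] = None,
) -> bool:
    """Return True when delivery met the (possibly adjusted) commitment.

    Assumption: only approved exceptions are passed in, and an exception
    can extend the commitment but never shorten it.
    """
    effective_commitment = committed_at
    if approved_exception_until is not None:
        effective_commitment = max(committed_at, approved_exception_until)
    return delivered_at <= effective_commitment


committed = datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc)
late = datetime(2025, 3, 1, 12, 30, tzinfo=timezone.utc)
extended = datetime(2025, 3, 1, 13, 0, tzinfo=timezone.utc)
```

Here `is_on_time(late, committed)` is false, while the same delivery passes when `extended` is supplied as an approved exception. Whether exceptions may only extend the commitment is a question for the term owner to settle.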

Business glossary

The governed collection of business terms. It answers “What does this mean here?” and connects terms to data products, columns, reports, policies, owners, and controls.

Semantics

The meaning of data in context: concept, units, time basis, granularity, relationships, calculation, permitted interpretation, and intended use. A semantic layer makes those meanings reusable for analytics and AI.

Taxonomy

A controlled classification hierarchy used to organize concepts from broad to specific. Example: Disruption → Transportation disruption → Road closure → Full closure. A taxonomy mainly supports consistent classification and discovery.
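The example hierarchy can be held as simple child-to-parent links. This is a minimal sketch, not a taxonomy tool; the helper names are hypothetical.

```python
from typing import Optional

# Child -> parent links for the example hierarchy; None marks the root.
PARENT: dict[str, Optional[str]] = {
    "Disruption": None,
    "Transportation disruption": "Disruption",
    "Road closure": "Transportation disruption",
    "Full closure": "Road closure",
}


def path_to_root(term: str) -> list[str]:
    """Return the broad-to-specific path that ends at `term`."""
    if term not in PARENT:
        raise KeyError(f"{term!r} is not a governed taxonomy term")
    path = []
    current: Optional[str] = term
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return list(reversed(path))


def is_a(term: str, ancestor: str) -> bool:
    """True when `ancestor` is `term` itself or any broader term above it."""
    return ancestor in path_to_root(term)
```

Rejecting unknown terms keeps classification limited to the governed vocabulary, which is the point of a controlled hierarchy.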

Ontology

A formal model of domain concepts, their properties, relationships, constraints, and sometimes inference rules. Example: a Truck carries Shipment; a Shipment serves Customer; hazardous cargo must not use prohibited road classes.
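To illustrate what relationships and constraints add beyond a hierarchy, the sketch below stores the example as subject-predicate-object triples and checks the hazardous-cargo constraint. It is plain Python, not OWL or SHACL; the identifiers, predicate names, and the "tunnel" road class are invented for illustration.

```python
# Facts as (subject, predicate, object) triples, mirroring the example.
TRIPLES = {
    ("truck-17", "carries", "shipment-42"),
    ("shipment-42", "serves", "customer-9"),
    ("shipment-42", "hasCargoClass", "hazardous"),
    ("truck-17", "plannedOn", "road-A"),
    ("road-A", "hasRoadClass", "tunnel"),
}

# Constraint: cargo class -> road classes it must not use (illustrative).
PROHIBITED = {"hazardous": {"tunnel"}}


def objects(subject: str, predicate: str) -> set:
    """All objects linked to `subject` through `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def violations(truck: str) -> list:
    """Return (shipment, road, road_class) for each prohibited combination."""
    found = []
    for shipment in objects(truck, "carries"):
        for cargo in objects(shipment, "hasCargoClass"):
            for road in objects(truck, "plannedOn"):
                for road_class in objects(road, "hasRoadClass"):
                    if road_class in PROHIBITED.get(cargo, set()):
                        found.append((shipment, road, road_class))
    return sorted(found)
```

The violation is found only by following several relationships in sequence, which a classification hierarchy alone cannot express.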

Schema

The technical structure of a dataset or message: fields, types, nullability, keys, and nesting. A schema says how data is shaped; it rarely captures the complete business meaning.

Reference data

Controlled codes and allowed values shared across systems, such as vehicle class, country, closure severity, unit of measure, or shipment status. It operationalizes part of a taxonomy.

Data quality

The degree to which data is fit for a stated use. Quality is not “clean” in the abstract; it is measured against decision-specific rules, thresholds, time windows, and consequences.

Data contract

A versioned agreement between producers and consumers covering structure, semantics, ownership, quality SLOs, compatibility, security, permitted use, and failure behavior.
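A contract can be kept as a small versioned record alongside the schema. The sketch below covers only some of the listed elements and uses a deliberately simple compatibility rule. The class, field names, and the rule that added fields are compatible are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataContract:
    product: str
    version: str                # e.g. "2.0.0"
    owner: str
    schema: dict                # field name -> declared type
    quality_slos: dict = field(default_factory=dict)
    permitted_uses: frozenset = frozenset()
    on_failure: str = "quarantine"


def breaking_changes(old: DataContract, new: DataContract) -> list:
    """List removed or retyped fields; added fields count as compatible."""
    problems = []
    for name, dtype in old.schema.items():
        if name not in new.schema:
            problems.append(f"removed field: {name}")
        elif new.schema[name] != dtype:
            problems.append(f"type change: {name} {dtype} -> {new.schema[name]}")
    return problems


v1 = DataContract(
    "road_closures", "1.0.0", "Dispatch Data Operations",
    {"closure_id": "string", "valid_from": "timestamp"},
)
v2 = DataContract(
    "road_closures", "2.0.0", "Dispatch Data Operations",
    {"closure_id": "int", "status": "string"},
)
```

A structural check like this catches only part of a breaking change. A shift in meaning with an unchanged schema still needs the semantic review the contract describes.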

Lineage and provenance

Lineage shows where data moved and how it changed. Provenance records the origin, authority, time, method, and confidence supporting a particular claim.

Worked logistics example

One concept, five connected artifacts

Artifact | Example | Question answered
Definition | Road closure: an authoritative restriction that prevents travel on a defined network segment for a stated interval and vehicle population. | What exactly does the term mean?
Taxonomy | Disruption → Transportation → Road restriction → Partial closure / Full closure; cause → crash / construction / weather / emergency. | How do we classify and find it?
Schema | closure_id, segment_id, status, valid_from, valid_to, vehicle_classes, source, confidence. | How is it represented in a record or event?
Ontology | Closure affects RoadSegment; RoadSegment belongs to Route; Truck uses Route; CargoClass is prohibited on RoadClass. | How does it relate to the domain, and what can be inferred?
Quality rules | ID is unique; geometry is valid; source is approved; received within 60 seconds; start precedes end; confidence ≥0.90 for automatic rerouting; vehicle scope is not empty. | Can this particular fact safely drive the decision?

Agent-ready context: supply the approved definition, code meaning, units, temporal validity, source authority, quality result, and relevant relationships with the value, not merely the field name and number.
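The quality rules for a closure record can be sketched as a single check that names every failed rule. Several details are assumptions: "received within 60 seconds" is measured from a hypothetical `published_at` field that is not in the schema, the approved source code is invented, and the geometry check is omitted because it would need a geometry library.

```python
from datetime import datetime, timedelta, timezone

APPROVED_SOURCES = {"state-dot"}      # hypothetical authority code
MAX_DELAY = timedelta(seconds=60)
MIN_CONFIDENCE = 0.90                 # threshold for automatic rerouting


def closure_failures(record: dict, seen_ids: set, received_at: datetime) -> list:
    """Return the names of failed rules; an empty list means all passed."""
    failures = []
    if record["closure_id"] in seen_ids:
        failures.append("duplicate_id")
    if record["source"] not in APPROVED_SOURCES:
        failures.append("unapproved_source")
    if received_at - record["published_at"] > MAX_DELAY:
        failures.append("late_arrival")
    if not record["valid_from"] < record["valid_to"]:
        failures.append("start_not_before_end")
    if record["confidence"] < MIN_CONFIDENCE:
        failures.append("low_confidence")
    if not record["vehicle_classes"]:
        failures.append("empty_vehicle_scope")
    return failures


now = datetime(2025, 3, 1, 12, 0, 30, tzinfo=timezone.utc)
record = {
    "closure_id": "C-100",
    "source": "state-dot",
    "published_at": now - timedelta(seconds=20),
    "valid_from": now,
    "valid_to": now + timedelta(hours=2),
    "confidence": 0.95,
    "vehicle_classes": ["heavy_truck"],
}
```

Returning rule names rather than a single boolean lets the caller choose a response per failure and record which rule blocked the decision.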

Data quality dimensions

Measure fitness for the decision

Dimension | Meaning | Logistics rule example | Action on failure
Accuracy | Correctly represents the real-world state. | Vehicle position is within the accepted device error and agrees with the road network. | Lower confidence; request another fix or human confirmation.
Completeness | Required values and populations are present. | Every active shipment has truck, destination, committed time, priority, and cargo class. | Block automated route execution.
Consistency | Values do not conflict across sources or rules. | TMS shipment status agrees with the latest accepted shipment event. | Quarantine; reconcile using source precedence.
Validity | Conforms to type, format, range, domain, and business constraints. | Coordinates use the declared CRS; closure end is after start; status is an approved code. | Reject the record or event.
Uniqueness | One real-world fact is not represented as unintended duplicates. | One authoritative closure identity per source/version and network segment. | Deduplicate before triggering action.
Timeliness / freshness | Available soon enough and recent enough for its use. | Closure received within 60 seconds; vehicle position no older than 120 seconds. | Use safe degradation or route to dispatcher.
Integrity | Keys and relationships remain valid and complete. | Every route segment and shipment references an existing governed entity. | Prevent publish; repair the relationship.
Provenance | Origin, authority, collection method, time, and transformation are known. | Closure traces to an approved transportation authority and signed ingestion record. | Do not use for autonomous action.
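The failure actions in the table can be encoded as a lookup so that each failed dimension produces its designed response. This sketch adds one conservative assumption that the table does not state: any failed dimension suspends autonomous action until it is resolved. The action codes are hypothetical shorthand for the table's wording.

```python
# Designed response per failed dimension, in the order of the table.
ACTION = {
    "accuracy": "lower_confidence",
    "completeness": "block_automation",
    "consistency": "quarantine",
    "validity": "reject",
    "uniqueness": "deduplicate",
    "timeliness": "safe_degrade",
    "integrity": "prevent_publish",
    "provenance": "no_autonomous_action",
}


def plan_response(failed_dimensions: set) -> dict:
    """Map failed dimensions to actions, preserving the table order."""
    unknown = set(failed_dimensions) - ACTION.keys()
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    actions = [ACTION[d] for d in ACTION if d in failed_dimensions]
    # Assumption: autonomy is allowed only when nothing failed.
    return {"autonomous_action_allowed": not actions, "actions": actions}
```

For example, `plan_response({"timeliness", "validity"})` yields the reject and safe-degrade actions and disallows autonomous action.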

Quality SLO pattern: measure + population + threshold + window + owner + response. Example: “≥99.5% of active shipments have a valid committed timestamp during each 15-minute window; otherwise pause automatic ETA notification and alert Dispatch Data Operations.”
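The six-part pattern maps directly onto a small record plus an evaluation over one window. The sketch below uses the example SLO; the class and function names are hypothetical, and treating an empty population as unmet is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualitySLO:
    measure: str
    population: str
    threshold: float        # minimum passing ratio, 0..1
    window_minutes: int
    owner: str
    response: str


def evaluate(slo: QualitySLO, passing: int, total: int) -> tuple:
    """Return (met, ratio) for one window.

    Assumption: an empty population counts as unmet, so a silent feed
    cannot look healthy.
    """
    if total == 0:
        return False, 0.0
    ratio = passing / total
    return ratio >= slo.threshold, ratio


slo = QualitySLO(
    measure="valid committed timestamp",
    population="active shipments",
    threshold=0.995,
    window_minutes=15,
    owner="Dispatch Data Operations",
    response="pause automatic ETA notification and alert owner",
)
```

With 1,980 of 2,000 active shipments passing in a window, the ratio is 0.99 and the SLO is unmet, so the stated response applies.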

V and K boundary

Meaning begins in V; relationship reasoning deepens in K

V · Verified Data owns

  • Approved terms, definitions, synonyms, codes, units, calculations, granularity, and time basis
  • Data-product schema and semantic contracts
  • Taxonomy and reference-data governance
  • Quality rules, thresholds, SLOs, lineage, provenance, ownership, and permitted use
  • Enough semantics for a consumer to interpret each decision-critical fact correctly

K · Knowledge Graphs owns

  • Formal relationships across entities, events, policies, locations, time, and risk
  • Ontology constraints, competency questions, identity resolution, inference, and graph query behavior
  • Claim-level provenance, temporal truth, confidence, and relationship authorization
  • Connected reasoning that cannot be expressed safely as isolated tables, metrics, or hierarchies

Shared responsibility: the business vocabulary and canonical identifiers must remain aligned. A formal ontology may be governed jointly: V assures definitions and source facts; K assures relationship and inference behavior.

Current platform capability map

Databricks, Snowflake, and Microsoft Fabric

Map the architectural responsibility first, then select platform modules. Product features can support the method; none replaces ownership, definition approval, decision-specific quality rules, or operating evidence.

Need | Databricks | Snowflake | Microsoft Fabric ecosystem | VENKAT interpretation
Catalog, ownership, discovery, access, lineage | Unity Catalog: governed objects, permissions, discovery, lineage, tags, and metadata. | Horizon Catalog: discovery, governance, lineage, classification, and quality capabilities. | OneLake Catalog, Fabric governance, and Microsoft Purview for broader catalog/governance workflows. | Implement V ownership, classification, lineage, discovery, access, and evidence indexing.
Business semantics and reusable metrics | Unity Catalog metric views define governed measures and dimensions for consistent consumption. | Semantic Views model business entities, facts, dimensions, relationships, and metrics. | Power BI semantic models define measures, relationships, hierarchies, and business-facing analytical meaning. | Use for analytical semantics and metric consistency. Add glossary definitions, owners, scope, grain, time basis, and exclusions.
Glossary, taxonomy, classification | Use governed tags, comments, domains/catalog structure, and external glossary integration where required. | Use tags, classifications, object metadata, and Horizon governance; integrate a formal glossary when enterprise term workflows exceed native metadata. | Use Purview Unified Catalog business concepts, governance domains, data products, classifications, and glossary capabilities alongside OneLake. | Publish approved terms and hierarchical classifications; link them to products, fields, metrics, controls, and owners.
Formal ontology and inference | Metric views and catalog metadata are not a general formal ontology. Model tables/views or integrate an RDF/property-graph and ontology tool for OWL/SHACL/inference needs. | Semantic Views are not automatically a formal ontology. Use relational models or external graph/semantic technology when formal inference is required. | Fabric IQ Ontology provides a native ontology item and is currently documented as Preview; evaluate preview limitations and change risk. | Do not relabel a semantic model as an ontology. Use formal ontology only when relationships, constraints, interoperability, or inference justify it.
Data quality rules and monitoring | Lakeflow Declarative Pipelines expectations validate records and define warn, drop, or fail behavior; supplement with profiling and SLO dashboards. | Data Metric Functions and data-quality monitoring measure data with system or custom metrics. | Microsoft Purview Data Quality provides rules, scans, scores, and governance workflows for registered assets. | Implement dimension-specific rules, thresholds, population/window, owner, incident response, and safe degradation.
Data contracts and schema change | Use table constraints/schema controls, pipeline expectations, versioned code, and catalog metadata; record consumer compatibility explicitly. | Use table/schema governance, policies, change controls, tags, and contract tests in delivery pipelines. | Use lakehouse/warehouse schemas, deployment pipelines, Purview metadata, and producer-consumer contract tests. | A platform schema is only part of a contract; add semantics, SLOs, compatibility, security, permitted use, and failure behavior.
Agent consumption | Expose governed views/metric views and catalog metadata through least-privileged retrieval or tools. | Expose governed semantic views and curated data products through scoped roles/tools. | Expose semantic models, governed OneLake data products, or ontology-backed context through approved Fabric services. | Return definition, units, time, source, quality status, lineage/provenance, and policy context with the value.
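Whatever the platform, the agent-facing payload can follow one shape. The sketch below is platform-neutral and does not use any vendor API; the class, its field names, and the sample values are hypothetical. It refuses to emit a value whose governing context is incomplete.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class GovernedFact:
    term: str
    value: Any
    definition: str
    unit: str
    as_of: str              # ISO 8601 timestamp the value is valid for
    source: str
    quality_status: str     # e.g. "pass", "warn", "fail"
    lineage_ref: str
    permitted_use: str


def to_agent_context(fact: GovernedFact) -> dict:
    """Serialize a fact for an agent, rejecting incomplete context."""
    payload = asdict(fact)
    missing = sorted(k for k, v in payload.items() if v is None or v == "")
    if missing:
        raise ValueError(f"fact lacks governed context: {missing}")
    return payload


fact = GovernedFact(
    term="closure_confidence",
    value=0.95,
    definition="Source-reported confidence that the closure is in effect.",
    unit="probability (0-1)",
    as_of="2025-03-01T12:00:30Z",
    source="state-dot",
    quality_status="pass",
    lineage_ref="ingest/closures/run-001",
    permitted_use="automatic rerouting",
)
```

The quality status travels with the value, so the consuming agent or its policy layer decides whether a "warn" result is acceptable for the decision at hand.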

Selection shortcut: use the platform already closest to the governed source and operating team. Introduce another technology only for a demonstrated semantic, ontology, graph, quality, or operational gap—not because its diagram looks impressive.

Implementation method

How to achieve Verified Data meaning and quality

1. Bound the decision

Name the agent decision, population, material consequence, latency, autonomy, jurisdictions, and failure tolerance. Start with 10–20 decision-critical facts.

2. Inventory meaning

For each fact, record current labels, definitions, source, owner, units, grain, time basis, codes, calculations, consumers, and known disagreement.
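Once the inventory exists, known disagreement can be surfaced mechanically. The sketch below groups entries by label and reports the attributes on which systems differ. The inventory rows, system names, and attribute names are illustrative.

```python
from collections import defaultdict

INVENTORY = [
    {"label": "delivery_date", "system": "TMS",
     "time_basis": "promised", "unit": "UTC timestamp"},
    {"label": "delivery_date", "system": "Billing",
     "time_basis": "completed", "unit": "local date"},
    {"label": "cargo_class", "system": "TMS",
     "time_basis": "n/a", "unit": "code"},
]


def disagreements(inventory: list, attributes=("time_basis", "unit")) -> dict:
    """Return {label: [attributes whose values differ across systems]}."""
    by_label = defaultdict(list)
    for entry in inventory:
        by_label[entry["label"]].append(entry)
    conflicts = {}
    for label, entries in by_label.items():
        differing = sorted(
            a for a in attributes if len({e[a] for e in entries}) > 1
        )
        if differing:
            conflicts[label] = differing
    return conflicts
```

The output is a worklist for the definition workshops: each flagged label needs either one approved definition or documented contextual variants.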

3. Agree definitions

Run workshops with business owner, steward, producer, consumer, architect, risk, and domain expert. Approve one definition or document contextual variants.

4. Build taxonomy

Organize terms and reference codes into governed hierarchies. Assign stable identifiers; define synonyms, mappings, valid values, version, and change authority.

5. Model semantics

Define entities, attributes, measures, dimensions, grain, units, time, relationships, and calculation logic in the platform semantic layer.

6. Add ontology where needed

Write competency questions first. Create formal concepts, relationships, constraints, and inference only when the use case needs connected reasoning or interoperability.

7. Contract the product

Version structure and semantics together. Include ownership, classification, quality SLOs, lineage, compatibility, security, retention, permitted use, and incident behavior.

8. Implement quality

Profile the baseline, define decision-specific rules and thresholds, test at ingestion and transformation, and select warn, quarantine, reject, or safe-degrade responses.

9. Publish and observe

Connect glossary, taxonomy, semantic model, physical assets, lineage, dashboards, incidents, and owners in the catalog. Monitor trends and consumer impact.

10. Certify and sustain

Sample facts and decisions, inject failures, reconstruct outcomes, review access and change, remediate findings, and reassess after material semantic or source change.

Definition of done

Evidence that V is operating

Required artifacts

  • Decision-critical data inventory with accountable owners
  • Approved glossary terms and taxonomy/version history
  • Semantic model and, where justified, ontology plus competency questions
  • Versioned contracts linked to physical schemas and consumers
  • Quality rules, SLOs, baseline, run results, incidents, and remediation
  • Lineage/provenance samples, access review, retention, and permitted-use record
  • Change impact and consumer compatibility evidence

Acceptance tests

  • Two independent practitioners interpret sampled facts consistently.
  • Every automated decision input traces to an approved term, source, owner, and contract.
  • Stale, malformed, conflicting, and unauthorized data produces the designed response.
  • Metric results reconcile across approved semantic consumers.
  • Ontology inferences trace to governed claims and constraints.
  • A material definition or schema change triggers impact review and compatible rollout.
  • Quality dashboards lead to owned incidents—not decorative red tiles.