V · Verified Data deep dive
Define what enterprise facts mean, organize the vocabulary, model the domain, and prove that decision-critical data is fit for governed agent use.
The problem V solves
An agent may read technically valid data and still act incorrectly because teams disagree about meaning: whether “delivery date” means promised, planned, attempted, or completed; whether “customer” means payer, receiver, or legal entity; or whether revenue is gross, net, booked, or recognized.
The same label carries different meaning, units, time basis, granularity, population, or calculation across systems.
The meaning is agreed, but values are incomplete, stale, invalid, inconsistent, duplicated, or unsupported by provenance.
An agent combines facts that should not be compared, trusts a low-quality source, or applies a rule outside its intended context.
Every decision-critical fact has governed meaning, authoritative source, quality expectations, lineage, owner, permitted use, and enforced failure behavior.
Plain-language definitions
The correct spelling is taxonomy. These concepts overlap, but they solve different problems.
Definition: An approved description of one concept in business language, including owner, scope, examples, synonyms, exclusions, and rules. Example: On-time delivery means delivered at or before the committed timestamp, adjusted only by approved customer exceptions.
Glossary: The governed collection of business terms. It answers “What does this mean here?” and connects terms to data products, columns, reports, policies, owners, and controls.
Semantics: The meaning of data in context: concept, units, time basis, granularity, relationships, calculation, permitted interpretation, and intended use. A semantic layer makes those meanings reusable for analytics and AI.
Taxonomy: A controlled classification hierarchy used to organize concepts from broad to specific. Example: Disruption → Transportation disruption → Road closure → Full closure. A taxonomy mainly supports consistent classification and discovery.
Ontology: A formal model of domain concepts, their properties, relationships, constraints, and sometimes inference rules. Example: a Truck carries Shipment; a Shipment serves Customer; hazardous cargo must not use prohibited road classes.
Schema: The technical structure of a dataset or message: fields, types, nullability, keys, and nesting. A schema says how data is shaped; it rarely captures the complete business meaning.
Reference data: Controlled codes and allowed values shared across systems, such as vehicle class, country, closure severity, unit of measure, or shipment status. It operationalizes part of a taxonomy.
Data quality: The degree to which data is fit for a stated use. Quality is not “clean” in the abstract; it is measured against decision-specific rules, thresholds, time windows, and consequences.
Data contract: A versioned agreement between producers and consumers covering structure, semantics, ownership, quality SLOs, compatibility, security, permitted use, and failure behavior.
Lineage and provenance: Lineage shows where data moved and how it changed. Provenance records the origin, authority, time, method, and confidence supporting a particular claim.
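To make the taxonomy idea concrete, the hierarchy from the example above can be held as simple child-to-parent links. This is a minimal sketch, not a governed implementation: the dictionary keys are the labels from the text used as stand-in identifiers, and the function names `classification_path` and `is_a` are hypothetical.

```python
from typing import Optional

# Child -> parent links for the example hierarchy in the text.
# Assumption: plain labels are used as keys; a governed taxonomy would
# assign stable identifiers and keep labels as attributes.
PARENT: dict[str, Optional[str]] = {
    "Disruption": None,
    "Transportation disruption": "Disruption",
    "Road closure": "Transportation disruption",
    "Full closure": "Road closure",
}


def classification_path(term: str) -> list[str]:
    """Return the path from the broadest concept down to `term`."""
    if term not in PARENT:
        raise KeyError(f"unknown taxonomy term: {term!r}")
    path: list[str] = []
    current: Optional[str] = term
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return list(reversed(path))


def is_a(term: str, ancestor: str) -> bool:
    """True if `ancestor` lies on the path to `term` (the term itself included)."""
    return ancestor in classification_path(term)
```

A lookup such as `is_a("Full closure", "Disruption")` supports consistent classification and discovery. It does not express constraints or inference, which is where an ontology differs.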
Worked logistics example
| Artifact | Example | Question answered |
|---|---|---|
| Definition | Road closure: an authoritative restriction that prevents travel on a defined network segment for a stated interval and vehicle population. | What exactly does the term mean? |
| Taxonomy | Disruption → Transportation → Road restriction → Partial closure / Full closure; cause → crash / construction / weather / emergency. | How do we classify and find it? |
| Schema | closure_id, segment_id, status, valid_from, valid_to, vehicle_classes, source, confidence. | How is it represented in a record or event? |
| Ontology | Closure affects RoadSegment; RoadSegment belongs to Route; Truck uses Route; CargoClass is prohibited on RoadClass. | How does it relate to the domain, and what can be inferred? |
| Quality rules | ID is unique; geometry is valid; source is approved; received within 60 seconds; start precedes end; confidence ≥0.90 for automatic rerouting; vehicle scope is not empty. | Can this particular fact safely drive the decision? |
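The quality rules in the last row can be sketched as a single check over one closure record. The field names follow the schema row. Everything else is an assumption made for illustration: the `published_at` and `received_at` fields (needed to measure the 60-second rule), the `state_dot` source code, and the rule names. Geometry validation is omitted because it depends on the network model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

APPROVED_SOURCES = {"state_dot"}      # hypothetical approved source code
MAX_DELAY = timedelta(seconds=60)     # "received within 60 seconds"
MIN_CONFIDENCE = 0.90                 # threshold for automatic rerouting


@dataclass(frozen=True)
class Closure:
    closure_id: str
    segment_id: str
    status: str
    valid_from: datetime
    valid_to: datetime
    vehicle_classes: tuple[str, ...]
    source: str
    confidence: float
    published_at: datetime  # assumption: not part of the schema row above
    received_at: datetime   # assumption: not part of the schema row above


def failed_rules(c: Closure, seen_ids: set[str]) -> list[str]:
    """Return the names of the rules this record fails (empty list = all pass)."""
    failures: list[str] = []
    if c.closure_id in seen_ids:
        failures.append("id_not_unique")
    if c.source not in APPROVED_SOURCES:
        failures.append("source_not_approved")
    if c.received_at - c.published_at > MAX_DELAY:
        failures.append("received_late")
    if not c.valid_from < c.valid_to:
        failures.append("start_not_before_end")
    if c.confidence < MIN_CONFIDENCE:
        failures.append("confidence_below_threshold")
    if not c.vehicle_classes:
        failures.append("vehicle_scope_empty")
    return failures
```

Returning the list of failed rules, rather than a single boolean, lets the caller choose a different response per rule, for example rejecting an invalid interval but only lowering autonomy when confidence is below the threshold.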
Data quality dimensions
| Dimension | Meaning | Logistics rule example | Action on failure |
|---|---|---|---|
| Accuracy | Correctly represents the real-world state. | Vehicle position is within the accepted device error and agrees with the road network. | Lower confidence; request another fix or human confirmation. |
| Completeness | Required values and populations are present. | Every active shipment has truck, destination, committed time, priority, and cargo class. | Block automated route execution. |
| Consistency | Values do not conflict across sources or rules. | TMS shipment status agrees with the latest accepted shipment event. | Quarantine; reconcile using source precedence. |
| Validity | Conforms to type, format, range, domain, and business constraints. | Coordinates use the declared CRS; closure end is after start; status is an approved code. | Reject the record or event. |
| Uniqueness | One real-world fact is not represented as unintended duplicates. | One authoritative closure identity per source/version and network segment. | Deduplicate before triggering action. |
| Timeliness / freshness | Available soon enough and recent enough for its use. | Closure received within 60 seconds; vehicle position no older than 120 seconds. | Use safe degradation or route to dispatcher. |
| Integrity | Keys and relationships remain valid and complete. | Every route segment and shipment references an existing governed entity. | Prevent publish; repair the relationship. |
| Provenance | Origin, authority, collection method, time, and transformation are known. | Closure traces to an approved transportation authority and signed ingestion record. | Do not use for autonomous action. |
Quality SLO pattern: measure + population + threshold + window + owner + response. Example: “≥99.5% of active shipments have a valid committed timestamp during each 15-minute window; otherwise pause automatic ETA notification and alert Dispatch Data Operations.”
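The SLO pattern above can be sketched as a small evaluation over one window. This is an illustrative sketch under stated assumptions: a timestamp counts as "valid" when it is simply present, an empty window is treated as meeting the SLO, and the names `SloResult` and `evaluate_slo` are hypothetical. The caller is assumed to supply the active shipments for a single 15-minute window.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

RESPONSE = "pause automatic ETA notification; alert Dispatch Data Operations"


@dataclass(frozen=True)
class SloResult:
    measured: float          # share of the population passing the measure
    threshold: float
    met: bool
    response: Optional[str]  # action to take when the SLO is missed


def evaluate_slo(
    committed_timestamps: list[Optional[datetime]],
    threshold: float = 0.995,
) -> SloResult:
    """Evaluate one window: measure + population + threshold + response."""
    population = len(committed_timestamps)
    if population == 0:
        # Assumption: no active shipments means nothing can be mis-notified.
        return SloResult(1.0, threshold, True, None)
    # Assumption: "valid" here only means the timestamp is present.
    valid = sum(1 for ts in committed_timestamps if ts is not None)
    measured = valid / population
    met = measured >= threshold
    return SloResult(measured, threshold, met, None if met else RESPONSE)
```

The owner in the pattern is carried by the response text in this sketch. A fuller version would record the owner and the window boundaries as separate fields so incidents can be routed and audited.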
V and K boundary
Shared responsibility: the business vocabulary and canonical identifiers must remain aligned. A formal ontology may be governed jointly: V assures definitions and source facts; K assures relationship and inference behavior.
Current platform capability map
Map the architectural responsibility first, then select platform modules. Product features can support the method; none replaces ownership, definition approval, decision-specific quality rules, or operating evidence.
| Need | Databricks | Snowflake | Microsoft Fabric ecosystem | VENKAT interpretation |
|---|---|---|---|---|
| Catalog, ownership, discovery, access, lineage | Unity Catalog: governed objects, permissions, discovery, lineage, tags, and metadata. | Horizon Catalog: discovery, governance, lineage, classification, and quality capabilities. | OneLake Catalog, Fabric governance, and Microsoft Purview for broader catalog/governance workflows. | Implement V ownership, classification, lineage, discovery, access, and evidence indexing. |
| Business semantics and reusable metrics | Unity Catalog metric views define governed measures and dimensions for consistent consumption. | Semantic Views model business entities, facts, dimensions, relationships, and metrics. | Power BI semantic models define measures, relationships, hierarchies, and business-facing analytical meaning. | Use for analytical semantics and metric consistency. Add glossary definitions, owners, scope, grain, time basis, and exclusions. |
| Glossary, taxonomy, classification | Use governed tags, comments, domains/catalog structure, and external glossary integration where required. | Use tags, classifications, object metadata, and Horizon governance; integrate a formal glossary when enterprise term workflows exceed native metadata. | Use Purview Unified Catalog business concepts, governance domains, data products, classifications, and glossary capabilities alongside OneLake. | Publish approved terms and hierarchical classifications; link them to products, fields, metrics, controls, and owners. |
| Formal ontology and inference | Metric views and catalog metadata are not a general formal ontology. Model tables/views or integrate an RDF/property-graph and ontology tool for OWL/SHACL/inference needs. | Semantic Views are not automatically a formal ontology. Use relational models or external graph/semantic technology when formal inference is required. | Fabric IQ Ontology provides a native ontology item and is currently documented as Preview; evaluate preview limitations and change risk. | Do not relabel a semantic model as an ontology. Use formal ontology only when relationships, constraints, interoperability, or inference justify it. |
| Data quality rules and monitoring | Lakeflow Declarative Pipelines expectations validate records and define warn, drop, or fail behavior; supplement with profiling and SLO dashboards. | Data Metric Functions and data-quality monitoring measure data with system or custom metrics. | Microsoft Purview Data Quality provides rules, scans, scores, and governance workflows for registered assets. | Implement dimension-specific rules, thresholds, population/window, owner, incident response, and safe degradation. |
| Data contracts and schema change | Use table constraints/schema controls, pipeline expectations, versioned code, and catalog metadata; record consumer compatibility explicitly. | Use table/schema governance, policies, change controls, tags, and contract tests in delivery pipelines. | Use lakehouse/warehouse schemas, deployment pipelines, Purview metadata, and producer-consumer contract tests. | A platform schema is only part of a contract; add semantics, SLOs, compatibility, security, permitted use, and failure behavior. |
| Agent consumption | Expose governed views/metric views and catalog metadata through least-privileged retrieval or tools. | Expose governed semantic views and curated data products through scoped roles/tools. | Expose semantic models, governed OneLake data products, or ontology-backed context through approved Fabric services. | Return definition, units, time, source, quality status, lineage/provenance, and policy context with the value. |
Selection shortcut: use the platform already closest to the governed source and operating team. Introduce another technology only for a demonstrated semantic, ontology, graph, quality, or operational gap, not because its diagram looks impressive.
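The agent-consumption row asks that a value travel with its definition, units, time, source, quality status, lineage, and policy context. One platform-neutral way to picture that is an envelope around each fact. The sketch below is an assumption-laden illustration: `GovernedFact`, `to_agent_payload`, the `"pass"` status value, and all field names are hypothetical and do not correspond to any Databricks, Snowflake, or Fabric API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GovernedFact:
    name: str
    value: Any
    definition: str
    unit: str
    as_of: str                      # ISO 8601 timestamp of the observation
    source: str
    quality_status: str             # "pass" or "fail" in this sketch
    lineage_ref: str                # pointer into the catalog's lineage record
    permitted_uses: frozenset[str]


def to_agent_payload(fact: GovernedFact, requested_use: str) -> dict[str, Any]:
    """Return the value with its context, or withhold it when quality fails."""
    if requested_use not in fact.permitted_uses:
        raise PermissionError(f"{fact.name} is not approved for {requested_use!r}")
    usable = fact.quality_status == "pass"
    return {
        "name": fact.name,
        "value": fact.value if usable else None,  # never hand over a failed value
        "usable": usable,
        "definition": fact.definition,
        "unit": fact.unit,
        "as_of": fact.as_of,
        "source": fact.source,
        "quality_status": fact.quality_status,
        "lineage_ref": fact.lineage_ref,
        "requested_use": requested_use,
    }
```

Withholding the value while still returning the context lets the agent explain why it cannot act, instead of silently falling back to an ungoverned source.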
Implementation method
Name the agent decision, population, material consequence, latency, autonomy, jurisdictions, and failure tolerance. Start with 10–20 decision-critical facts.
For each fact, record current labels, definitions, source, owner, units, grain, time basis, codes, calculations, consumers, and known disagreement.
Run workshops with business owner, steward, producer, consumer, architect, risk, and domain expert. Approve one definition or document contextual variants.
Organize terms and reference codes into governed hierarchies. Assign stable identifiers; define synonyms, mappings, valid values, version, and change authority.
Define entities, attributes, measures, dimensions, grain, units, time, relationships, and calculation logic in the platform semantic layer.
Write competency questions first. Create formal concepts, relationships, constraints, and inference only when the use case needs connected reasoning or interoperability.
Version structure and semantics together. Include ownership, classification, quality SLOs, lineage, compatibility, security, retention, permitted use, and incident behavior.
Profile the baseline, define decision-specific rules and thresholds, test at ingestion and transformation, and select warn, quarantine, reject, or safe-degrade responses.
Connect glossary, taxonomy, semantic model, physical assets, lineage, dashboards, incidents, and owners in the catalog. Monitor trends and consumer impact.
Sample facts and decisions, inject failures, reconstruct outcomes, review access and change, remediate findings, and reassess after material semantic or source change.
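The step on versioning structure and semantics together can be illustrated with a compatibility check between two contract versions. This is a minimal sketch under assumptions: a contract is reduced to a mapping of field name to `FieldSpec`, the breaking-change rules shown (removal, type change, unit change, a required field becoming nullable) are one plausible policy rather than a standard, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    dtype: str      # structural attribute, e.g. "timestamp"
    unit: str       # semantic attribute, e.g. "UTC" or "km"
    nullable: bool


def breaking_changes(
    old: dict[str, FieldSpec], new: dict[str, FieldSpec]
) -> list[str]:
    """List changes that existing consumers of `old` cannot absorb safely."""
    problems: list[str] = []
    for name, old_spec in old.items():
        new_spec = new.get(name)
        if new_spec is None:
            problems.append(f"{name}: removed")
            continue
        if new_spec.dtype != old_spec.dtype:
            problems.append(f"{name}: type {old_spec.dtype} -> {new_spec.dtype}")
        if new_spec.unit != old_spec.unit:
            # Same type, different meaning: a purely semantic break.
            problems.append(f"{name}: unit {old_spec.unit} -> {new_spec.unit}")
        if new_spec.nullable and not old_spec.nullable:
            problems.append(f"{name}: became nullable")
    # Fields that exist only in `new` are treated as non-breaking additions.
    return problems
```

The unit check is the point of the sketch: a schema-only comparison would accept a change from UTC to local time, because the type is unchanged, even though every downstream calculation would shift.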
Definition of done