Global Knowledge for AI
A Database-First Approach
By G. Sawatzky, embedded-commerce.com
August 27, 2025
Introduction
The Semantic Web envisioned intelligent machines understanding globally interconnected data. Two decades later, while this vision remains compelling, its web-document-centric foundations have faced significant limitations for modern AI needs. This article explores why that paradigm can create fundamental problems for structured data and proposes a database-first approach. By leveraging modern public APIs like GraphQL, this method aims to maintain clean architectural separation, deliver better performance, and provide the logical rigor that reliable, knowledge-aware AI may require.
The Problem: Web Documents Are Not Databases
Within the dynamic landscape of business, commerce, and enterprise AI, the need for robust and scalable knowledge management systems is essential. Tim Berners-Lee's vision for the Semantic Web grew naturally from his success with the World Wide Web. Technologies like HTTP, URIs, and hyperlinks solved the problem of linking documents across a distributed network. However, extending this document-centric paradigm to structured data with RDF and triple stores appears to have introduced architectural problems that persist today, particularly for the demands of enterprise-scale data.
A Flawed Data Model
The web's document model treats everything as markup, mixing structure and content. While this is effective for human-readable pages, it can lead to significant issues for structured data, potentially resulting in a loss of proven architectural discipline.
- Collapsed Abstraction Layers: Proven database systems maintain a clear separation between physical storage, logical schemas, and presentation layers. RDF often flattens these into "triples everywhere," potentially abandoning decades of architectural wisdom. This can conflate conceptual, logical, and physical layers.
- Schema-Instance Confusion: In RDF, ontology definitions and data instances look identical. While this offers representational convenience, it may blur the line between data structure and content, making both more difficult to manage, optimize, and evolve.
- Performance Penalties: Web-style loose coupling prioritizes eventual consistency and distributed linking. Enterprise systems, however, typically require data independence, which allows changes to the physical storage or logical schema without disrupting application programs. Benchmark studies suggest triple stores often struggle with large datasets and can suffer from significant performance variation on certain query types. This seems to align with the long-standing "one size does not fit all" criticisms of database architecture by experts like Michael Stonebraker.
The Higher-Arity Problem
Real-world relationships are not always simple pairs. Conceptual modeling methods like Object-Role Modeling (ORM) naturally handle n-ary fact types involving multiple entities. Triple stores, by contrast, are often implemented as fundamentally binary. Modeling a relationship like "Person ordered Product from Supplier on Date" can become an awkward collection of triples with an artificial "Order" node. This reification process may obscure the original semantics and can lead to a loss of the constraints that ORM models naturally express.
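To make the contrast concrete, here is a minimal Python sketch of the same quaternary fact stored once as an n-ary record and once reified into binary triples around an artificial node. The predicate names and the blank-node convention are illustrative assumptions, not taken from any particular triple store.

```python
# Database-first view: one record captures the whole quaternary fact atomically.
order_fact = {
    "person": "Alice",
    "product": "Widget",
    "supplier": "Acme",
    "date": "2025-08-27",
}

# Triple-store view: an artificial "Order" node must be introduced so the
# quaternary fact can be decomposed into binary triples (reification).
order_node = "_:order1"  # blank node existing purely to glue the triples together
triples = [
    (order_node, "orderedBy", "Alice"),
    (order_node, "hasProduct", "Widget"),
    (order_node, "fromSupplier", "Acme"),
    (order_node, "onDate", "2025-08-27"),
]

# Recovering the original fact requires re-joining all four triples on the
# artificial node; the atomicity of the original fact is no longer explicit.
reassembled = {pred: obj for subj, pred, obj in triples if subj == order_node}
```

Note that the n-ary record carries its arity as a single unit, while the triple form leaves the "these four triples belong together" constraint implicit in the blank node.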
The Solution: A Database-First Approach
Instead of starting with web architecture and adapting it for structured data, this article proposes building on the proven principles of industrial-strength database management systems and adding intelligent public interfaces on top. This approach aims to leverage the principles of foundational database theory to build a robust framework for knowledge-aware AI.
Object-Role Modeling: A Practical Semantic Layer
For practical use in today's neuro-symbolic AI, an ontology might be seen as more than a theoretical concept. It could be viewed as a structured, interpretable specification of a domain expressed through logic-governed constraints and formal semantics. An Object-Role Model, when developed rigorously, may serve as a powerful, machine-interpretable ontology that fully meets this definition, effectively forming the semantic layer of the knowledge architecture.
Object-Role Modeling is a conceptual methodology that uses a role-based approach to prioritize constraints and conceptual abstraction. It focuses on defining the world independent of any specific implementation and provides a precise semantic blueprint that various systems can implement. Its utility for explicit business rule modeling and robust enterprise information architecture is particularly noteworthy. This approach is an effective tool for building a practical semantic framework, which is explored in more detail in my other articles.
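As a rough illustration of how an ORM-style constraint is machine-checkable, the following Python sketch encodes a hypothetical binary fact type "Person works for Company" with a uniqueness constraint on the Person role (each person works for at most one company). The fact population and the helper function are illustrative assumptions, not ORM tooling.

```python
from collections import Counter

def uniqueness_violations(facts, role_index):
    """Return role values that appear more than once, violating a
    uniqueness constraint declared on that role of the fact type."""
    counts = Counter(fact[role_index] for fact in facts)
    return [value for value, n in counts.items() if n > 1]

# Population of the fact type "Person works for Company".
works_for = [
    ("Alice", "Acme"),
    ("Bob", "Acme"),
    ("Alice", "Globex"),  # Alice appears twice in the Person role
]

# The constraint violation is detected mechanically from the model.
violations = uniqueness_violations(works_for, role_index=0)
```

The point is that the constraint lives in the conceptual model and can be verified against any population, independent of how the facts are physically stored.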
GraphQL as a Public-Facing Knowledge Interface
Once a solid logical foundation is established (precisely defined by an ORM model and forming the semantic layer), that knowledge can be exposed through a flexible, modern API. GraphQL presents itself as a highly suitable public interface for this purpose, acting as the gateway to the underlying semantics.
- Precision and Efficiency: Unlike traditional REST APIs, GraphQL allows a client to specify exactly what data they need, potentially reducing over-fetching and under-fetching. This can be particularly valuable when exposing complex, interconnected data models to a wide range of external applications.
- Semantic Interoperability at Scale: ORM's rigorous approach to defining conceptual roles and logic-governed constraints can provide the precise, shared semantics necessary for true interoperability. When this detailed conceptual model is exposed via a strongly typed GraphQL schema, it can create a common language for data exchange across disparate systems and organizations. This may allow AI systems and human users to understand and query data not just structurally, but meaningfully, potentially enabling global knowledge sharing at a significant scale. This approach aligns with current trends toward modular, federated ontological architectures, where diverse data components within an enterprise can maintain autonomy while contributing to a unified, queryable knowledge graph.
- Intuitive Discovery: GraphQL APIs are self-documenting through their schema, which describes all possible data types and relationships. This can make it easier for developers and, more importantly, for AI systems and automated agents to discover and query the knowledge base without prior knowledge of the internal data structure.
- Strong Typing: The GraphQL schema provides a strong, explicit contract for the data. This typing can help prevent errors and promote the consistency that is critical for reliable AI applications.
- Practical Identity Management and IRI Resolution: GraphQL's federation model offers an elegant solution to the global identifier problem that often challenged the Semantic Web's reliance on HTTP URIs. Organizations could use optimal internal identifiers (like auto-incrementing integers or UUIDs) while the federation layer handles global uniqueness through namespace prefixing. For example, a customer with internal ID 100 in a local database might be represented externally as http://customer-service.embedded-commerce.com/customers/100, without requiring internal database modifications. This approach aims to provide internal efficiency and external consistency without sacrificing backward compatibility or evolution flexibility.
- Cross-Service Joins and Query Distribution: GraphQL federation excels at handling complex queries that span multiple services, a common challenge for distributed knowledge. The federation gateway intelligently plans queries across various backend systems, batches requests to minimize network overhead, and can stream results as they become available.
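The namespace-prefixing idea described above can be sketched in a few lines of Python. The helper functions are illustrative assumptions (a real federation gateway would do this mapping internally); the service URL follows the customer example given earlier.

```python
# Hypothetical sketch: the federation layer derives globally unique IRIs
# from efficient internal keys, and can recover the key from the IRI.
SERVICE_BASE = "http://customer-service.embedded-commerce.com"

def to_external_iri(entity_type: str, internal_id: int) -> str:
    """Map a local primary key to a globally unique external IRI."""
    return f"{SERVICE_BASE}/{entity_type}/{internal_id}"

def to_internal_id(iri: str) -> int:
    """Recover the local primary key from an external IRI."""
    return int(iri.rsplit("/", 1)[1])

iri = to_external_iri("customers", 100)
# The mapping round-trips without any change to the internal database.
recovered = to_internal_id(iri)
```

Because the mapping is purely a boundary concern, the internal schema can keep compact integer keys while external consumers see stable, namespaced identifiers.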
Making Knowledge Discoverable
For a knowledge architecture to be truly valuable, its underlying semantics must be easily discoverable and consumable at scale, especially by autonomous agents. This goes beyond simple schema introspection and requires a dedicated strategy for discovery. Several practical approaches, much lighter and more effective than traditional Semantic Web crawling, are now possible:
- Schema Registries: A centralized but lightweight registry can act as a knowledge hub. Organizations can publish their GraphQL schema metadata (such as schema fingerprints, basic domain categories, endpoint URLs, and access patterns) to these registries. LLMs or other agents can then query this registry to find and analyze schemas, identifying clusters of similar data models and business domains.
- DNS-based Discovery: Using existing DNS infrastructure, schemas can be discovered by publishing a standard DNS TXT record or using a consistent subdomain pattern (e.g., graphql-schema.domain.com) that returns the schema metadata. This allows discovery to piggyback on established, resilient internet infrastructure.
- Crawling and Introspection: Automated crawlers can discover GraphQL endpoints by looking for common URL patterns or by following simple "semantic beacons" in HTML markup (e.g., a <link rel="graphql-schema"> tag). Once an endpoint is found, an introspection query can be used to pull the full schema, which an AI can then process to understand the available data and its structure.
- Family Resemblances: A practical approach to interoperability is to use the concept of "family resemblance" to organize different yet related knowledge domains. This allows for interoperability by identifying commonalities without needing a single, monolithic, universal schema. (This concept is explored in a separate article: Family Resemblances: A Solution for Knowledge Interoperability).
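As one possible sketch of the introspection step, the Python fragment below parses an introspection response to list the object types an endpoint exposes. The introspection query itself is standard GraphQL; the helper function and the mocked response payload are assumptions (in practice the query would be POSTed to the discovered endpoint).

```python
import json

# Minimal standard introspection query asking for type names and kinds.
INTROSPECTION_QUERY = "{ __schema { types { name kind } } }"

def extract_object_types(response_body: str) -> list:
    """Pull the names of OBJECT types out of an introspection response."""
    data = json.loads(response_body)
    types = data["data"]["__schema"]["types"]
    return [t["name"] for t in types if t["kind"] == "OBJECT"]

# Stand-in for what a discovered endpoint might return (mocked here so the
# sketch is self-contained; no network call is made).
mock_response = json.dumps({
    "data": {"__schema": {"types": [
        {"name": "Customer", "kind": "OBJECT"},
        {"name": "Order", "kind": "OBJECT"},
        {"name": "String", "kind": "SCALAR"},
    ]}}
})

discovered = extract_object_types(mock_response)
```

An agent could feed the discovered type names into the schema-registry or family-resemblance clustering described above to decide which endpoints share a domain.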
The New Context: The Role of AI
The rise of Large Language Models (LLMs) has undeniably changed the knowledge representation landscape. Instead of machines reading semantic markup, systems now appear able to understand and reason over natural language at an unprecedented scale.
The future may lie in combining LLMs' natural language capabilities with the formal logical reasoning of a clean knowledge representation. This hybrid intelligence could leverage the strengths of both neural and symbolic systems. A well-structured, ORM-based database might become the ideal foundation for knowledge-grounded AI by:
- Potentially Reducing Hallucinations: LLMs could query the structured knowledge base for factual verification.
- Enabling Citations: The structured data may allow LLMs to cite specific sources and track information provenance, potentially addressing concerns about the reliability of AI-generated content.
- Providing Domain Expertise: High-quality, domain-specific knowledge representations could provide LLMs with expert-level knowledge where training data may be limited.
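A minimal sketch of the fact-verification idea follows, assuming a toy in-memory knowledge base and a hypothetical verify helper (neither is from the article); a real system would query the structured store through its GraphQL interface instead.

```python
# Illustrative knowledge base: (subject, predicate) -> value.
knowledge_base = {
    ("Acme", "headquartered_in"): "Springfield",
    ("Acme", "founded"): "1999",
}

def verify(subject: str, predicate: str, claimed_value: str):
    """Check an LLM-produced claim against the knowledge base.

    Returns (verdict, source): verdict is True/False when the KB can
    decide, None when it is silent; source is the (subject, predicate)
    key, which can be surfaced as a citation for provenance.
    """
    key = (subject, predicate)
    if key not in knowledge_base:
        return None, None  # KB is silent: flag the claim for review
    return knowledge_base[key] == claimed_value, key

# An LLM claim that contradicts the KB is caught, with a citable source.
verdict, source = verify("Acme", "headquartered_in", "Shelbyville")
```

The returned source key is what enables citation and provenance tracking: the AI can point at the exact stored fact that confirmed or contradicted its claim.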
Conclusion
The Semantic Web's vision of globally accessible knowledge is still worth pursuing, but its architectural foundation appears to have been flawed. This article suggests that the solution may not lie in extending the web's document model to data, but rather in starting with proven database principles. By building intelligent public interfaces like GraphQL on top of a solid logical foundation provided by industrial-strength database management systems and Object-Role Modeling (acting as the semantic layer), this approach aims to deliver the performance, reliability, and logical rigor that both enterprise systems and public knowledge require. The Object-Role Model itself provides the precise conceptual blueprint, independent of any specific implementation, potentially ensuring that the underlying semantics remain clear and consistent regardless of the chosen database or reasoning engine. This is a challenging and complex problem, and while this article does not attempt to address every aspect of it, this direction appears worthy of further research and exploration.
The future of knowledge-aware AI could involve building proper data architectures with intelligent interfaces, potentially fulfilling the original vision more completely and making knowledge truly accessible to humans, AI systems, and automated agents.
Other Sources for Further Research:
- Stonebraker, M., & Pavlo, A. (2024). What Goes Around Comes Around... And Around.... SIGMOD Record, 53(1).
- Byron, L., Schrock, N., & Schafer, D. (2015). GraphQL: A data query language. (Often referenced from early presentations/blog posts when GraphQL was open-sourced by Facebook).
- Apollo GraphQL Documentation & Blog. (Various articles on GraphQL Federation, schema design, and enterprise adoption. A good starting point for exploring practical GraphQL implementations at scale).
- Sequeda, J. (Various publications and talks on knowledge graphs, data integration, and the practical application of semantic technologies in modern data ecosystems). A leading researcher in knowledge graphs and database-to-ontology mapping, Sequeda's work on enterprise knowledge graph construction from relational databases provides foundational methodologies for database-first knowledge representation. While his approach typically leverages W3C standards (RDF/SPARQL), his emphasis on practical enterprise implementation and relational database integration shares common ground with the database-first philosophy advocated here, even when exploring alternative interface technologies.
- Sawatzky, G. (2025). Knowledge Engineering and the 'Shortcomings' of SQL. https://www.embedded-commerce.com/ke_sql.html
- Sawatzky, G. (2025). An ORM-Based Semantic Framework. https://www.embedded-commerce.com/An%20ORM-Based%20Semantic%20Framework.html
- Sawatzky, G. (2025). Is an Object-Role Model an Ontology? A Practical Guide. https://www.embedded-commerce.com/Is_an_Object-Role_Model_an_Ontology.html
- Sawatzky, G. (2025). Family Resemblances: A Solution for Knowledge Interoperability. https://www.embedded-commerce.com/family-resemblance.html