Big Data is often unstructured and complex, making traditional database models insufficient. Fact-based modelling and graph schemas offer scalable, adaptable ways to represent and analyse such data effectively.
Fact-based data modelling
Fact-based data modelling is a method of representing data where each individual fact expresses a single, atomic unit of information. This modelling technique focuses on simplicity, flexibility, and clarity, especially when data is too vast and complex for traditional structures like tables or spreadsheets.
What is a fact?
A fact is the smallest statement of truth within a data model. It captures exactly one piece of information, such as an attribute of an entity or a relationship between two entities.
Example 1:
"Alice owns Car123"
This fact links two entities (Alice and Car123) with a relationship ("owns").
Example 2:
"Car123 is red"
This fact represents an attribute of an entity.
Facts are often represented in the following format:
Entity – Relationship – Entity
or
Entity – Attribute – Value
This atomic structure ensures that each fact is easy to understand, update, and query.
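The two formats above can be sketched as a small in-memory fact store. This is only an illustration, not a prescribed implementation: the `Fact` type, the `query` helper, the `colour` attribute name, and the change of Car123 to blue are all assumptions made for the example.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One atomic statement: entity-relationship-entity or entity-attribute-value."""
    subject: str
    predicate: str
    obj: str


# The two example facts from the text, stored as independent units.
facts = {
    Fact("Alice", "owns", "Car123"),    # Entity - Relationship - Entity
    Fact("Car123", "colour", "red"),    # Entity - Attribute - Value
}


def query(store, subject=None, predicate=None, obj=None):
    """Return every fact matching the fields that are not None (sorted for stable output)."""
    return sorted(
        f for f in store
        if (subject is None or f.subject == subject)
        and (predicate is None or f.predicate == predicate)
        and (obj is None or f.obj == obj)
    )


# Hypothetical update: changing one attribute touches exactly one fact.
updated = set(facts)
updated.discard(Fact("Car123", "colour", "red"))
updated.add(Fact("Car123", "colour", "blue"))

alice_facts = query(facts, subject="Alice")
new_colour = query(updated, subject="Car123", predicate="colour")
```

Because every fact stands alone, the ownership fact is untouched when the colour changes, which is the property that makes the model easy to update and query.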
Key characteristics of fact-based models
Fact-based modelling provides several benefits over traditional relational models, especially for unstructured data:
FAQ
Graph schemas are typically stored in specialised database systems called graph databases, with Neo4j, Amazon Neptune, and OrientDB being popular examples. These databases store data as nodes, edges, and properties directly, rather than mapping them to tables as in relational databases. Internally, native graph databases often use index-free adjacency, where each node stores direct references to its connected nodes, allowing efficient traversal without the need for joins. Queries are executed using graph-specific query languages like Cypher (used in Neo4j) or Gremlin (from Apache TinkerPop). These languages enable pattern-based querying, where the user specifies the shape of the relationships they are interested in. For example, to find all people who purchased the same product, the query would follow Person → purchased → Product ← purchased ← Person. This structure allows complex relationship-driven questions to be answered efficiently, even on very large datasets. Many of these systems also support transactions with ACID guarantees and horizontal scaling for distributed environments, although the details vary by product.
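The traversal Person → purchased → Product ← purchased ← Person can be sketched in plain Python. This is a toy adjacency structure meant to show the idea of following stored references instead of joining tables. It is not how any particular graph database is implemented, and the `Graph` class, its method names, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


class Graph:
    """Toy graph: each node keeps direct references to its neighbours, per edge label."""

    def __init__(self):
        self.out_edges = defaultdict(lambda: defaultdict(set))  # node -> label -> targets
        self.in_edges = defaultdict(lambda: defaultdict(set))   # node -> label -> sources

    def add_edge(self, source, label, target):
        self.out_edges[source][label].add(target)
        self.in_edges[target][label].add(source)

    def co_purchasers(self, person):
        """Follow person -purchased-> product <-purchased- other, without any join."""
        result = defaultdict(set)
        products = self.out_edges.get(person, {}).get("purchased", ())
        for product in products:
            for other in self.in_edges.get(product, {}).get("purchased", ()):
                if other != person:
                    result[other].add(product)
        return dict(result)


g = Graph()
g.add_edge("alice", "purchased", "laptop")
g.add_edge("alice", "purchased", "phone")
g.add_edge("bob", "purchased", "laptop")
g.add_edge("carol", "purchased", "phone")
g.add_edge("dave", "purchased", "tablet")

shared = g.co_purchasers("alice")
```

Here `shared` maps each other buyer to the products they have in common with alice. The cost of the lookup depends on the number of edges touched, not on the total size of the graph.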
Graph schemas can effectively model temporal data by incorporating timestamps and time-related attributes directly into nodes and edges. There are two main strategies. First, properties on edges or nodes can include fields like createdAt, updatedAt, or expiresOn to indicate time-specific metadata. Second, temporal nodes can be created to represent moments or durations; these are then connected to other nodes via time-related relationships like occurredOn or validFrom. For instance, a transaction node might link to a date node via an executedOn edge. This makes it possible to track changes over time, analyse time-dependent behaviours, or model version histories. Some advanced graph systems also support temporal graph features that allow historical queries, such as identifying the state of the graph at a given point in time. This is particularly valuable for financial systems, monitoring logs, or any scenario where understanding how data evolves is critical. Temporal modelling enhances both analytical capabilities and data accuracy.
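A minimal sketch of the first strategy (time properties stored on edges) might look like the following. The `TemporalEdge` class, the `valid_from` and `valid_to` field names, the half-open interval convention, and the ownership dates are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalEdge:
    source: str
    label: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the edge is still valid

    def is_valid_on(self, day: date) -> bool:
        # Half-open interval [valid_from, valid_to): an assumption of this sketch.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def snapshot(edges, day):
    """Return the edges that existed on the given day, i.e. the graph state at that time."""
    return [e for e in edges if e.is_valid_on(day)]


# Hypothetical history: ownership of Car123 passes from Alice to Bob.
edges = [
    TemporalEdge("Alice", "owns", "Car123", date(2020, 1, 1), date(2022, 6, 1)),
    TemporalEdge("Bob", "owns", "Car123", date(2022, 6, 1)),
]

owners_2021 = [e.source for e in snapshot(edges, date(2021, 3, 15))]
owners_2023 = [e.source for e in snapshot(edges, date(2023, 1, 1))]
```

Closing an edge with an end date instead of deleting it keeps the history available for point-in-time queries.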
While graph schemas offer flexibility and performance for connected data, they do come with limitations. One key challenge is the lack of standardisation—different graph databases use different query languages and storage models, making migration or integration more difficult. Additionally, schema governance becomes harder as the graph grows, especially in schema-optional systems where structure is not enforced. This can lead to inconsistent naming conventions or data types across nodes and edges. Another limitation is that graph databases are generally optimised for relationship queries, but may not perform as well as relational databases for purely tabular or aggregate-heavy workloads, like summing large amounts of numerical data. Furthermore, distributed graph processing is complex due to the interconnected nature of data—partitioning a graph without breaking important relationships is a non-trivial task. Lastly, developer expertise is often limited, and building effective graph queries requires a different mindset than traditional SQL. These challenges must be weighed against the model’s benefits.
Graph schemas are ideal for integrating data from multiple heterogeneous sources due to their schema flexibility and relationship-centric design. When data from various systems—such as CRM tools, social media platforms, or sensor networks—needs to be combined, the graph model allows each source to contribute entities and relationships without enforcing a rigid, pre-defined structure. Nodes representing entities from different domains can be linked using meaningful relationships, and new properties or node types can be introduced dynamically without affecting the existing schema. For example, customer data from a CRM system and interaction logs from a website can be merged by linking customer nodes to activity nodes like "ClickedAd" or "SubmittedForm". Data can be incrementally ingested and transformed using ETL (Extract, Transform, Load) pipelines or data mapping tools that identify equivalent entities across datasets. Because graphs can store and query semi-structured and unstructured data together, they offer a seamless framework for semantic integration, enabling more comprehensive and intelligent analytics.
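The CRM and website example can be sketched as a small integration step. Everything specific here is an assumption for illustration: the record layouts, the use of a lower-cased email address as the key that identifies equivalent entities, the `performed` relationship name, and the `integrate` function itself.

```python
# Hypothetical source 1: customer records from a CRM system.
crm_records = [
    {"customer_id": "C1", "name": "Alice", "email": "Alice@example.com"},
    {"customer_id": "C2", "name": "Bob", "email": "bob@example.com"},
]

# Hypothetical source 2: interaction logs from a website.
web_events = [
    {"email": "alice@example.com", "action": "ClickedAd"},
    {"email": "alice@example.com", "action": "SubmittedForm"},
    {"email": "unknown@example.com", "action": "ClickedAd"},
]


def integrate(crm, events):
    """Merge both sources into one graph, linking customers to activity nodes."""
    nodes, edges, unmatched = {}, [], []
    by_email = {}  # entity-resolution index: normalised email -> customer node id

    for rec in crm:
        node_id = f"customer:{rec['customer_id']}"
        nodes[node_id] = {"type": "Customer", "name": rec["name"]}
        by_email[rec["email"].lower()] = node_id

    for i, event in enumerate(events):
        customer = by_email.get(event["email"].lower())
        if customer is None:
            unmatched.append(event)  # keep for later review instead of guessing
            continue
        activity_id = f"activity:{i}"
        nodes[activity_id] = {"type": event["action"]}  # new node type, no schema change
        edges.append((customer, "performed", activity_id))

    return nodes, edges, unmatched


nodes, edges, unmatched = integrate(crm_records, web_events)
```

Adding a third source later would only add more nodes and edges; the existing ones would not need to be restructured.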
Maintaining data consistency in distributed graph systems is complex but achievable through a combination of replication, partitioning strategies, and consistency models. Distributed graph databases face the trade-offs described by the CAP theorem: when a network partition occurs, a system must favour either consistency or availability, and the right choice depends on the use case. Systems like Neo4j offer ACID compliance for single-instance deployments, ensuring strong consistency. In distributed deployments, some systems relax this to eventual consistency, meaning all replicas converge over time. To manage updates across distributed nodes, graph databases implement synchronisation protocols and conflict resolution rules. Some systems partition the graph using community detection or sharding algorithms, aiming to keep related nodes together and reduce cross-partition communication. This helps preserve relationship integrity while still allowing for horizontal scalability. Write operations are often routed through a master (leader) node or coordinated via consensus algorithms like Raft to avoid conflicts. Logging and change-tracking mechanisms also ensure that data updates are traceable and recoverable, supporting transactional integrity at scale.
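To make the partitioning goal concrete, the sketch below counts how many edges cross partition boundaries under two assignments of a small graph. The graph, the node names, and both assignments are invented for illustration, and the community-based assignment is written by hand; the sketch does not run a community detection algorithm.

```python
def edge_cut(edges, assignment):
    """Count edges whose endpoints live on different partitions."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])


# Hypothetical graph: two tightly connected triangles joined by one bridge edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),   # community 1
    ("d", "e"), ("e", "f"), ("d", "f"),   # community 2
    ("c", "d"),                           # bridge between the communities
]

# Naive assignment: alternate nodes across two partitions, ignoring structure.
round_robin = {node: i % 2 for i, node in enumerate("abcdef")}

# Structure-aware assignment: keep each community on one partition (assigned by hand).
by_community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

naive_cut = edge_cut(edges, round_robin)      # 5 cross-partition edges
community_cut = edge_cut(edges, by_community)  # 1 cross-partition edge
```

Every cut edge is a traversal that would need network communication between machines, which is why keeping related nodes together matters for both performance and consistency.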
