AQA A-Level Computer Science

20.1.1 Definition and Characteristics of Big Data

Big Data refers to data sets that are too vast, fast-changing, or diverse for traditional systems to store, manage, or process efficiently.

What is Big Data?

Big Data is a term used to describe extremely large, rapidly changing, and complex data sets that cannot be handled using traditional data processing tools such as relational databases. These data sets often involve massive volumes of information being generated in real time from a wide range of sources. They may be structured, semi-structured, or unstructured and usually contain valuable insights that can only be uncovered using advanced techniques.

To put it simply, Big Data includes data that is:

  • Too large to be stored on a single machine or processed efficiently using conventional tools.

  • Too fast in its rate of generation and arrival to be handled using batch processing methods.

  • Too varied in its formats and structures to fit into standard databases and schemas.

Modern organisations must collect, store, and analyse these massive data sets to remain competitive, particularly in industries like finance, health care, marketing, social media, and scientific research.

The Three V’s of Big Data

FAQ

Scalability is crucial in Big Data systems because data volumes are constantly increasing, often unpredictably. Traditional systems that rely on vertical scaling (adding more power to a single machine) quickly become limited by hardware constraints and costs. Horizontal scaling, on the other hand, involves adding more machines to a network, distributing data and processing tasks across multiple nodes. This approach allows systems to handle much larger datasets without overwhelming a single server. It also improves fault tolerance, as failure in one node doesn't bring down the entire system. Additionally, it offers flexibility—new nodes can be added or removed as needed based on demand. This is especially important in Big Data environments where workloads fluctuate. Technologies like distributed file systems (e.g. Hadoop HDFS) and parallel processing frameworks (e.g. Apache Spark) are built with horizontal scaling in mind, enabling systems to remain responsive, reliable, and efficient even as data volumes grow into petabytes or beyond.
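As a rough illustration of horizontal scaling, the sketch below spreads record keys across a list of nodes by hashing each key. The node names, the `node_for` and `distribute` helpers, and the modulo scheme are assumptions made for this example; they are not how Hadoop or Spark partition data internally.

```python
import hashlib
from collections import defaultdict


def node_for(key: str, nodes: list) -> str:
    """Pick a node for a key by hashing it (simple modulo partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node each one is assigned to."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return dict(placement)


# Hypothetical workload: 1000 records, first on three nodes, then on four.
keys = [f"record-{i}" for i in range(1000)]
three = distribute(keys, ["node-a", "node-b", "node-c"])
four = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

# Adding a node lowers the share of records each machine has to hold.
print({node: len(items) for node, items in three.items()})
print({node: len(items) for node, items in four.items()})
```

Plain modulo hashing moves most keys to a different node whenever the node count changes, which is why production systems tend to use techniques such as consistent hashing instead.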

Data redundancy refers to storing multiple copies of the same data across different nodes in a distributed system. In Big Data environments, where massive volumes of data are stored across many machines, redundancy is essential for ensuring data reliability and availability. Hardware failures are not just likely but expected, due to the sheer number of servers involved. Redundancy ensures that if one server or storage device fails, another copy of the data exists elsewhere and can be retrieved without disruption. Systems like the Hadoop Distributed File System (HDFS) store three copies of each data block by default, each on a different node. This redundancy can also speed up data access, as read requests can be served by the nearest or least busy node holding a copy. It therefore supports load balancing for reads; writes, by contrast, must be applied to every copy, so replication adds write overhead rather than reducing it. Without redundancy, the failure of a single machine could result in significant data loss or system downtime.
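The idea of block replication can be sketched in a few lines. This is a simplified model, assuming a round-robin placement policy and hypothetical node and block names; real HDFS placement also takes rack layout into account, which is not modelled here.

```python
def place_replicas(block_ids, nodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are distinct while rf <= len(nodes).
        placement[block] = [
            nodes[(i + r) % len(nodes)] for r in range(replication_factor)
        ]
    return placement


def read_block(block, placement, failed_nodes):
    """Return the first live node holding the block, or None if every copy is lost."""
    for node in placement[block]:
        if node not in failed_nodes:
            return node
    return None


# Hypothetical five-node cluster storing four blocks.
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
placement = place_replicas(["blk-0", "blk-1", "blk-2", "blk-3"], nodes)

# blk-0 lives on node-1, node-2 and node-3, so losing node-1 is harmless.
print(read_block("blk-0", placement, failed_nodes={"node-1"}))
# Only when all three holders fail does the block become unavailable.
print(read_block("blk-0", placement, failed_nodes={"node-1", "node-2", "node-3"}))
```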

Schema-on-read is a data processing approach used in many Big Data systems where the structure of the data is only applied when it is read or analysed, rather than when it is written or stored. This contrasts with schema-on-write, which is the traditional model used in relational databases where data must conform to a predefined schema before being stored. Schema-on-read provides greater flexibility because it allows raw, unstructured, or semi-structured data to be stored without knowing in advance how it will be used. Analysts can later apply different structures depending on the use case or query requirements. This is especially useful in Big Data scenarios where data sources are varied and evolving, and the use cases for the data may change over time. Schema-on-read supports rapid data ingestion, allowing systems to collect and store data without delay. However, it may result in more complex analysis workflows since data must be parsed and interpreted at query time.
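A minimal sketch of schema-on-read, assuming the raw data is stored as JSON lines: the records are kept exactly as they arrived, and each analysis applies its own schema when it reads them. The sample events, field names, and the `read_with_schema` helper are invented for illustration.

```python
import json

# Raw events stored as-is; the records do not share a common set of fields.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2024-01-05"}',
    '{"user": "bob", "amount": "7", "device": "mobile"}',
    '{"user": "carol", "clicks": 3}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Fields missing from a record are returned as None; fields not named
    in the schema are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


# Two different "views" of the same stored data, chosen at query time.
sales_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
usage_view = list(
    read_with_schema(RAW_EVENTS, {"user": str, "device": str, "clicks": int})
)

print(sales_view)
print(usage_view)
```

The parsing and type conversion happen on every read, which is the extra query-time cost mentioned above.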

Metadata and data catalogues play a critical role in making Big Data manageable and usable. Metadata is data about data: it provides descriptive information such as data type, source, creation date, format, size, and relationships to other data sets. In large Big Data environments, where data is spread across different formats and storage locations, metadata helps users understand what the data is, where it comes from, and how it can be used. A data catalogue is an organised inventory of all datasets available in a system, typically enriched with metadata, user tags, quality scores, and usage statistics. It enables users to search, filter, and explore datasets more efficiently without needing to inspect the raw data directly. Data catalogues often include features like data lineage (tracking where data came from and how it has been transformed) and access control (managing who can see or edit certain data). Together, metadata and catalogues improve data discovery, reduce duplication, and support governance and compliance in complex Big Data infrastructures.
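To make the idea concrete, here is a toy in-memory catalogue. The `DatasetEntry` and `Catalogue` classes, their fields, and the example datasets are all hypothetical; real catalogue products offer far more (quality scores, access control, usage statistics), but the search and lineage lookups follow the same principle.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset (not the data itself)."""
    name: str
    source: str
    fmt: str
    size_bytes: int
    tags: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # lineage links


class Catalogue:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, tag=None, fmt=None):
        """Return names of datasets matching the given tag and/or format."""
        return sorted(
            e.name
            for e in self._entries.values()
            if (tag is None or tag in e.tags) and (fmt is None or e.fmt == fmt)
        )

    def lineage(self, name):
        """Walk derived_from links back towards the original sources."""
        chain = []
        for parent in self._entries[name].derived_from:
            chain.append(parent)
            if parent in self._entries:
                chain.extend(self.lineage(parent))
        return chain


catalogue = Catalogue()
catalogue.register(
    DatasetEntry("raw_clicks", "web server logs", "json", 5_000_000, {"web", "raw"})
)
catalogue.register(
    DatasetEntry("daily_clicks", "aggregation job", "parquet", 200_000,
                 {"web", "report"}, ["raw_clicks"])
)

print(catalogue.search(tag="web"))        # both datasets carry the "web" tag
print(catalogue.search(fmt="json"))       # only the raw logs are JSON
print(catalogue.lineage("daily_clicks"))  # where the report data came from
```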

Data lakes are central to modern Big Data architectures because they provide a scalable and flexible storage solution for vast amounts of raw data in various formats. Unlike traditional data warehouses, which store structured data in predefined tables with strict schemas, data lakes store all data—structured, semi-structured, and unstructured—in its native format. This includes logs, images, videos, audio, and documents alongside traditional table-based data. Data lakes use schema-on-read, which allows users to apply structure as needed at the time of analysis, rather than enforcing it beforehand. This enables faster data ingestion and supports diverse analytical needs from different departments within an organisation. Data lakes are typically built using distributed storage systems like Amazon S3 or Hadoop HDFS, making them highly scalable. In contrast, data warehouses are optimised for specific, repeated queries and business reporting, often using costly, high-performance hardware. Data lakes, therefore, offer greater flexibility, lower storage costs, and broader analytical capability across diverse Big Data workloads.
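The contrast between the two storage models can be sketched as follows. This is only an analogy under stated assumptions: an in-memory SQLite table with `NOT NULL` columns stands in for a warehouse with a strict schema, and a plain dictionary of paths to raw bytes stands in for a data lake. The table, paths, and records are invented for the example.

```python
import json
import sqlite3

records = [
    {"user": "alice", "amount": 12.5},
    {"user": "bob"},  # incomplete record: no amount
]

# "Warehouse": structure is enforced when data is written.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (user TEXT NOT NULL, amount REAL NOT NULL)"
)

# "Lake": anything can be stored in its native form, keyed by path.
lake = {}

accepted, rejected = 0, 0
for i, rec in enumerate(records):
    lake[f"raw/sales/{i}.json"] = json.dumps(rec).encode("utf-8")
    try:
        warehouse.execute(
            "INSERT INTO sales VALUES (?, ?)",
            (rec.get("user"), rec.get("amount")),
        )
        accepted += 1
    except sqlite3.IntegrityError:
        # The record does not fit the predefined schema, so it is refused.
        rejected += 1

# Non-tabular data has no place in the table but fits in the lake unchanged.
lake["raw/images/logo.png"] = b"\x89PNG placeholder bytes"

print(accepted, rejected, len(lake))
```

The lake keeps every record, including the incomplete one and the image, at the cost of deferring validation and interpretation to whoever reads the data later.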
