AQA A-Level Computer Science

14.5.2 ASCII and Unicode

Character encoding systems like ASCII and Unicode are essential for representing text in digital devices. They map each character to a number, so that computers can store, exchange, and display text consistently across different hardware and software environments.

Introduction to ASCII

The American Standard Code for Information Interchange (ASCII) is one of the earliest and most widely used character encoding systems. It was developed in the early 1960s by the American Standards Association (ASA), the body that later became the American National Standards Institute (ANSI), to standardise how characters are represented and stored in binary form by digital devices.

ASCII assigns a unique binary number to each character, allowing letters, digits, and symbols to be converted into a format that computers can process. It forms the foundation for more advanced character encoding systems, including Unicode.

Standard ASCII (7-bit)

The original ASCII system uses 7 bits per character, allowing for a total of 128 unique characters, which are represented using values from 0 to 127.
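As a small illustration of the 7-bit scheme, the Python sketch below converts a few characters to their ASCII values and 7-bit binary patterns. The helper name to_7bit is made up for this example; ord and format are standard Python built-ins.

```python
def to_7bit(ch: str) -> str:
    """Return the 7-bit binary pattern for a single ASCII character."""
    code = ord(ch)  # numeric code of the character
    if code > 127:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return format(code, "07b")  # zero-padded to 7 binary digits


total_codes = 2 ** 7  # 7 bits give 128 distinct patterns (0 to 127)

for ch in ("A", "a", "0"):
    print(ch, ord(ch), to_7bit(ch))
# A 65 1000001
# a 97 1100001
# 0 48 0110000
```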

These 128 characters are divided into two major groups:

  • Control characters (values 0–31, plus 127): These are non-printing characters used for controlling devices such as printers or for signalling within data streams. Examples include:

    • NUL (null character) – value 0

    • LF (line feed) – value 10

    • CR (carriage return) – value 13

    • DEL (delete) – value 127

  • Printable characters (values 32–126): These include:

    • Uppercase letters: ‘A’ to ‘Z’ (values 65–90)
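The two groups above can be sketched as a simple range check. This is only an illustration: the function name ascii_group is hypothetical, and it follows these notes in treating the space character (value 32) as printable and DEL (value 127) as a control character.

```python
def ascii_group(code: int) -> str:
    """Classify a numeric code using the ranges given in these notes."""
    if not 0 <= code <= 127:
        return "not ASCII"
    if code <= 31 or code == 127:  # 0-31 and DEL (127) are non-printing
        return "control"
    return "printable"             # 32-126


for code in (0, 10, 13, 65, 90, 127):
    print(code, ascii_group(code))
# 0 control
# 10 control
# 13 control
# 65 printable
# 90 printable
# 127 control
```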


FAQ

ASCII remains in use today because of its simplicity, efficiency, and deep integration into legacy systems. Each ASCII character needs only 7 bits (in practice it is usually stored in a single 8-bit byte), so it is compact and sufficient for systems that only require basic English text and common control characters. Many programming languages, network protocols, and file formats were originally built around ASCII, and maintaining compatibility with these standards is essential for reliability. Additionally, UTF-8, the most commonly used Unicode encoding, is backwards compatible with ASCII: any ASCII-encoded text is also valid UTF-8, byte for byte. This allows systems to retain ASCII-based operations while also supporting a broader character set through Unicode. For example, configuration files, source code, and command-line interfaces often still use only ASCII characters, especially when simplicity and minimal memory usage are key. In embedded systems with limited processing power and memory, ASCII is often still a good fit. Therefore, while Unicode dominates modern international applications, ASCII's persistence is a matter of practicality and efficiency in specific contexts.
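The backwards compatibility described above can be checked directly. The sketch below uses Python's built-in str.encode and bytes.decode; the sample string is made up for illustration.

```python
text = "PORT=8080"  # hypothetical config line using only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text, both encodings produce exactly the same bytes.
same_bytes = ascii_bytes == utf8_bytes

# Bytes written as ASCII can therefore be read back as UTF-8 unchanged.
decoded = ascii_bytes.decode("utf-8")

print(same_bytes, decoded == text, len(utf8_bytes))
# True True 9
```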

Unicode accommodates large character sets like Chinese, Japanese, and Korean (collectively known as CJK scripts) through careful organisation into designated blocks and the use of variable-length encodings. For these scripts, Unicode assigns tens of thousands of unique code points in dedicated ranges such as the CJK Unified Ideographs block. UTF-8, which is a variable-length encoding, stores these characters using three or four bytes. Although this increases the byte size compared to ASCII, it enables full representation of complex logographic characters. UTF-16 stores most of these characters in a single two-byte code unit and uses a four-byte surrogate pair for those outside the Basic Multilingual Plane. To avoid redundancy, Unicode applies a process called Han unification, which assigns a single code point to characters that are shared across Chinese, Japanese, and Korean, while language-specific appearance is handled through fonts and locale settings. Rendering engines use font files and language context to display the correct glyph even if the underlying code point is shared. This allows Unicode to support vast scripts without inflating the core encoding system unnecessarily.
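To make the byte counts concrete, the sketch below measures how many bytes a few sample characters need in UTF-8 and UTF-16. The helper name encoded_sizes and the choice of sample characters are assumptions for this example; "utf-16-le" is used so that Python does not add a byte order mark to the count.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "Latin A (U+0041)": "A",
    "CJK ideograph (U+4E2D)": "\u4e2d",       # inside the Basic Multilingual Plane
    "CJK ideograph (U+20000)": "\U00020000",  # outside it, needs a surrogate pair
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
# Latin A (U+0041) (1, 2)
# CJK ideograph (U+4E2D) (3, 2)
# CJK ideograph (U+20000) (4, 4)
```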

Combining characters in Unicode are special code points that modify the character that precedes them, typically used for accents, diacritics, or phonetic markers. For example, the combining acute accent (U+0301) can be added to the letter ‘e’ (U+0065) to form ‘é’ without using the precomposed code point (U+00E9). This approach allows Unicode to represent thousands of character variants without assigning a unique code point to each. Combining characters are essential for supporting languages with rich diacritical systems, such as Vietnamese or some African and Eastern European languages. They also help reduce redundancy in the Unicode table and make it more flexible for linguists and scholars who work with rare or historic scripts. However, they add complexity in processing, as software must correctly combine and display base characters with one or more combining characters, and two strings that look identical may contain different code points. For this reason, Unicode also defines normalisation forms, such as NFC and NFD, that convert combined sequences into single precomposed characters or vice versa, ensuring consistency in text handling.
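The ‘é’ example above can be reproduced with Python's standard unicodedata module. This is a minimal sketch: NFC composes sequences into precomposed characters where one exists, and NFD decomposes them again.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by the combining acute accent
precomposed = "\u00e9"  # the single precomposed code point for 'é'

# Both display as 'é', but they are different code point sequences.
print(len(decomposed), len(precomposed), decomposed == precomposed)
# 2 1 False

# Normalising both strings to the same form makes them comparable.
composed = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, split_again == decomposed)
# True True
```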

Unicode defines code points for characters, but it does not control how those characters are visually rendered. That responsibility lies with the fonts and rendering engines used by the system. Operating systems, applications, and web browsers rely on font files that map Unicode code points to specific graphical glyphs. If a font contains a glyph for a particular code point, the system can draw that character; the exact appearance still depends on the font's design. To limit differences across systems, font designers follow typographic standards and Unicode Consortium guidelines when assigning glyphs to code points. Additionally, rendering engines use locale information to determine script direction, contextual forms, and shaping behaviours, especially for scripts like Arabic or Devanagari (used to write Hindi). When a character is not supported by a font, the system may substitute a fallback font or display a placeholder (such as a box or question mark). Web developers often embed or specify web fonts to ensure their text appears consistently, regardless of the user's platform.

Converting from ASCII to Unicode is typically straightforward because Unicode was designed to include ASCII as a subset: characters in the 0 to 127 range have the same code points in both systems. However, issues arise when dealing with Extended ASCII, where the upper 128 characters (128–255) are not standardised and vary between encoding schemes such as ISO 8859-1, Windows-1252, and MacRoman. These characters do not map consistently to Unicode without knowing the source encoding. If the wrong encoding is assumed, text can become garbled, a result known as mojibake, in which nonsensical characters replace the original ones. Additionally, converting Unicode back to ASCII may result in data loss because ASCII cannot represent accented letters, non-Latin scripts, emoji, or most special symbols. Software must either omit unsupported characters or replace them with approximate equivalents, which can affect meaning. To manage this, careful encoding detection, error handling, and the use of libraries that support encoding conversions are essential in applications dealing with international text.
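Both problems described above, mojibake and data loss, can be shown in a few lines. The sketch uses Python's built-in codecs ("latin-1" is Python's name for ISO 8859-1); the sample word is chosen only for illustration.

```python
original = "caf\u00e9"  # 'café'

# Mojibake: bytes written as UTF-8 but read back as ISO 8859-1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)
# cafÃ©

# Data loss: ASCII has no code for 'é', so it must be replaced or dropped.
replaced = original.encode("ascii", errors="replace").decode("ascii")
dropped = original.encode("ascii", errors="ignore").decode("ascii")
print(replaced, dropped)
# caf? caf

# With the default strict error handling, the conversion is refused instead.
try:
    original.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Decoding with the encoding that was actually used to write the bytes (here UTF-8) recovers the original text, which is why knowing the source encoding matters.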
