Characters, such as letters and symbols, must be stored in binary for computers to process them. This requires specific character sets like ASCII and Unicode.
What Is Character Encoding?
Character encoding is the method of converting characters into a form that computers can understand and process. Computers can only deal with binary data—strings of 0s and 1s—so every letter, number, and symbol must be represented in binary.
Character encoding ensures:
Text can be stored and retrieved accurately.
Different computers and systems can interpret text in the same way.
Communication between devices remains consistent across platforms.
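To make this concrete, here is a minimal Python sketch (the message text and the choice of UTF-8 are purely illustrative) showing a string being converted to bytes for storage or transmission and then decoded back:

```python
# Encode a string into bytes (the binary form a computer actually stores
# or transmits), then decode those bytes back into text.
message = "Hello!"                       # example text
encoded = message.encode("utf-8")        # characters -> bytes
decoded = encoded.decode("utf-8")        # bytes -> characters

print(encoded)             # b'Hello!'  (each byte here holds one character)
print(decoded == message)  # True: the text is retrieved exactly as stored
```

Because both sides agree on the encoding, the text comes back unchanged, which is exactly the consistency described in the list above.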
Binary Representation of Characters
Every character that appears on a screen—letters, numbers, punctuation, and control characters—has a unique binary code assigned to it. This allows characters to be stored in memory, transmitted over networks, and displayed correctly.
Important points about binary character representation:
Each character is given a unique binary value.
The number of bits used determines how many different characters can be represented.
Commonly used character sets are ASCII and Unicode.
ASCII: American Standard Code for Information Interchange
ASCII is one of the earliest character encoding standards and is still widely used today, especially in simpler systems.
Features of ASCII
Originally designed as a 7-bit code, allowing for 128 characters (0–127).
Usually stored using 8 bits (1 byte) per character.
Extended ASCII uses the 8th bit, making 256 characters (0–255) possible.
Includes:
Uppercase and lowercase letters (A-Z, a-z)
Digits (0-9)
Common punctuation marks (e.g., !, @, #)
Control characters (e.g., carriage return, line feed)
Examples of ASCII Codes
A → 65 → 01000001
a → 97 → 01100001
0 → 48 → 00110000
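These values can be checked with a short Python sketch (the characters below are the three from the list above):

```python
# Print each character's decimal ASCII code and its 8-bit binary pattern.
for ch in ["A", "a", "0"]:
    code = ord(ch)                         # character -> decimal code, e.g. 'A' -> 65
    print(ch, code, format(code, "08b"))   # "08b" pads with leading zeros to 8 bits
```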
Importance of ASCII
Became the foundation for text encoding on early computers.
Ensures that basic text can be displayed and understood across different systems.
Limitations: Only supports English characters and very basic symbols, making it unsuitable for international languages.
Unicode: A Universal Character Set
Unicode was created to address the limitations of ASCII by allowing a much greater range of characters to be represented.
Features of Unicode
Supports over 143,000 characters covering many languages, symbols, and emojis.
Characters can be represented using different encoding forms (compared in the sketch after this list):
UTF-8: Uses 1 to 4 bytes per character.
UTF-16: Uses 2 or 4 bytes per character.
UTF-32: Uses 4 bytes per character.
Designed to be backward compatible with ASCII for the first 128 characters.
Supports scripts such as Chinese, Arabic, Cyrillic, Devanagari, and many others.
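A minimal Python sketch, with arbitrarily chosen sample characters, comparing how many bytes each encoding form uses (the little-endian codec names stop Python adding a byte-order mark, so the per-character sizes are visible):

```python
# Compare bytes per character in UTF-8, UTF-16 and UTF-32.
for ch in ["A", "€", "😊"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1 to 4 bytes, depending on the character
          len(ch.encode("utf-16-le")),  # 2 or 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes
```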
Examples of Unicode Codes
A → U+0041
a → U+0061
😊 (smiling face emoji) → U+1F60A
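These code points can be confirmed with Python's ord function, which returns a character's Unicode code point; a short sketch:

```python
# ord() gives the code point; format it in the usual U+XXXX hexadecimal style.
for ch in ["A", "a", "😊"]:
    print(ch, "U+" + format(ord(ch), "04X"))   # 'A' -> U+0041, '😊' -> U+1F60A
```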
Importance of Unicode
Enables global communication by supporting almost every written language.
Allows symbols and emojis to be represented.
Essential for modern websites, software applications, and operating systems.
Number of Bits and Character Set Size
The number of bits assigned per character directly affects the total number of characters that can be represented.
Key Points
More bits = more characters that can be represented.
7 bits → 2⁷ = 128 characters (original ASCII).
8 bits → 2⁸ = 256 characters (extended ASCII).
16 bits → 2¹⁶ = 65,536 characters (sufficient for many Unicode characters).
32 bits → 2³² = over 4 billion characters (enough for all possible symbols).
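These figures are simply powers of two, as the short Python sketch below confirms:

```python
# With n bits there are 2 ** n possible patterns, so 2 ** n possible characters.
for bits in [7, 8, 16, 32]:
    print(bits, "bits ->", 2 ** bits, "possible characters")
# 7 -> 128, 8 -> 256, 16 -> 65,536, 32 -> 4,294,967,296
```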
Practical Impact
Systems using more bits per character require more memory and storage space.
However, they provide greater flexibility and international support.
Logical Ordering of Character Sets
Character sets are logically ordered to make searching and organizing characters easier.
How Logical Ordering Works
Characters are typically ordered alphabetically and numerically.
Related characters are grouped together for efficiency.
For example, in ASCII:
Uppercase A–Z characters are sequential (65–90).
Lowercase a–z characters are sequential (97–122).
Digits 0–9 are sequential (48–57).
Benefits of Logical Ordering
Simplifies sorting algorithms.
Makes text comparison operations more efficient.
Allows for easy manipulation of character codes (e.g., converting between uppercase and lowercase by adding or subtracting a fixed value).
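Because the uppercase and lowercase blocks are both sequential and exactly 32 codes apart, case conversion is a fixed offset; a minimal Python sketch:

```python
# Uppercase and lowercase ASCII letters differ by a fixed offset of 32.
print(ord("a") - ord("A"))             # 32
print(chr(ord("g") - 32))              # 'G': lowercase -> uppercase
print(chr(ord("Q") + 32))              # 'q': uppercase -> lowercase
print(ord("A") < ord("B") < ord("Z"))  # True: codes follow alphabetical order
```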
Differences Between ASCII and Unicode
Coverage
ASCII: Limited to English letters, digits, basic punctuation, and control codes.
Unicode: Covers almost all written languages, technical symbols, and emojis.
Bit Usage
ASCII: 7 or 8 bits per character.
Unicode: Variable; UTF-8 uses 8–32 bits per character, UTF-16 uses 16 or 32 bits, and UTF-32 always uses 32 bits.
Compatibility
Unicode includes ASCII as a subset, making it backward compatible with systems that rely on ASCII.
Storage and Memory
ASCII requires less memory because of its limited character set.
Unicode can be memory-intensive, especially in UTF-32, but is essential for supporting a diverse range of symbols and languages.
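As a rough illustration of the storage difference, a short Python sketch encoding the same English text three ways (the sample string is arbitrary):

```python
# The same text occupies different amounts of storage in different encodings.
text = "Hello, world"                 # 12 characters of plain English text
print(len(text.encode("ascii")))      # 12 bytes: 1 byte per character
print(len(text.encode("utf-8")))      # 12 bytes: identical for ASCII-only text
print(len(text.encode("utf-32-le")))  # 48 bytes: 4 bytes per character
```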
Use Cases
ASCII: Simple systems, basic English text files, early internet protocols.
Unicode: Modern websites, multilingual databases, international software development.
Binary Representation of ASCII
Understanding the binary representation of ASCII characters is essential for OCR GCSE Computer Science students.
How to Represent ASCII in Binary
Each character corresponds to a decimal number.
The decimal number is converted to an 8-bit binary number.
Steps:
Find the ASCII decimal code for the character.
Convert the decimal code to binary.
Example:
Character: B
ASCII decimal code: 66
Binary representation: 01000010
Important Details
Leading zeros are added to ensure that all binary representations are 8 bits long.
Consistency in the length of binary codes is crucial for accurate data processing.
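The two steps can be followed directly in Python, where format(code, "08b") performs the conversion and pads with leading zeros so every code is 8 bits long:

```python
# Step 1: look up the ASCII decimal code; Step 2: convert it to 8-bit binary.
ch = "B"
code = ord(ch)                 # 66
binary = format(code, "08b")   # '01000010', padded to 8 bits with leading zeros
print(ch, code, binary)        # B 66 01000010
```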
Impacts of Character Sets
The choice of character set impacts:
Software development: Must ensure compatibility with user languages.
Data storage: Larger character sets require more storage.
Data transmission: More bits per character can slow down data transmission without efficient encoding methods.
In Summary
ASCII is suitable for simple English text and requires less storage.
Unicode is necessary for global applications but demands more sophisticated handling and more memory.
FAQ
Why was Unicode needed when ASCII already existed?
ASCII was originally developed for early computers in the United States, focusing only on English letters, digits, and a few punctuation marks. It was extremely limited because it used just 7 or 8 bits, meaning it could represent only 128 or 256 characters. As computing became more global, people needed a way to represent characters from languages like Chinese, Arabic, Japanese, and Russian, which have thousands of unique symbols. ASCII simply could not handle that diversity. Unicode was created to provide a universal standard that could cover all characters from every writing system in the world. It includes a massive range of symbols, special characters, and even emojis. Unicode's flexibility ensures that text is displayed correctly across different devices, languages, and countries, making it essential for global communication, international business, multilingual websites, and cross-platform software development. Without Unicode, the internet and many modern applications would not function smoothly for international users.
Why is UTF-8 the most widely used Unicode encoding?
UTF-8 is one of the most popular encoding formats for Unicode characters because it is both space-efficient and backward compatible with ASCII. It uses a variable number of bytes (from 1 to 4) to represent characters. The first 128 Unicode characters (which match ASCII) are stored in a single byte, making it extremely efficient for texts that are mostly in English. Characters beyond this range use 2, 3, or 4 bytes. This design allows UTF-8 to handle everything from simple English text to complex scripts like Chinese and Arabic without wasting memory unnecessarily. Another major advantage is that UTF-8 is self-synchronizing, meaning that if a byte is corrupted or lost, the system can recover quickly by looking for the next valid byte sequence. Its widespread adoption in web technologies, including HTML and JSON, and its ability to support both simple and complex characters seamlessly, make it the preferred encoding format worldwide.
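A brief Python sketch illustrating the variable byte lengths described above (the sample characters are chosen to hit the 1-, 2-, 3- and 4-byte cases):

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ["A", "é", "中", "😊"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), "byte(s):", encoded.hex())
# 'A' -> 1 byte (same as ASCII), 'é' -> 2, '中' -> 3, '😊' -> 4
```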
What are control characters in ASCII and what are they used for?
Control characters in ASCII are special codes that do not represent visible symbols but instead perform specific functions in data streams or text formatting. They occupy the ASCII values 0 to 31 and 127. Examples include the carriage return (CR, code 13), line feed (LF, code 10), and tab (TAB, code 9). Originally, control characters were used for managing printers and teletypes, instructing them to start new lines, move to the next page, or trigger an alert (like the bell sound). In modern computing, they still play an important role. For example, line feed and carriage return are essential for formatting text files properly across different operating systems. In network communication, control characters can signal the start or end of a transmission. Although users rarely interact with these characters directly, they are crucial behind the scenes for organizing data, ensuring smooth device communication, and formatting documents across platforms.
How is Unicode backward compatible with ASCII?
Unicode was specifically designed to maintain backward compatibility with ASCII to ease the transition from older systems to new globalized systems. The first 128 Unicode code points (U+0000 to U+007F) exactly match the ASCII character set. This means that any text written in standard 7-bit ASCII is automatically valid in Unicode without requiring any changes or re-encoding. For example, the letter "A" in ASCII (decimal 65) is the same as Unicode code point U+0041. Because of this careful design choice, files and systems originally built using ASCII can seamlessly integrate with Unicode environments. This compatibility minimizes disruptions when upgrading software and hardware to support Unicode. It also simplifies programming because developers can handle ASCII text and Unicode text in a unified way. This backward compatibility was one of the key reasons Unicode adoption was successful and widespread, allowing systems to support both legacy and modern content efficiently.
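A small Python sketch showing this backward compatibility in practice: bytes produced as plain ASCII decode unchanged as UTF-8:

```python
# Text encoded as 7-bit ASCII is already valid UTF-8.
ascii_bytes = "Hello".encode("ascii")   # ASCII-encoded bytes
print(ascii_bytes.decode("utf-8"))      # 'Hello': decodes with no changes
print(ord("A"), hex(ord("A")))          # 65 0x41: ASCII 65 is code point U+0041
```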
Can the same text be interpreted differently on different systems?
In theory, character encoding standards like ASCII and Unicode are universal and should be interpreted the same way on any compliant system. However, in practice, misinterpretations can occur if systems do not agree on which character encoding is being used. For example, if a file is created in UTF-8 but a program reads it as ISO-8859-1 (an older character set for Western European languages), some characters may display incorrectly, showing as strange symbols or question marks. This is known as "mojibake," where text becomes unreadable due to encoding mismatches. To prevent this, modern systems often include metadata, like a "charset" declaration in web pages, to explicitly state the encoding being used. Additionally, software developers often implement automatic encoding detection to avoid confusion. When systems consistently use and declare the correct encoding standard, especially Unicode formats like UTF-8, interoperability is greatly improved, and data corruption or display issues are minimized.
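A short Python sketch reproducing this kind of mismatch, with text saved as UTF-8 but read back as ISO-8859-1 (Latin-1); the sample word is arbitrary:

```python
# Encode text containing a non-ASCII character as UTF-8...
original = "café"
stored = original.encode("utf-8")     # b'caf\xc3\xa9'

# ...then decode it with the wrong character set (ISO-8859-1 / Latin-1).
misread = stored.decode("iso-8859-1")
print(misread)                        # 'cafÃ©': classic mojibake

# Decoding with the encoding that was actually used recovers the text.
print(stored.decode("utf-8"))         # 'café'
```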
Practice Questions
Explain how the number of bits used in a character set impacts the range of characters that can be represented. Give an example using ASCII and Unicode.
The number of bits in a character set determines how many different characters can be represented. More bits allow for a greater range of characters. ASCII uses 7 or 8 bits, allowing 128 or 256 characters, enough for English letters, digits, and basic symbols. Unicode, using up to 32 bits, can represent over 4 billion characters, including many global languages, emojis, and technical symbols. For example, ASCII can represent "A" and "B," but Unicode can represent complex Chinese characters and emojis. Therefore, using more bits increases flexibility and ensures global communication is possible across different platforms.
Describe the main differences between ASCII and Unicode and explain why Unicode is important in modern computing.
ASCII is a character set that uses 7 or 8 bits to represent basic English letters, digits, and symbols, totaling up to 256 characters. Unicode, on the other hand, uses up to 32 bits, allowing the representation of over 143,000 characters, covering nearly every written language and symbol in use today. Unicode is important because modern computing requires support for multiple languages and diverse symbols, such as emojis, in websites, software, and communication platforms. By using Unicode, systems can handle international content, making applications accessible and usable for people all around the world without compatibility issues.