
Understanding Unicode: Code Points, UTF-8, and Beyond

Published March 1, 2025

What is Unicode?

Unicode is a universal character encoding standard that assigns a unique number, called a code point, to each character it covers. Its repertoire spans Latin, Cyrillic, Chinese, Arabic, emoji, mathematical symbols, and much more.

Before Unicode, a patchwork of encoding standards (ASCII, ISO 8859, Shift_JIS, etc.) each covered a limited set of characters, and the same byte value could stand for different characters depending on which encoding was in use. This made it hard to reliably mix scripts in a single document. Unicode solved this by defining one character set shared by all writing systems.

Code point notation

A Unicode code point is written as U+ followed by four to six hexadecimal digits. For example:

  • U+0041 — Latin capital letter A
  • U+00A9 — Copyright sign ©
  • U+1F600 — Grinning face 😀
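
The notation above can be reproduced in a few lines of Python. This is a minimal sketch: `format_code_point` is a hypothetical helper name chosen for illustration, built on the built-in `ord()` and `chr()` functions.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"


for ch in ["A", "©", "😀"]:
    print(ch, format_code_point(ch))

# chr() goes the other way, from an integer code point back to a character.
assert chr(0x1F600) == "😀"
```

Padding with `04X` gives four digits for small code points and lets larger ones grow to five or six digits, which matches the convention described above.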

As of Unicode 16.0 (released September 2024), the standard defines 154,998 characters across 168 scripts.

Why it replaced ASCII

ASCII encodes only 128 characters (code points 0–127): English letters, digits, basic punctuation, and control characters. That's nowhere near enough for the world's languages. Unicode is a superset of ASCII: the first 128 Unicode code points are identical to ASCII, so existing ASCII text is already valid Unicode text (and, byte for byte, valid UTF-8).
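
One way to see the superset relationship is to compare the bytes Python produces with its built-in `ascii` and `utf-8` codecs. The sketch below assumes a hypothetical helper named `same_bytes_in_ascii_and_utf8`; it is an illustration, not part of any library.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Hypothetical helper: True if text encodes to identical bytes in both codecs."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside ASCII's 128 code points.
        return False


# The ASCII value of "A" and its Unicode code point are the same number.
assert ord("A") == 65 == "A".encode("ascii")[0]

print(same_bytes_in_ascii_and_utf8("Hello, world!"))
print(same_bytes_in_ascii_and_utf8("café"))
```

The first call succeeds because every character is below code point 128; the second fails because "é" has no ASCII representation.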

Encodings: UTF-8, UTF-16, UTF-32

Unicode defines code points, but a separate encoding determines how those code points are stored as bytes.

  • UTF-8 — Variable width (1–4 bytes per code point). ASCII characters use 1 byte, making it backwards-compatible with ASCII. The dominant encoding on the web (used by over 98% of websites).
  • UTF-16 — Variable width (2 or 4 bytes per code point). Used internally by JavaScript, Java, and Windows. Characters outside the Basic Multilingual Plane (above U+FFFF) require a surrogate pair of two 16-bit code units.
  • UTF-32 — Fixed width (4 bytes per code point). Simple indexing but wasteful for text that's mostly ASCII. Rarely used for storage or transmission.
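
The size differences between the three encodings can be checked directly. The following is a minimal sketch; `encoded_sizes` and `surrogate_pair` are hypothetical helper names, and the big-endian codec variants (`utf-16-be`, `utf-32-be`) are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Hypothetical helper: byte length of ch in each encoding (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Hypothetical helper: split a code point above U+FFFF into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


for ch in ["A", "©", "中", "😀"]:
    print(ch, encoded_sizes(ch))

high, low = surrogate_pair(ord("😀"))
print(f"{high:04X} {low:04X}")
```

For U+1F600 the computed pair matches the bytes Python itself emits for `"😀".encode("utf-16-be")`.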

Unicode in practice

In most programming languages, you can insert a Unicode character by its code point using escape sequences:

  • JavaScript: "\u00A9" or "\u{1F600}"
  • Python: "\u00A9" or "\U0001F600"
  • HTML: &#xA9; or &#169;
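
As a rough check that these escapes all refer to the same characters, the Python sketch below compares them using only the standard library (`html.unescape` and `unicodedata.name`). The helper `describe` is a hypothetical name introduced for illustration.

```python
import html
import unicodedata

# Python string escapes: \u takes exactly 4 hex digits, \U takes exactly 8.
copyright_sign = "\u00A9"
grinning_face = "\U0001F600"

# Python also accepts the official character name via \N{...}.
assert "\N{COPYRIGHT SIGN}" == copyright_sign

# html.unescape resolves both the hexadecimal and the decimal HTML reference.
assert html.unescape("&#xA9;") == copyright_sign
assert html.unescape("&#169;") == copyright_sign


def describe(ch: str) -> str:
    """Hypothetical helper: 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(copyright_sign))
print(describe(grinning_face))
```

Note that Python's `\u` form cannot express code points above U+FFFF, which is why the emoji needs the eight-digit `\U` form.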