Unicode, UTF-8, and Why "String Length" Lies to You
Codepoints, code units, grapheme clusters, surrogate pairs. Learn why JavaScript says "👨‍👩‍👧".length is 8, why emoji break your form validation, and how to actually count characters correctly.
Strings look simple until you encounter an emoji, a non-Latin script, or international names. "Length" isn't one number: it can be three different things depending on what you mean. Truncating a tweet, validating a username, or pricing per-character SMS all hit Unicode complexity that most code is blind to.
The layered model
Unicode separates several concepts that pre-Unicode encodings conflated:
- Character / Codepoint: a unique number assigned to each character. "A" is U+0041. "😀" is U+1F600. Roughly 155,000 codepoints are assigned as of Unicode 16.0.
- Encoding: how those numbers are turned into bytes. UTF-8, UTF-16, UTF-32 are all valid encodings of the same Unicode codepoints.
- Code unit: the fixed-size chunks an encoding splits into. UTF-8 uses 8-bit units; UTF-16 uses 16-bit units.
- Grapheme cluster: a single user-perceived "character". May span multiple codepoints. "👨‍👩‍👧" is one grapheme cluster but 5 codepoints.
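To make the distinction concrete, here is a quick sketch of how the three counts differ for the family emoji (plain JavaScript; Intl.Segmenter defaults to grapheme granularity):
const family = '👨‍👩‍👧';
family.length // 8 (UTF-16 code units)
[...family].length // 5 (codepoints)
[...new Intl.Segmenter().segment(family)].length // 1 (grapheme clusters)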
UTF-8: variable-width 8-bit encoding
UTF-8 encodes each codepoint in 1–4 bytes:
- U+0000–U+007F (ASCII): 1 byte. Backwards-compatible with ASCII.
- U+0080–U+07FF: 2 bytes (Latin Extended, Greek, Cyrillic, Hebrew, Arabic).
- U+0800–U+FFFF: 3 bytes (most Asian scripts, common symbols).
- U+10000–U+10FFFF: 4 bytes (rare scripts, emoji).
UTF-8 is the dominant web encoding (98%+ of websites). Use it everywhere unless you have a strong reason not to.
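As a rough sketch of the ranges above (the helper name utf8ByteCount is just for illustration), a codepoint's UTF-8 byte count comes down to a few comparisons:
// Illustrative helper: how many UTF-8 bytes a single codepoint needs.
// (Ignores the surrogate range U+D800–U+DFFF, which is not a valid codepoint.)
function utf8ByteCount(codepoint) {
  if (codepoint <= 0x7F) return 1;   // ASCII
  if (codepoint <= 0x7FF) return 2;  // Latin Extended, Greek, Cyrillic, ...
  if (codepoint <= 0xFFFF) return 3; // rest of the Basic Multilingual Plane
  return 4;                          // supplementary planes (emoji, rare scripts)
}
utf8ByteCount('A'.codePointAt(0)) // 1
utf8ByteCount('é'.codePointAt(0)) // 2
utf8ByteCount('中'.codePointAt(0)) // 3
utf8ByteCount('😀'.codePointAt(0)) // 4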
UTF-16: the JavaScript trap
JavaScript strings are sequences of UTF-16 code units. For codepoints above U+FFFF (most emoji), UTF-16 uses two 16-bit code units called a "surrogate pair."
'A'.length // 1 (one code unit)
'é'.length // 1 (single code unit)
'中'.length // 1 (single code unit)
'😀'.length // 2 (surrogate pair: D83D DE00)
'👨‍👩‍👧'.length // 8 (emoji + ZWJ + emoji + ZWJ + emoji)
The string "😀" is one character to humans and one codepoint, but two UTF-16 code units, so JavaScript reports its length as 2.
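You can see the split directly: charCodeAt reads individual UTF-16 code units, while codePointAt reassembles the surrogate pair (a small sketch):
'😀'.charCodeAt(0).toString(16) // 'd83d' (high surrogate)
'😀'.charCodeAt(1).toString(16) // 'de00' (low surrogate)
'😀'.codePointAt(0).toString(16) // '1f600' (the actual codepoint)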
Why this is a security and UX issue
Counting UTF-16 code units where you mean characters leads to truncation that splits surrogate pairs and produces malformed strings, length limits that reject names containing emoji or non-Latin scripts, and quota checks that disagree with what the user sees on screen.
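A minimal sketch of how this bites, assuming a naive length-based truncation (the name variable is hypothetical): slicing at a code-unit boundary can cut a surrogate pair in half, and the lone surrogate is rejected by APIs that expect well-formed strings.
const name = 'Ana😀'; // hypothetical user input: 4 graphemes, 5 code units
const truncated = name.slice(0, 4); // 'Ana\ud83d' (lone high surrogate)
// The result is malformed UTF-16: it renders as "Ana" plus a replacement
// character, and some APIs refuse it outright.
encodeURIComponent(truncated); // throws URIError: URI malformed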
Iterating correctly
For codepoints, use the spread operator or for-of:
[...'😀'].length // 1 (codepoints)
Array.from('😀').length // 1
'😀'.length // 2 (UTF-16 code units: wrong)
for (const ch of '😀a') {
  console.log(ch); // '😀', then 'a'
}
For grapheme clusters (the user-perceived "character"), use Intl.Segmenter:
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...seg.segment('👨‍👩‍👧 hi')];
console.log(segments.length); // 4 (family emoji + space + h + i)
Intl.Segmenter is the modern, correct way. Older browsers may need polyfills.
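Building on that, a grapheme-aware truncation helper is a natural use of Intl.Segmenter (the function name truncateGraphemes is purely illustrative):
// Illustrative helper: keep at most `max` user-perceived characters.
function truncateGraphemes(text, max, locale = 'en') {
  const seg = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of seg.segment(text)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}
truncateGraphemes('👨‍👩‍👧 hi', 2) // '👨‍👩‍👧 ' (family emoji + space), never half an emoji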
Normalization: the "same-but-different" problem
Unicode allows multiple ways to represent the same visual character:
- "Γ©" can be U+00E9 (single precomposed character)...
- ... or U+0065 + U+0301 (e + combining acute accent).
These look identical but are not equal as strings:
'\u00E9'.length // 1 (precomposed é)
'e\u0301'.length // 2 (e + combining accent)
'\u00E9' === 'e\u0301' // false
For comparison, normalize first:
'\u00E9'.normalize() === 'e\u0301'.normalize() // true (normalize() defaults to NFC)
Four normalization forms:
- NFC (Canonical Composition): default for most use. Compresses where possible.
- NFD (Canonical Decomposition): expands. Useful for diacritic-insensitive search.
- NFKC / NFKD: "compatibility" forms. Treats "ﬁ" (ligature) as "fi". Use when comparing text for search; avoid for storage.
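A quick check of the difference, using the fi ligature (U+FB01) mentioned above:
'\uFB01'.normalize('NFC') // 'ﬁ' (ligature preserved)
'\uFB01'.normalize('NFKC') // 'fi' (compatibility mapping applied)
'\uFB01le' === 'file' // false
'\uFB01le'.normalize('NFKC') === 'file'.normalize('NFKC') // true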
Common emojis as multi-codepoint sequences
Modern emojis often combine multiple codepoints with Zero-Width Joiners (ZWJ, U+200D):
- 👨‍👩‍👧: 👨 + ZWJ + 👩 + ZWJ + 👧 = 5 codepoints, 1 grapheme.
- 👍🏽: 👍 + skin tone modifier = 2 codepoints, 1 grapheme.
- 🇺🇸: Regional Indicator U + Regional Indicator S = 2 codepoints, 1 grapheme.
These render as one icon if the platform supports the combination, and as separate components otherwise. Always test on target platforms: Slack, Twitter, iOS, Android, and Windows all render some sequences differently.
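You can inspect such sequences by spreading to codepoints; for example, the family emoji and the flag decompose like this (a quick sketch):
[...'👨‍👩‍👧'].map(cp => cp.codePointAt(0).toString(16))
// ['1f468', '200d', '1f469', '200d', '1f467']  (man, ZWJ, woman, ZWJ, girl)
[...'🇺🇸'].map(cp => cp.codePointAt(0).toString(16))
// ['1f1fa', '1f1f8']  (Regional Indicators U and S)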
Bytes vs characters
For storage limits, byte count is what matters:
new TextEncoder().encode('A').length // 1 byte
new TextEncoder().encode('é').length // 2 bytes (U+00E9)
new TextEncoder().encode('中').length // 3 bytes
new TextEncoder().encode('😀').length // 4 bytes
SMS pricing, MySQL VARCHAR limits, and storage validation all care about bytes. UTF-8 makes this 1–4× the codepoint count.
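For byte-based limits, a small validation sketch (the helper name fitsInBytes is hypothetical) just measures the encoded length:
// Illustrative check against a byte budget (e.g. a VARCHAR byte limit).
function fitsInBytes(text, maxBytes) {
  return new TextEncoder().encode(text).length <= maxBytes;
}
fitsInBytes('hello', 5) // true (5 bytes)
fitsInBytes('héllo', 5) // false (6 bytes: é takes 2 bytes in UTF-8)
fitsInBytes('😀', 3) // false (4 bytes)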
Database considerations
- MySQL utf8 is broken. It only supports up to 3 bytes per character, so no emoji. Always use utf8mb4.
- PostgreSQL is full UTF-8 by default. No special configuration needed.
- VARCHAR(255) means different things. Some databases count characters; others count bytes. Check yours.
- Collation affects sorting and comparison. Choose case-insensitive or case-sensitive deliberately.
Common bugs
- Truncating at code-unit boundaries. Splits surrogate pairs, produces invalid strings.
- Counting JavaScript .length as "characters." Wrong for any non-BMP character.
- Form validation by length. A user with an emoji name fails validation that ASCII-only users don't.
- Displaying byte length as character count. Says "3 characters" for an emoji.
- Comparing without normalization. User logs in with one form, system stores another, login fails.
- UTF-8 BOM in JSON files. Some tools choke. Strip it server-side.
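For the BOM case, a minimal server-side sketch (assuming the input has already been decoded to a string):
// Strip a leading BOM (decoded to U+FEFF) before parsing.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}
JSON.parse(stripBom('\uFEFF{"ok":true}')) // { ok: true } (JSON.parse alone would throw)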
Testing checklist
For any feature handling user-provided text, test with:
- Plain ASCII (baseline)
- Latin with diacritics (é, ñ, ö)
- Non-Latin scripts (中, 日, ع, हिन्दी)
- Single-codepoint emoji (😀)
- Multi-codepoint emoji (👨‍👩‍👧, 👍🏽, 🇺🇸)
- Right-to-left text (Arabic, Hebrew)
- Combining marks (e + U+0301 = é, but as two codepoints)
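One way to keep this checklist handy is a small fixture array (the strings are illustrative, not exhaustive):
const unicodeSamples = [
  'hello world', // plain ASCII
  'café niño schön', // Latin with diacritics
  '中文 日本語 हिन्दी', // non-Latin scripts
  '😀', // single-codepoint emoji
  '👨‍👩‍👧 👍🏽 🇺🇸', // multi-codepoint emoji (ZWJ, skin tone, flag)
  'مرحبا שלום', // right-to-left text
  'e\u0301', // combining mark: decomposed é
];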
Key Takeaways
- Unicode separates codepoints (the abstract characters) from encodings (UTF-8/16/32 byte representations).
- JavaScript .length counts UTF-16 code units. Emoji often need 2 code units: "😀".length is 2.
- Use spread or for-of to iterate codepoints; Intl.Segmenter for grapheme clusters (user-visible characters).
- Normalize strings before comparison. NFC for storage and equality; NFKC for search.
- Test with emoji, non-Latin scripts, and combining marks. ASCII-only testing hides Unicode bugs.