Unicode, UTF-8, and Why "String Length" Lies to You

Codepoints, code units, grapheme clusters, surrogate pairs. Learn why JavaScript says "👨‍👩‍👧".length is 8, why emoji break your form validation, and how to actually count characters correctly.


Strings look simple until you encounter an emoji, a non-Latin script, or international names. "Length" isn't one number: it can mean bytes, code units, codepoints, or grapheme clusters, depending on what you are measuring. Truncating a tweet, validating a username, or pricing per-character SMS all hit Unicode complexity that most code is blind to.

The layered model

Unicode separates several concepts that pre-Unicode encodings conflated:

  • Character / Codepoint: a unique number assigned to each character. "A" is U+0041. "😀" is U+1F600. About 155,000 characters are assigned as of Unicode 16.
  • Encoding: how those numbers are turned into bytes. UTF-8, UTF-16, UTF-32 are all valid encodings of the same Unicode codepoints.
  • Code unit: the fixed-size chunks an encoding splits into. UTF-8 uses 8-bit units; UTF-16 uses 16-bit units.
  • Grapheme cluster: a single user-perceived "character". May span multiple codepoints. "👨‍👩‍👧" is one grapheme cluster but 5 codepoints.
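
To make the layers concrete, here is a minimal sketch that measures one string four ways. The helper name `measure` is hypothetical; it relies only on the standard `TextEncoder`, string iteration, and `Intl.Segmenter` APIs, and assumes a runtime that ships `Intl.Segmenter`.

```javascript
// Hypothetical helper: report the four "lengths" of a string.
function measure(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return {
    utf8Bytes: new TextEncoder().encode(text).length,   // bytes in UTF-8
    utf16CodeUnits: text.length,                        // what .length reports
    codePoints: [...text].length,                       // string iteration is per codepoint
    graphemes: [...segmenter.segment(text)].length,     // user-perceived characters
  };
}

// Family emoji: man + ZWJ + woman + ZWJ + girl
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(measure(family));
// { utf8Bytes: 18, utf16CodeUnits: 8, codePoints: 5, graphemes: 1 }
console.log(measure('A'));
// { utf8Bytes: 1, utf16CodeUnits: 1, codePoints: 1, graphemes: 1 }
```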

UTF-8: variable-width 8-bit encoding

UTF-8 encodes each codepoint in 1–4 bytes:

  • U+0000–U+007F (ASCII): 1 byte. Backwards-compatible with ASCII.
  • U+0080–U+07FF: 2 bytes (Latin Extended, Greek, Cyrillic, Hebrew, Arabic).
  • U+0800–U+FFFF: 3 bytes (most Asian scripts, common symbols).
  • U+10000–U+10FFFF: 4 bytes (rare scripts, emoji).

UTF-8 is the dominant web encoding (98%+ of websites). Use it everywhere unless you have a strong reason not to.
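
The four ranges above map directly onto a small lookup. The sketch below uses a hypothetical helper, `utf8ByteLength`, and cross-checks it against the standard `TextEncoder`; it illustrates the table and is not a replacement for a real encoder.

```javascript
// Hypothetical helper: UTF-8 byte count for a single codepoint,
// following the four ranges listed above.
// (Surrogates U+D800-U+DFFF are not valid in UTF-8; this sketch does not reject them.)
function utf8ByteLength(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode codepoint: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;    // ASCII
  if (codePoint <= 0x7FF) return 2;   // Latin Extended, Greek, Cyrillic, ...
  if (codePoint <= 0xFFFF) return 3;  // rest of the Basic Multilingual Plane
  return 4;                           // supplementary planes (most emoji)
}

const encoder = new TextEncoder();
for (const ch of ['A', '\u00E9', '\u4E2D', '\u{1F600}']) {
  const predicted = utf8ByteLength(ch.codePointAt(0));
  const actual = encoder.encode(ch).length;
  console.log(ch, predicted, actual); // predicted and actual agree: 1, 2, 3, 4
}
```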

UTF-16: the JavaScript trap

JavaScript strings are sequences of UTF-16 code units. For codepoints above U+FFFF (most emoji), UTF-16 uses two 16-bit code units called a "surrogate pair."

'A'.length      // 1 (one code unit)
'é'.length      // 1 (single code unit)
'中'.length     // 1 (single code unit)
'😀'.length     // 2 (surrogate pair: D83D DE00)
'👨‍👩‍👧'.length    // 8 (emoji + ZWJ + emoji + ZWJ + emoji)

The string "😀" is one character to humans, one codepoint, but two UTF-16 code units. JavaScript reports length as 2.
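
The D83D DE00 pair is not arbitrary; it follows from the UTF-16 arithmetic. A minimal sketch, with `toSurrogatePair` as a hypothetical helper name:

```javascript
// Hypothetical helper: split a codepoint above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only codepoints above U+FFFF use surrogate pairs');
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();
console.log(toSurrogatePair(0x1F600).map(hex)); // [ 'D83D', 'DE00' ]

// The engine stores the same two code units:
console.log(hex('\u{1F600}'.charCodeAt(0)), hex('\u{1F600}'.charCodeAt(1))); // D83D DE00
```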

Why this is a security and UX issue

Server-side validation truncating usernames at 30 "characters" via JavaScript string length actually limits users to 15 emoji, and to even fewer multi-codepoint sequences. Worse, splitting a string at code-unit boundaries can leave unpaired surrogates, which render as broken text. Validation, truncation, and indexing all need Unicode-aware handling.
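
A small demonstration of both failure modes. The 30-unit limit and the `naiveTruncate` helper are hypothetical stand-ins for the kind of validation described above.

```javascript
// Hypothetical limit check: "30 characters" measured in UTF-16 code units.
const LIMIT = 30;
function naiveTruncate(text) {
  return text.slice(0, LIMIT); // slices by code unit, not by character
}

const emojiName = '\u{1F600}'.repeat(30); // 30 visible emoji
console.log(emojiName.length);            // 60 code units

const truncated = naiveTruncate(emojiName);
console.log([...truncated].length);       // 15 (only half the emoji survive)

// An odd cut lands in the middle of a surrogate pair:
const broken = emojiName.slice(0, 29);
console.log(broken.charCodeAt(28).toString(16)); // d83d (a high surrogate with no partner)
```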

Iterating correctly

For codepoints, use the spread operator or for-of:

[...'😀'].length         // 1 (codepoints)
Array.from('😀').length  // 1
'😀'.length              // 2 (UTF-16 code units, not codepoints)

for (const ch of '😀a') {
  console.log(ch); // '😀', then 'a'
}

For grapheme clusters (the user-perceived "character"), use Intl.Segmenter:

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...seg.segment('👨‍👩‍👧 hi')];
console.log(segments.length); // 4 (family emoji + space + h + i)

Intl.Segmenter is the modern, correct way. Older browsers may need polyfills.
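
Building on `Intl.Segmenter`, here is a sketch of grapheme-aware counting and truncation. The helper names `countGraphemes` and `truncateGraphemes` are hypothetical, and the default locale `'en'` is an assumption; exact cluster boundaries depend on the Unicode data your runtime ships.

```javascript
// Hypothetical helpers built on the standard Intl.Segmenter API.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _ of segmenter.segment(text)) count += 1;
  return count;
}

function truncateGraphemes(text, maxGraphemes, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxGraphemes) break; // stop before exceeding the limit
    result += segment;                 // append the whole cluster, never part of it
    count += 1;
  }
  return result;
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(countGraphemes(family + ' hi'));                        // 4
console.log(truncateGraphemes(family + ' hi', 2) === family + ' '); // true
```

Unlike `slice`, this never cuts inside a surrogate pair or a ZWJ sequence, at the cost of walking the string once per call.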

Normalization: the "same-but-different" problem

Unicode allows multiple ways to represent the same visual character:

  • "é" can be U+00E9 (single precomposed character)...
  • ... or U+0065 + U+0301 (e + combining acute accent).

These look identical but are not equal as strings:

'\u00E9'.length          // 1 (precomposed é)
'e\u0301'.length         // 2 (e + combining acute accent)
'\u00E9' === 'e\u0301'   // false

For comparison, normalize first:

'\u00E9'.normalize() === 'e\u0301'.normalize()  // true (normalize() defaults to NFC)

Four normalization forms:

  • NFC (Canonical Composition): the default choice for most uses. Composes characters where possible.
  • NFD (Canonical Decomposition): expands. Useful for diacritic-insensitive search.
  • NFKC / NFKD: "compatibility" forms. Treats "ﬁ" (the single-codepoint ligature U+FB01) as "fi". Use when comparing text for search; avoid for storage.
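
A short sketch of how the forms might be applied. `canonicalEquals`, `stripDiacritics`, and `searchKey` are hypothetical helper names; only `String.prototype.normalize` and the `\p{M}` (combining mark) property escape are standard.

```javascript
// Canonical equality: compare after normalizing both sides to NFC.
function canonicalEquals(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Diacritic-insensitive text: decompose (NFD), then drop combining marks.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

// Search key: fold compatibility characters (NFKC), strip marks, lowercase.
// Intended for matching only, not for storage.
function searchKey(text) {
  return stripDiacritics(text.normalize('NFKC')).toLowerCase();
}

console.log(canonicalEquals('\u00E9', 'e\u0301'));                  // true
console.log(stripDiacritics('r\u00E9sum\u00E9'));                   // resume
console.log('\uFB01'.normalize('NFKC'));                            // fi
console.log(searchKey('\uFB01anc\u00E9') === searchKey('FIANCE'));  // true
```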

Common emojis as multi-codepoint sequences

Modern emojis often combine multiple codepoints with Zero-Width Joiners (ZWJ, U+200D):

  • 👨‍👩‍👧: 👨 + ZWJ + 👩 + ZWJ + 👧 = 5 codepoints, 1 grapheme.
  • 👍🏽: 👍 + skin tone modifier = 2 codepoints, 1 grapheme.
  • 🇺🇸: Regional Indicator U + Regional Indicator S = 2 codepoints, 1 grapheme.

These sequences display as one icon if the platform supports the combination; otherwise they fall back to separate components. Always test with target platforms: Slack, Twitter, iOS, Android, and Windows all render some sequences differently.
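
When a sequence renders unexpectedly, it helps to see the codepoints inside it. A minimal sketch, with `describeCodePoints` as a hypothetical debugging helper:

```javascript
// Hypothetical debugging helper: list the codepoints in a string as U+XXXX labels.
function describeCodePoints(text) {
  return [...text].map((ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

console.log(describeCodePoints('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'));
// [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467' ]   (family)
console.log(describeCodePoints('\u{1F44D}\u{1F3FD}'));
// [ 'U+1F44D', 'U+1F3FD' ]   (thumbs up + medium skin tone)
console.log(describeCodePoints('\u{1F1FA}\u{1F1F8}'));
// [ 'U+1F1FA', 'U+1F1F8' ]   (regional indicators U and S)
```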

Bytes vs characters

For storage limits, byte count is what matters:

new TextEncoder().encode('A').length     // 1 byte
new TextEncoder().encode('é').length     // 2 bytes (U+00E9)
new TextEncoder().encode('中').length    // 3 bytes
new TextEncoder().encode('😀').length    // 4 bytes

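Storage quotas, MySQL row and index size limits, and protocol payload limits are all measured in bytes. SMS is similar in spirit: segments are sized in encoded units (GSM-7 or UCS-2), not visible characters. In UTF-8, the byte count is 1–4× the codepoint count.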
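
When the limit is a byte budget, truncation still has to stop on a codepoint boundary. A minimal sketch, with `truncateToUtf8Bytes` as a hypothetical helper name:

```javascript
// Hypothetical helper: keep as many whole codepoints as fit in maxBytes of UTF-8.
function truncateToUtf8Bytes(text, maxBytes) {
  const encoder = new TextEncoder();
  let result = '';
  let used = 0;
  for (const ch of text) {                    // for-of iterates codepoints
    const size = encoder.encode(ch).length;   // 1 to 4 bytes each
    if (used + size > maxBytes) break;
    result += ch;
    used += size;
  }
  return result;
}

console.log(truncateToUtf8Bytes('a\u{1F600}b', 4)); // a   (the emoji needs 4 more bytes)
console.log(truncateToUtf8Bytes('a\u{1F600}b', 5) === 'a\u{1F600}'); // true
```

This version can still cut a grapheme cluster in half (for example, between an emoji and its skin tone modifier); segmenting by grapheme first avoids that.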

Database considerations

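  • MySQL utf8 is broken. It only supports up to 3 bytes per character, so it cannot store emoji. Always use utf8mb4.
  • PostgreSQL's UTF8 encoding covers the full range, including 4-byte characters. It is the usual default, but confirm the encoding your database was created with.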
  • VARCHAR(255) means different things. Some databases count characters; others count bytes. Check yours.
  • Collation affects sorting and comparison. Choose case-insensitive or case-sensitive deliberately.

Common bugs

  • Truncating at code-unit boundaries. Splits surrogate pairs, produces invalid strings.
  • Counting JavaScript .length as "characters." Wrong for any non-BMP character.
  • Form validation by length. A name containing emoji fails a limit that an ASCII name of the same visible length passes.
  • Displaying byte length as character count. Says "4 characters" for a single emoji such as 😀.
  • Comparing without normalization. User logs in with one form, system stores another, login fails.
  • UTF-8 BOM in JSON files. Some tools choke. Strip it server-side.
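
For the BOM case, stripping a leading U+FEFF before parsing is usually enough. A minimal sketch, with `parseJsonText` as a hypothetical wrapper around the standard `JSON.parse`:

```javascript
// Hypothetical wrapper: drop a leading byte order mark (U+FEFF) before parsing.
function parseJsonText(text) {
  const withoutBom = text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
  return JSON.parse(withoutBom);
}

const withBom = '\uFEFF{"ok":true}';

// JSON.parse itself rejects the BOM, because U+FEFF is not JSON whitespace.
let rejected = false;
try {
  JSON.parse(withBom);
} catch (err) {
  rejected = err instanceof SyntaxError;
}
console.log(rejected);                   // true
console.log(parseJsonText(withBom).ok);  // true
```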

Testing checklist

For any feature handling user-provided text, test with:

  • Plain ASCII (baseline)
  • Latin with diacritics (é, ñ, ö)
  • Non-Latin scripts (中, 日, ع, हिन्दी)
  • Single-codepoint emoji (😀)
  • Multi-codepoint emoji (👨‍👩‍👧, 👍🏽, 🇺🇸)
  • Right-to-left text (Arabic, Hebrew)
  • Combining marks (e + U+0301 renders as é, but is two codepoints)
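
The checklist can be turned into a reusable fixture. The sketch below is one possible shape: `samples`, `isWellFormed`, and `findBrokenOutputs` are hypothetical names, and the well-formedness check works by round-tripping through UTF-8, which replaces any unpaired surrogate with U+FFFD.

```javascript
// Sample inputs covering the checklist above (written as escapes for portability).
const samples = [
  'plain ascii',
  'caf\u00E9 ni\u00F1o',                      // Latin with diacritics
  '\u4E2D\u65E5',                             // CJK
  '\u0639\u0631\u0628\u064A',                 // Arabic (right-to-left)
  '\u{1F600}',                                // single-codepoint emoji
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}',  // ZWJ sequence
  '\u{1F44D}\u{1F3FD}',                       // emoji + skin tone modifier
  '\u{1F1FA}\u{1F1F8}',                       // flag (regional indicators)
  'e\u0301',                                  // combining mark
];

// True if the string contains no unpaired surrogates.
function isWellFormed(text) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder('utf-8', { ignoreBOM: true }).decode(bytes) === text;
}

// Run a text-handling function over every sample and return the inputs it corrupts.
function findBrokenOutputs(fn) {
  return samples.filter((input) => !isWellFormed(fn(input)));
}

console.log(findBrokenOutputs((s) => s).length);             // 0
console.log(findBrokenOutputs((s) => s.slice(0, 1)).length); // 4 (every sample that starts with an emoji)
```

A check like this only catches broken surrogate pairs; split grapheme clusters and normalization mismatches need their own assertions.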

Key Takeaways

  • Unicode separates codepoints (the abstract characters) from encodings (UTF-8/16/32 byte representations).
  • JavaScript .length counts UTF-16 code units. Emoji often need 2 code units: "😀".length is 2.
  • Use spread or for-of to iterate codepoints; Intl.Segmenter for grapheme clusters (user-visible characters).
  • Normalize strings before comparison. NFC for storage and equality; NFKC for search.
  • Test with emoji, non-Latin scripts, and combining marks. ASCII-only testing hides Unicode bugs.