Unicode, UTF-8, and Why "String Length" Lies to You
Codepoints, code units, grapheme clusters, surrogate pairs. Learn why JavaScript says "👨‍👩‍👧".length is 8, why emoji break your form validation, and how to actually count characters correctly.
Strings look simple until you encounter an emoji, a non-Latin script, or international names. "Length" isn't one number: it can be three different things depending on what you mean. Truncating a tweet, validating a username, or pricing per-character SMS all hit Unicode complexity that most code is blind to.
The layered model
Unicode separates several concepts that pre-Unicode encodings conflated:
- Character / Codepoint: a unique number assigned to each character. "A" is U+0041. "😀" is U+1F600. Roughly 155,000 codepoints are assigned as of Unicode 16.0.
- Encoding: how those numbers are turned into bytes. UTF-8, UTF-16, UTF-32 are all valid encodings of the same Unicode codepoints.
- Code unit: the fixed-size chunks an encoding splits into. UTF-8 uses 8-bit units; UTF-16 uses 16-bit units.
- Grapheme cluster: a single user-perceived "character". May span multiple codepoints. "👨‍👩‍👧" is one grapheme cluster but 5 codepoints.
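To make the distinction concrete, here is a quick sketch of how the three counts differ for the family emoji (plain JavaScript; Intl.Segmenter defaults to grapheme granularity):
const family = '👨‍👩‍👧';
family.length // 8 (UTF-16 code units)
[...family].length // 5 (codepoints)
[...new Intl.Segmenter().segment(family)].length // 1 (grapheme clusters)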
UTF-8: variable-width 8-bit encoding
UTF-8 encodes each codepoint in 1–4 bytes:
- U+0000–U+007F (ASCII): 1 byte. Backwards-compatible with ASCII.
- U+0080–U+07FF: 2 bytes (Latin Extended, Greek, Cyrillic, Hebrew, Arabic).
- U+0800–U+FFFF: 3 bytes (most Asian scripts, common symbols).
- U+10000–U+10FFFF: 4 bytes (rare scripts, emoji).
UTF-8 is the dominant web encoding (98%+ of websites). Use it everywhere unless you have a strong reason not to.
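As a rough sketch of the ranges above (the helper name utf8ByteCount is just for illustration), a codepoint's UTF-8 byte count comes down to a few comparisons:
// Illustrative helper: how many UTF-8 bytes a single codepoint needs.
// (Ignores the surrogate range U+D800–U+DFFF, which is not a valid codepoint.)
function utf8ByteCount(codepoint) {
  if (codepoint <= 0x7F) return 1;   // ASCII
  if (codepoint <= 0x7FF) return 2;  // Latin Extended, Greek, Cyrillic, ...
  if (codepoint <= 0xFFFF) return 3; // rest of the Basic Multilingual Plane
  return 4;                          // supplementary planes (emoji, rare scripts)
}
utf8ByteCount('A'.codePointAt(0)) // 1
utf8ByteCount('é'.codePointAt(0)) // 2
utf8ByteCount('中'.codePointAt(0)) // 3
utf8ByteCount('😀'.codePointAt(0)) // 4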
UTF-16: the JavaScript trap
JavaScript strings are sequences of UTF-16 code units. For codepoints above U+FFFF (most emoji), UTF-16 uses two 16-bit code units called a "surrogate pair."
'A'.length // 1 (one code unit)
'é'.length // 1 (single code unit)
'中'.length // 1 (single code unit)
'😀'.length // 2 (surrogate pair: D83D DE00)
'👨‍👩‍👧'.length // 8 (emoji + ZWJ + emoji + ZWJ + emoji)
The string "😀" is one character to humans and one codepoint, but two UTF-16 code units, so JavaScript reports its length as 2.
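You can see the split directly: charCodeAt reads individual UTF-16 code units, while codePointAt reassembles the surrogate pair (a small sketch):
'😀'.charCodeAt(0).toString(16) // 'd83d' (high surrogate)
'😀'.charCodeAt(1).toString(16) // 'de00' (low surrogate)
'😀'.codePointAt(0).toString(16) // '1f600' (the actual codepoint)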
Why this is a security and UX issue
Counting UTF-16 code units where you mean characters leads to truncation that splits surrogate pairs and produces malformed strings, length limits that reject names containing emoji or non-Latin scripts, and quota checks that disagree with what the user sees on screen.
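A minimal sketch of how this bites, assuming a naive length-based truncation (the name variable is hypothetical): slicing at a code-unit boundary can cut a surrogate pair in half, and the lone surrogate is rejected by APIs that expect well-formed strings.
const name = 'Ana😀'; // hypothetical user input: 4 graphemes, 5 code units
const truncated = name.slice(0, 4); // 'Ana\ud83d' (lone high surrogate)
// The result is malformed UTF-16: it renders as "Ana" plus a replacement
// character, and some APIs refuse it outright.
encodeURIComponent(truncated); // throws URIError: URI malformed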
Iterating correctly
For codepoints, use the spread operator or for-of:
[...'😀'].length // 1 (codepoints)
Array.from('😀').length // 1
'😀'.length // 2 (UTF-16 code units: wrong)
for (const ch of '😀a') {
  console.log(ch); // '😀', then 'a'
}
For grapheme clusters (the user-perceived "character"), use Intl.Segmenter:
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const segments = [...seg.segment('👨‍👩‍👧 hi')];
console.log(segments.length); // 4 (family emoji + space + h + i)
Intl.Segmenter is the modern, correct way. Older browsers may need polyfills.
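Building on that, a grapheme-aware truncation helper is a natural use of Intl.Segmenter (the function name truncateGraphemes is purely illustrative):
// Illustrative helper: keep at most `max` user-perceived characters.
function truncateGraphemes(text, max, locale = 'en') {
  const seg = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of seg.segment(text)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}
truncateGraphemes('👨‍👩‍👧 hi', 2) // '👨‍👩‍👧 ' (family emoji + space), never half an emoji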
Normalization: the "same-but-different" problem
Unicode allows multiple ways to represent the same visual character:
- "Γ©" can be U+00E9 (single precomposed character)...
- ... or U+0065 + U+0301 (e + combining acute accent).
These look identical but are not equal as strings:
'\u00E9'.length // 1 (precomposed é)
'e\u0301'.length // 2 (e + combining accent)
'\u00E9' === 'e\u0301' // false
For comparison, normalize first:
'\u00E9'.normalize() === 'e\u0301'.normalize() // true (normalize() defaults to NFC)
Four normalization forms:
- NFC (Canonical Composition): default for most use. Compresses where possible.
- NFD (Canonical Decomposition): expands. Useful for diacritic-insensitive search.
- NFKC / NFKD: "compatibility" forms. Treats "ﬁ" (ligature) as "fi". Use when comparing text for search; avoid for storage.
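A quick check of the difference, using the fi ligature (U+FB01) mentioned above:
'\uFB01'.normalize('NFC') // 'ﬁ' (ligature preserved)
'\uFB01'.normalize('NFKC') // 'fi' (compatibility mapping applied)
'\uFB01le' === 'file' // false
'\uFB01le'.normalize('NFKC') === 'file'.normalize('NFKC') // true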
Common emojis as multi-codepoint sequences
Modern emojis often combine multiple codepoints with Zero-Width Joiners (ZWJ, U+200D):
- 👨‍👩‍👧: 👨 + ZWJ + 👩 + ZWJ + 👧 = 5 codepoints, 1 grapheme.
- 👍🏽: 👍 + skin tone modifier = 2 codepoints, 1 grapheme.
- 🇺🇸: Regional Indicator U + Regional Indicator S = 2 codepoints, 1 grapheme.
These render as one icon if the platform supports the combination, and as separate components otherwise. Always test on target platforms: Slack, Twitter, iOS, Android, and Windows all render some sequences differently.
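You can inspect such sequences by spreading to codepoints; for example, the family emoji and the flag decompose like this (a quick sketch):
[...'👨‍👩‍👧'].map(cp => cp.codePointAt(0).toString(16))
// ['1f468', '200d', '1f469', '200d', '1f467']  (man, ZWJ, woman, ZWJ, girl)
[...'🇺🇸'].map(cp => cp.codePointAt(0).toString(16))
// ['1f1fa', '1f1f8']  (Regional Indicators U and S)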
Bytes vs characters
For storage limits, byte count is what matters:
new TextEncoder().encode('A').length // 1 byte
new TextEncoder().encode('é').length // 2 bytes (U+00E9)
new TextEncoder().encode('中').length // 3 bytes
new TextEncoder().encode('😀').length // 4 bytes
SMS pricing, MySQL VARCHAR limits, and storage validation all care about bytes. UTF-8 makes this 1–4× the codepoint count.
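For byte-based limits, a small validation sketch (the helper name fitsInBytes is hypothetical) just measures the encoded length:
// Illustrative check against a byte budget (e.g. a VARCHAR byte limit).
function fitsInBytes(text, maxBytes) {
  return new TextEncoder().encode(text).length <= maxBytes;
}
fitsInBytes('hello', 5) // true (5 bytes)
fitsInBytes('héllo', 5) // false (6 bytes: é takes 2 bytes in UTF-8)
fitsInBytes('😀', 3) // false (4 bytes)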
Database considerations
- MySQL utf8 is broken. It only supports up to 3 bytes per character, so no emoji. Always use utf8mb4.
- PostgreSQL is full UTF-8 by default. No special configuration needed.
- VARCHAR(255) means different things. Some databases count characters; others count bytes. Check yours.
- Collation affects sorting and comparison. Choose case-insensitive or case-sensitive deliberately.
Common bugs
- Truncating at code-unit boundaries. Splits surrogate pairs, produces invalid strings.
- Counting JavaScript .length as "characters." Wrong for any non-BMP character.
- Form validation by length. A user with an emoji name fails validation that ASCII-only users don't.
- Displaying byte length as character count. Says "3 characters" for an emoji.
- Comparing without normalization. User logs in with one form, system stores another, login fails.
- UTF-8 BOM in JSON files. Some tools choke. Strip it server-side.
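For the BOM case, a minimal server-side sketch (assuming the input has already been decoded to a string):
// Strip a leading BOM (decoded to U+FEFF) before parsing.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}
JSON.parse(stripBom('\uFEFF{"ok":true}')) // { ok: true } (JSON.parse alone would throw)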
Testing checklist
For any feature handling user-provided text, test with:
- Plain ASCII (baseline)
- Latin with diacritics (é, ñ, ö)
- Non-Latin scripts (中, 日, ع, हिन्दी)
- Single-codepoint emoji (😀)
- Multi-codepoint emoji (👨‍👩‍👧, 👍🏽, 🇺🇸)
- Right-to-left text (Arabic, Hebrew)
- Combining marks (e + U+0301 = é, but as two codepoints)
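One way to keep this checklist handy is a small fixture array (the strings are illustrative, not exhaustive):
const unicodeSamples = [
  'hello world', // plain ASCII
  'café niño schön', // Latin with diacritics
  '中文 日本語 हिन्दी', // non-Latin scripts
  '😀', // single-codepoint emoji
  '👨‍👩‍👧 👍🏽 🇺🇸', // multi-codepoint emoji (ZWJ, skin tone, flag)
  'مرحبا שלום', // right-to-left text
  'e\u0301', // combining mark: decomposed é
];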
Key Takeaways
- Unicode separates codepoints (the abstract characters) from encodings (UTF-8/16/32 byte representations).
- JavaScript .length counts UTF-16 code units. Emoji often need 2 code units: "😀".length is 2.
- Use spread or for-of to iterate codepoints; Intl.Segmenter for grapheme clusters (user-visible characters).
- Normalize strings before comparison. NFC for storage and equality; NFKC for search.
- Test with emoji, non-Latin scripts, and combining marks. ASCII-only testing hides Unicode bugs.