# How big is a JavaScript string?

*The many sizes of a JS string*

When measuring a string in JavaScript, `string.length` comes to mind. However, that property does not tell the full story. Strings have many sizes: code units, code points, bytes, pixels, terminal columns. Let’s dive into their differences.
## Code units
Each string element in JavaScript is a UTF-16 code unit. In other words, any `string[index]`, commonly (and ambiguously) referred to as a string “character.”

Code units are fast and convenient to use since they underpin all string operations (with a few exceptions noted below), including `string.length`, `string.slice()`, `string === string`, `string.replace()` and so on.
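As a quick illustration of code-unit-based operations (using ASCII and BMP characters only, so each visible character is a single code unit):

```javascript
// All of these operations count and index by UTF-16 code units.
const string = "Olá!"
const length = string.length // 4
const sliced = string.slice(1, 3) // "lá"
const replaced = string.replace("á", "a") // "Ola!"
```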
Strings have a maximum size in code units. While the standard defines it as at most `~9e15`, JavaScript engines implement much lower limits: `~5e8` with V8, `~1e9` with SpiderMonkey and `~2e9` with JavaScriptCore.
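Exceeding the limit throws a `RangeError`, whichever engine runs the code. A minimal check:

```javascript
// Requesting a string longer than the engine's maximum throws a RangeError
// (the exact limit varies by engine, but all reject this length).
let error
try {
  "a".repeat(Number.MAX_SAFE_INTEGER)
} catch (repeatError) {
  error = repeatError
}
error instanceof RangeError // true
```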

## Code points
A Unicode code point is a number identifying a single abstract character. It is often noted in hexadecimal: for example, the decimal number `129445` would be written `U+01F9A5` 🦥 (sloth). Unicode maintains a list of characters with code points ranging:

- From `U+000000` to `U+00007F`: ASCII
- From `U+000080` to `U+00FFFF`: BMP (Basic Multilingual Plane)
- From `U+010000` to `U+10FFFF`: astral code points. This is where you’ll find emoji! 🎶
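The decimal and hexadecimal notations can be converted with standard number methods, for example:

```javascript
// Converting between the decimal and hexadecimal notations of a code point
const decimal = 129445
const hexadecimal = decimal.toString(16) // "1f9a5"
const character = String.fromCodePoint(decimal) // "🦥"
parseInt(hexadecimal, 16) === decimal // true
```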
Usually, a UTF-16 code unit is equivalent to a code point. For example, `"Olá!"` has 4 code units with their own code points: `U+004F`, `U+006C`, `U+00E1` and `U+0021`.
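This equivalence can be verified with `charCodeAt()` and `codePointAt()`, which return code units and code points respectively:

```javascript
// For BMP characters, each code unit's value is also its code point.
const string = "Olá!"
const codeUnits = [...string].map((character) => character.charCodeAt(0))
const codePoints = [...string].map((character) => character.codePointAt(0))
// Both are [0x4F, 0x6C, 0xE1, 0x21]
```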

However, astral code points (`U+010000` and above) are broken down into two code units:

- First, a “high/leading surrogate” from `\uD800` to `\uDBFF`
- Then, a “low/trailing surrogate” from `\uDC00` to `\uDFFF`

For example, `U+01F9A5` becomes `\uD83E` and `\uDDA5`. This conversion from a code point to a surrogate pair is specific to UTF-16. If you’re curious about it, please check this code sample.
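The conversion can be sketched in a few lines (a simplified version, assuming the input is a valid astral code point):

```javascript
// UTF-16 splits an astral code point into a high and a low surrogate:
// the offset from 0x10000 is divided into two 10-bit halves.
const toSurrogatePair = (codePoint) => {
  const offset = codePoint - 0x10000
  const highSurrogate = 0xd800 + (offset >> 10)
  const lowSurrogate = 0xdc00 + (offset & 0x3ff)
  return [highSurrogate, lowSurrogate]
}

const [high, low] = toSurrogatePair(0x1f9a5)
// high === 0xD83E, low === 0xDDA5
String.fromCharCode(high, low) // "🦥"
```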

This only applies if the two surrogates follow each other in that order. In JavaScript, isolated or inverted surrogates are valid but considered their own code points and generally either invisible or printed as � (replacement character `U+FFFD`).
Since most string operations use code units, surrogates can be a problem. In some cases, they work just fine. For example, the following statements are correct because `"🦥"` is equivalent to `"\uD83E\uDDA5"`, i.e. to the full surrogate pair.
```javascript
const string = "🦥"
string === "🦥" // true
/🦥/.test(string) // true
string + "🐨" // "🦥🐨"
string.replace("🦥", "🐨") // "🐨"
"🐨🦥🐨".split("🦥") // ["🐨", "🐨"]
```
However, some string operations might target individual surrogates and split the pair.
```javascript
const string = "🦥"
string[0] // "\uD83E"
string[1] // "\uDDA5"
string[0] === "🦥" // false
/^.$/.test(string) // false
/^..$/.test(string) // true
string.length // 2
"🦥🐨🐨".slice(2) // "🐨🐨"
string.replace(/./g, "🐨") // "🐨🐨"
[...string.matchAll(/./g)] // [["\uD83E"], ["\uDDA5"]]
```
Fortunately, a few string operations use code points instead of code units:

- Iterations, including `[...string]` and `for (const codePoint of string)`, but excluding `for (const index in string)`
- Regular expressions with the Unicode flag: `/.../u`
- `\u{000000}` instead of `\u0000`
- `string.codePointAt()` and `String.fromCodePoint()` instead of `string.charCodeAt()` and `String.fromCharCode()`
- `string.to*Case()` and `string.trim*()`
```javascript
const string = "🦥"
string === "\uD83E\uDDA5" // true
string === "\u{01F9A5}" // true
/^.$/u.test(string) // true
[...string].length // 1 (code points)
[..."🦥🐨🐨"].slice(2).join("") // "🐨" (slice/truncate by code point)
string.replace(/./gu, "🐨") // "🐨"
[...string.matchAll(/./gu)] // [["🦥"]]
string.charCodeAt(0) // 0xD83E
string.charCodeAt(1) // 0xDDA5
string.codePointAt(0) // 0x1F9A5
string.codePointAt(1) // 0xDDA5 (careful: the index is in code units)
```
## Bytes
When a string is written to a file or sent over the network, it is first serialized to a series of bytes. This binary representation differs from code units and code points.
Character encodings translate each code point into one or several bytes. While many of them exist, the most common ones these days are UTF-16 and UTF-8.
Since JavaScript strings are based on UTF-16, their binary representation in that character encoding is straightforward (aside from endianness): each code unit translates to an equivalent byte pair.
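In Node.js, this can be observed with `Buffer` (browsers would need other APIs); `utf16le` selects the little-endian byte order:

```javascript
// Serializing to UTF-16 little-endian: each code unit becomes a byte pair.
const buffer = Buffer.from("🦥", "utf16le")
// Code units 0xD83E and 0xDDA5 -> bytes [0x3E, 0xD8, 0xA5, 0xDD]
```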

For UTF-8, each code point translates to a series of 1 to 4 bytes. Lower Unicode code points take fewer bytes. In particular, ASCII characters are 1 byte long. On the other hand, astral code points are 4 bytes long. The conversion logic is explained in detail here.
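The growing byte counts can be observed with `TextEncoder`, which always encodes to UTF-8:

```javascript
// UTF-8 byte counts grow with the code point's value.
const byteLength = (string) => new TextEncoder().encode(string).length
byteLength("a") // 1 (ASCII)
byteLength("á") // 2
byteLength("中") // 3
byteLength("🦥") // 4 (astral)
```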

Some JavaScript packages are available for common operations like retrieving a string’s size in bytes or slicing it bytewise (see `string-byte-length` and `string-byte-slice`). Otherwise, `Uint8Array`s can be used to represent series of bytes, and `TextEncoder`/`TextDecoder` to convert strings to/from them.
```javascript
const encoder = new TextEncoder()
const decoder = new TextDecoder()
const string = "🦥"
const uint8Array = encoder.encode(string) // [0xF0, 0x9F, 0xA6, 0xA5] (UTF-8 bytes)
decoder.decode(uint8Array) // "🦥"
```
`Buffer`s are a Node.js alternative with a few additional features.
```javascript
const string = "🦥"
const buffer = Buffer.from(string) // [0xF0, 0x9F, 0xA6, 0xA5]
buffer.toString() // "🦥"
```
## Width
When displayed, a string occupies a platform-specific width. For example, browsers use pixels, em, etc. We’ll focus on terminals, which use columns.
Terminals print characters in a grid pattern. Computing a string’s width primarily helps with vertical alignment and padding. Also, while terminals do wrap lines automatically, manual wrapping can be needed for similar reasons.
Determining a string’s terminal width is rather intricate.
To begin with, since some terminals or fonts might not handle exotic characters well, cross-platform terminal characters should be preferred for consistent behavior.


Also, while most code points are 1 column wide, fullwidth characters are 2 columns wide. Those usually represent Chinese, Japanese and Korean logograms. Common code points such as ASCII characters are sometimes available as wide or narrow variants. Unicode provides a list with each code point’s width, which can be accessed through some helper modules.

Furthermore, some code points are meant to be combined with another. Those are usually accents and other diacritics. For example, `a` (`U+0061`) followed by a combining grave accent (`U+0300`) is displayed like `à` (`U+00E0`), which is 1 column wide. `string.normalize()` composes/decomposes those.
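For example (the composed and decomposed forms render identically but differ in code points):

```javascript
// Composing and decomposing "à" with string.normalize()
const decomposed = "a\u0300" // 2 code points: "a" + combining grave accent
const composed = decomposed.normalize("NFC") // "\u00E0", a single code point
composed.length // 1
composed.normalize("NFD") === decomposed // true
```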
Other examples include variation selectors: `#` (`U+0023`) succeeded by the emoji variation selector (`U+FE0F`) produces a hashtag emoji #️. Or flags: 🇪 (`U+1F1EA`) and 🇺 (`U+1F1FA`) result in the EU flag 🇪🇺.
Emoji modifiers behave similarly. For instance, 👩 (woman, `U+1F469`) followed by a medium skin tone modifier (`U+1F3FD`), a zero-width joiner (`U+200D`) and 🔬 (microscope, `U+1F52C`) is shown as 👩🏽‍🔬 (woman scientist, medium skin tone), 2 columns wide.
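Grapheme segmentation, available via `Intl.Segmenter` in modern engines, groups such sequences into a single user-perceived character:

```javascript
// An emoji ZWJ sequence is many code units but one grapheme cluster.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
const womanScientist = "\u{1F469}\u{1F3FD}\u200D\u{1F52C}" // "👩🏽‍🔬"
const graphemes = [...segmenter.segment(womanScientist)]
graphemes.length // 1
womanScientist.length // 7 code units
```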

A few code points are even invisible. Among many purposes (even music notation! 🎷), those are intended to join or separate characters, symbols or words (zero-width space `U+200B`, word joiner `U+2060`) and to set text direction (left-to-right mark `U+200E`, right-to-left mark `U+200F`).
Finally, control characters don’t have any width because they are not meant to be printed. Instead, they modify terminal parameters such as cursor position, scrolling, character set, communication, message structure, etc. One of them even emits sounds 🎤. They are divided into several categories:
- C0 control characters (`U+0000` to `U+001F`), which are part of ASCII. Those are the oldest ones, with some dating back to 1870 🚂! They include line feed, null and backspace. Some of them can be represented using backslash sequences such as `\n`, `\0` or `\b`.
- C1 control characters (`U+0080` to `U+009F`), which are rarely used.
- Other code points such as language tags.
- ANSI escape sequences. Those do not have any Unicode code points. They are represented using sequences that start with `\e`. The most common ones change colors, e.g. `\e[31m` sets the font’s color to red. But many more are available. Several modules simplify detecting or stripping them.
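A minimal sketch of stripping the most common (color/style) sequences; real-world modules cover many more escape sequence types. Note that `\e` is written `\x1b` in JavaScript string literals:

```javascript
// Remove SGR sequences like "\x1b[31m" (red) or "\x1b[0m" (reset).
// This intentionally ignores other escape sequence families.
const stripSgr = (string) => string.replace(/\x1b\[[0-9;]*m/g, "")

const red = "\x1b[31mred\x1b[0m"
stripSgr(red) // "red"
```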
With everything considered, manipulating a string’s width in terminals might seem daunting. Fortunately, a few packages help with computing it or slicing/truncating a string to fit a specific amount of columns.
In most cases, the difference between the above units is straightforward. Choosing the right one can prevent some bugs, such as computing a string’s terminal width using `string.length` or matching an emoji string with a `RegExp` lacking the `u` flag.
That being said, their performance cost might vary. For example, converting a large string to/from binary can be slow. Also, JavaScript operations using code units tend to run slightly faster than those using code points. This can lead to preferring a less accurate unit inside critical hot paths.
In a nutshell, each unit presents the same information to different targets: machines (bytes), developers at an implementation (code units) or abstract level (code points), and users (width).