SandVox

Character Encoding in Game Localization — UTF-8, Unicode, and CJK

Character encoding is the system that maps text characters to binary values that computers can store and process. For game localization, character encoding determines which characters can appear in your game’s text, which languages your engine can render, and what happens when text is displayed across different platforms and operating systems.

ASCII vs Unicode

ASCII defines 128 characters: the unaccented English alphabet, digits, basic punctuation, and control codes. It cannot represent accented Western European letters such as é or ñ, let alone non-Latin scripts. Unicode is the modern standard. Its code space has room for just over 1.1 million code points (1,114,112), of which roughly 150,000 are currently assigned to characters covering virtually all world writing systems. UTF-8, UTF-16, and UTF-32 are encoding forms of Unicode that determine how Unicode code points are stored as bytes. Any game shipping in Japanese, Chinese, Korean, Arabic, Russian, Thai, or Hebrew requires full Unicode support.
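As a rough illustration of the difference between a code point and its encoded bytes, the following Python sketch prints the code point of a few sample characters and how many bytes each one takes in UTF-8, UTF-16, and UTF-32. The sample characters and the helper names describe and fits_in_ascii are arbitrary choices made for this example.

```python
# Illustrative sketch: the sample characters and helper names are assumptions for this example.
samples = ["A", "\u00e9", "\u65e5", "\ud55c", "\u0628"]  # A, é, 日, 한, ب


def describe(ch: str) -> dict:
    """Return the code point of one character and its byte length in each encoding form."""
    return {
        "char": ch,
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # The "-le" variants are used so Python does not prepend a byte order mark.
        "utf16_bytes": len(ch.encode("utf-16-le")),
        "utf32_bytes": len(ch.encode("utf-32-le")),
    }


def fits_in_ascii(text: str) -> bool:
    """True only if every character falls in the 128-character ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in samples:
    print(describe(ch))

print(fits_in_ascii("Start Game"))          # True
print(fits_in_ascii("caf\u00e9"))           # False: é is outside ASCII
```

The same code point always identifies the same character; only the number of bytes used to store it changes between encoding forms.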

UTF-8 vs UTF-16 in Game Engines

UTF-8 uses a variable-length encoding (1 to 4 bytes per character) and is backward-compatible with ASCII. It is the standard for web-based games and most modern engines. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane (BMP), which covers all common writing systems, and 4 bytes (a surrogate pair) for everything else, such as most emoji and rare CJK ideographs. It is common in Windows applications and is used internally by Unreal Engine. The practical difference: common CJK characters require 3 bytes in UTF-8 but only 2 bytes in UTF-16, making UTF-16 more storage-efficient for CJK-heavy content, while ASCII text takes 1 byte per character in UTF-8 and 2 bytes in UTF-16. UTF-8 is the recommended format for game localization file deliverables unless the engine requires otherwise.
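The size trade-off can be checked directly. The sketch below compares the encoded size of an English string, a Japanese string, and one rare ideograph outside the BMP. The strings and the helper name encoded_sizes are made up for illustration, and real string files also contain ASCII keys and markup that shift the totals.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_size, utf16_size) in bytes, measured without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "New Game"                         # 8 ASCII characters
japanese = "\u65b0\u3057\u3044\u7269\u8a9e"  # 新しい物語, 5 characters, all in the BMP
rare = "\U00020BB7"                          # 𠮷, a CJK ideograph outside the BMP

print(encoded_sizes(english))   # (8, 16): UTF-16 doubles ASCII text
print(encoded_sizes(japanese))  # (15, 10): UTF-16 is smaller for BMP CJK text
print(encoded_sizes(rare))      # (4, 4): both need 4 bytes outside the BMP
```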

Common Encoding Bugs in Localized Games

Mojibake: garbled characters (appearing as ?, â€™, or strings of random symbols) caused by a mismatch between the encoding used to write a string and the encoding used to read it. Most commonly caused by reading UTF-8 files as ISO-8859-1 or Windows-1252.

Missing glyphs: a character exists in Unicode but not in the game’s loaded font, so it is displayed as an empty box or a fallback glyph.

Byte order mark (BOM) conflicts: a UTF-8 BOM (the bytes EF BB BF) at the start of a file causes parsing errors in engines that expect UTF-8 without a BOM.

Platform-specific encoding differences: Windows, macOS, iOS, and Android handle certain edge-case Unicode characters differently.
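To make the BOM conflict concrete, here is a minimal Python sketch in which a JSON string file saved with a UTF-8 BOM fails to parse after a plain UTF-8 decode, then parses once the BOM is dropped. The JSON payload and key name are hypothetical, and a given engine's parser may behave differently from Python's json module.

```python
import codecs
import json

# Hypothetical string file content; the key name is invented for this example.
payload = '{"menu.start": "Start"}'
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")  # bytes EF BB BF + JSON text

# A plain "utf-8" decode keeps the BOM as the first character, U+FEFF.
naive = with_bom.decode("utf-8")
try:
    json.loads(naive)
    parsed_naively = True
except json.JSONDecodeError:
    parsed_naively = False  # Python's JSON parser rejects a leading BOM

# "utf-8-sig" strips a leading BOM if present and is harmless if there is none.
clean = with_bom.decode("utf-8-sig")
strings = json.loads(clean)

print(parsed_naively)  # False
print(strings)         # {'menu.start': 'Start'}
```

Decoding with utf-8-sig on import, and writing files back out as plain UTF-8, is one way to keep BOMs from reaching the engine.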

SandVox and Character Encoding

SandVox delivers all translated string files in UTF-8 encoding unless your engine requires otherwise. Character encoding requirements are audited during scoping for CJK and right-to-left language projects — encoding bugs discovered at LocQA require string file rework; encoding bugs discovered at platform certification submission require emergency fixes against a deadline.
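One inexpensive way to catch encoding problems before LocQA is to verify that every delivered file decodes as strict UTF-8. The sketch below is a generic check, not a description of SandVox's actual tooling, and the function name first_invalid_utf8_offset is an assumption made for this example.

```python
from typing import Optional


def first_invalid_utf8_offset(raw: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if the data is valid."""
    try:
        raw.decode("utf-8", errors="strict")
        return None
    except UnicodeDecodeError as exc:
        return exc.start


good = "\u8a2d\u5b9a".encode("utf-8")  # 設定 saved as UTF-8
bad = "caf\u00e9".encode("cp1252")     # café saved as Windows-1252: é becomes the lone byte 0xE9

print(first_invalid_utf8_offset(good))  # None
print(first_invalid_utf8_offset(bad))   # 3
```

In a build pipeline, the same function could be run over the bytes of each string file, failing the build when it returns an offset.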

Related terms: Double Byte Characters · Game Internationalization · Pseudo Localization · Language Code

Frequently Asked Questions

How do I know if my game engine supports Unicode?

Most modern engines do: Unity, Unreal, and Godot all support Unicode. The test: import a UTF-8 string containing characters from your target languages (Japanese kanji, Arabic letters, Russian Cyrillic) and verify they render correctly in your game build. If characters appear as empty boxes, the loaded font is usually missing those glyphs. If they appear as question marks or garbled symbols, the more likely cause is an encoding configuration issue somewhere in the pipeline.
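Before running that render test, it can help to know exactly which non-ASCII characters your strings contain, so the list can be compared against the font's coverage. The following sketch groups them by the first word of each character's Unicode name, which is only a rough stand-in for its script. The function name script_inventory and the sample strings are assumptions for illustration.

```python
import unicodedata
from collections import Counter


def script_inventory(strings):
    """Collect the distinct non-ASCII characters and count them per rough script group."""
    chars = {ch for s in strings for ch in s if ord(ch) > 0x7F}
    groups = Counter()
    for ch in chars:
        # The first word of the Unicode name (CJK, CYRILLIC, ARABIC, ...) approximates the script.
        name = unicodedata.name(ch, "UNNAMED")
        groups[name.split()[0]] += 1
    return chars, groups


sample_strings = [
    "\u958b\u59cb",                      # 開始 (Japanese)
    "\u0421\u0442\u0430\u0440\u0442",    # Старт (Russian)
    "\u0645\u0631\u062d\u0628\u0627",    # Arabic greeting, written with escapes to avoid bidi display issues
]

chars, groups = script_inventory(sample_strings)
print(len(chars))     # 11 distinct non-ASCII characters
print(dict(groups))   # counts for CJK, CYRILLIC, and ARABIC
```

The resulting character set is what the game's fonts must cover. Checking the fonts themselves requires a font-inspection tool, which is outside the scope of this sketch.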

What is mojibake and why does it happen?

Mojibake is garbled text — a mix of strange symbols where readable text should appear — caused by a mismatch between the encoding used to write a string and the encoding used to read it. For example, reading UTF-8 text as ISO-8859-1 produces mojibake for any non-ASCII characters. Consistent use of UTF-8 throughout your entire pipeline (engine, CAT tool, string files, build system) prevents it.
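The mismatch is easy to reproduce. This sketch encodes text as UTF-8, decodes it with the wrong encoding, and then reverses the mistake. It uses Windows-1252 rather than ISO-8859-1 because the garbled result is easier to see, and the function names garble and repair are invented for this example. The repair step only works when the wrong decode lost no bytes.

```python
def garble(text: str, wrong_encoding: str = "cp1252") -> str:
    """Simulate mojibake: write as UTF-8, then read with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo the mistake by re-encoding with the wrong encoding and decoding as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "caf\u00e9"          # café
broken = garble(original)

print(broken)                   # cafÃ©
print(garble("\u2019"))         # â€™ (a curly apostrophe read as Windows-1252)
print(repair(broken))           # café
print(garble("Start Game"))     # Start Game: ASCII text is unaffected
```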

Does character encoding affect localization file sizes?

Yes, modestly. In UTF-8, CJK text takes about 3 bytes per character, while European text takes 1 to 2 bytes per character. In UTF-16, every BMP character takes 2 bytes, so an ASCII-only file is twice as large in UTF-16 as in UTF-8, while CJK-heavy text is up to about one third smaller. In practice, character encoding has minimal impact on total game file size compared to audio and texture assets, but it can matter for web-based games with strict loading time budgets.

Need Expert Game Localization?

SandVox provides end-to-end game localization, including character encoding audits, for narrative games, mobile titles, webtoons, and interactive fiction.