
PDF → Word — Fonts: what survives the trip

Font name mapping: four fallback steps

PDF name: "ABCDEF+TimesNewRoman-Bold" (subset prefix + base + style)

  1. Strip the subset prefix (exactly 6 uppercase letters + plus sign) → TimesNewRoman-Bold
  2. Extract style suffixes (Bold, Italic) → TimesNewRoman + bold attribute
  3. Look up the dictionary of system equivalents → Times New Roman (available everywhere)
  4. Family fallback (if the font is unknown): serif → Times, sans-serif → Arial, monospace → Courier

PDF embeds the actual font program inside the file — every glyph the document needs ships with it. Word stores a font name and trusts the recipient’s machine to find a matching font at render time. Every font problem in PDF→Word conversion comes from this mismatch.

What the PDF carries

Each text run in a PDF references a font object that holds:

Word needs different information: a font name the recipient’s system can resolve, plus separate bold / italic / size attributes outside the font reference.

The mapping pipeline

Step 1. Clean the name. A name like ABCDEF+TimesNewRoman carries a subset prefix. When a PDF embeds only the glyphs it actually uses (a subset of the full font), ISO 32000 requires the embedded font name to start with an arbitrary tag of exactly six uppercase Latin letters followed by a +. The tag keeps names distinct when two documents subset the same font differently and are later merged. Strip the prefix to get TimesNewRoman. Then normalize: drop spaces (Times New Roman → TimesNewRoman), normalize case, and split off style suffixes. TimesNewRoman-Bold, Times-Italic, and ArialMT-BoldItalic all decompose into a base name plus bold/italic attributes.
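The cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not a complete implementation: the function name `clean_font_name` and the suffix list are assumptions made for the example, and real PDFs use more style spellings than the ones listed here. Case folding is left to the lookup step so the returned base name stays readable.

```python
import re

# Subset tag: exactly six uppercase Latin letters followed by "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Illustrative style suffixes mapped to (bold, italic). Not exhaustive.
STYLE_SUFFIXES = {
    "bolditalic": (True, True),
    "boldoblique": (True, True),
    "bold": (True, False),
    "italic": (False, True),
    "oblique": (False, True),
    "regular": (False, False),
}


def clean_font_name(pdf_name):
    """Return (base_name, bold, italic) for a PDF font name."""
    # Strip the subset prefix, then drop spaces.
    name = SUBSET_PREFIX.sub("", pdf_name).replace(" ", "")
    # Split on the last hyphen and check whether the tail is a known style.
    head, sep, tail = name.rpartition("-")
    if sep and tail.lower() in STYLE_SUFFIXES:
        bold, italic = STYLE_SUFFIXES[tail.lower()]
        return head, bold, italic
    # No recognizable style suffix: keep the whole name as the base.
    return name, False, False


print(clean_font_name("ABCDEF+TimesNewRoman-Bold"))
print(clean_font_name("ArialMT-BoldItalic"))
```

A prefix with five letters or with lowercase letters is deliberately left alone, since it does not match the six-uppercase-letter rule.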

Step 2. Match against a known-fonts dictionary. The converter keeps a table of common PDF font names and their system equivalents:

If the cleaned name matches an entry, use the system mapping.
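As a rough sketch, the dictionary step is a case-insensitive lookup on the cleaned base name. The table below is hypothetical and deliberately short; the entries are common PostScript names chosen for illustration, not the contents of any particular converter's table.

```python
# Hypothetical lookup table: keys are cleaned, lowercased PDF base names,
# values are font names that Word can resolve on most systems.
KNOWN_FONTS = {
    "timesnewroman": "Times New Roman",
    "timesnewromanpsmt": "Times New Roman",
    "arial": "Arial",
    "arialmt": "Arial",
    "helvetica": "Arial",
    "couriernew": "Courier New",
    "couriernewpsmt": "Courier New",
}


def lookup_system_font(base_name):
    """Return the mapped system font name, or None if the font is unknown."""
    key = base_name.replace(" ", "").lower()
    return KNOWN_FONTS.get(key)


print(lookup_system_font("TimesNewRoman"))      # known entry
print(lookup_system_font("AcmeCorporateSans"))  # unknown, falls through
```

Returning `None` for an unknown name lets the caller move on to the next fallback step instead of guessing.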

Step 3. Family fallback. For unknown fonts, fall back on the FontDescriptor flags:

These substitutions happen inside the converter, unlike the Helvetica → Arial alias in step 2, which the operating system resolves. The recipient sees the text in a generic family-matched font: visually different, but legible.
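A family fallback driven by the FontDescriptor flags might look like the sketch below. The bit positions for FixedPitch and Serif come from the PDF specification; the function name and the concrete font chosen for each family are assumptions for illustration.

```python
# FontDescriptor /Flags bits (bit 1 is the least significant bit).
FIXED_PITCH = 1 << 0  # bit 1: all glyphs have the same width
SERIF = 1 << 1        # bit 2: glyphs have serifs

# Assumed concrete fonts for each generic family.
FAMILY_FONTS = {
    "monospace": "Courier New",
    "serif": "Times New Roman",
    "sans-serif": "Arial",
}


def family_fallback(flags):
    """Pick a generic substitute font from the FontDescriptor flags."""
    # Check fixed pitch first: a monospace font may also set the Serif bit.
    if flags & FIXED_PITCH:
        family = "monospace"
    elif flags & SERIF:
        family = "serif"
    else:
        family = "sans-serif"
    return FAMILY_FONTS[family]


print(family_fallback(34))  # Serif + Nonsymbolic
print(family_fallback(32))  # Nonsymbolic only
```

The order of the checks matters: testing Serif before FixedPitch would send Courier-like fonts to a proportional serif face.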

Step 4. Embedding. .docx supports font embedding via Word's “Embed fonts in the file” save option. It is rarely used because most commercial font licenses forbid embedding in editable documents. When the source PDF embeds a font that is legally embeddable (Source Sans Pro, Open Sans, the open Google Fonts library), the converter can extract it and embed it into Word, and the recipient sees the exact original face without installing anything. In practice, embedding is off by default and most converted documents render with substituted fonts.
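One technical signal a converter could consult, alongside the license text itself, is the fsType field in the OS/2 table of TrueType and OpenType fonts, where the vendor records embedding permissions. The sketch below is an assumption about how such a check might be written, not a description of any specific converter. It also assumes the embedded font program still carries an OS/2 table, which PDF does not guarantee, and the function name is hypothetical.

```python
# OS/2 fsType embedding bits from the OpenType specification.
RESTRICTED = 0x0002     # must not be embedded
PREVIEW_PRINT = 0x0004  # embedding allowed for read-only viewing and printing
EDITABLE = 0x0008       # embedding allowed in editable documents
BITMAP_ONLY = 0x0200    # only bitmaps may be embedded, not outlines


def may_embed_in_editable_doc(fs_type):
    """Conservative check: do the fsType flags allow editable embedding?"""
    if fs_type & BITMAP_ONLY:
        return False
    usage = fs_type & (RESTRICTED | PREVIEW_PRINT | EDITABLE)
    if usage == 0:
        return True  # no restriction bits set: installable embedding
    # If several bits are set (older fonts), the least restrictive applies.
    return bool(usage & EDITABLE)


print(may_embed_in_editable_doc(0x0000))
print(may_embed_in_editable_doc(0x0004))
```

The flags are a hint from the font vendor, not a substitute for reading the license.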

Attributes carried alongside the font

For every run, the converter also transfers:

Subset fonts and ToUnicode

Most PDFs embed a subset of the font containing only the glyphs the document uses. Subsets do not use standard Unicode encodings; glyph #27 in the subset is whatever character the original needed at that slot. To extract real text the converter must:

  1. Read the ToUnicode map in the font object.
  2. If it is missing, guess from the glyph name via the Adobe Glyph List.
  3. If that fails, render the glyph and OCR it. Few converters bother.
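The first two steps of this chain can be sketched as follows. Everything here is illustrative: `to_unicode` and `glyph_names` stand in for tables a real converter would parse out of the font object, and `AGL_EXCERPT` is a handful of entries from the Adobe Glyph List, not the full list. A return value of `None` marks the glyphs that would need the OCR step.

```python
import re

# Tiny excerpt of the Adobe Glyph List (the real list has thousands of entries).
AGL_EXCERPT = {
    "A": "A",
    "a": "a",
    "space": " ",
    "eacute": "\u00e9",
    "endash": "\u2013",
}

# Glyph naming conventions: "uni" + groups of 4 hex digits, or "u" + 4-6 hex digits.
UNI_NAME = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
U_NAME = re.compile(r"^u([0-9A-F]{4,6})$")


def glyph_to_text(code, to_unicode, glyph_names):
    """Resolve one character code to text, or None if both lookups fail."""
    # Step 1: the ToUnicode map, when present, is authoritative.
    if code in to_unicode:
        return to_unicode[code]
    # Step 2: fall back to the glyph name.
    name = glyph_names.get(code)
    if name is None:
        return None
    name = name.split(".", 1)[0]  # drop variant suffixes such as ".sc"
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    match = UNI_NAME.match(name)
    if match:
        digits = match.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    match = U_NAME.match(name)
    if match and int(match.group(1), 16) <= 0x10FFFF:
        return chr(int(match.group(1), 16))
    return None  # step 3 (render and OCR) would start here


print(glyph_to_text(27, {27: "T"}, {}))
print(glyph_to_text(6, {}, {6: "uni0416"}))
```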

Without a ToUnicode map you get garbage in the Word file. Most modern PDFs ship correct ToUnicode tables. Archival PDFs — especially academic journals from the 1990s and 2000s — often do not. Some PDFs generated by AI tools in 2025–2026 (specific exporters in ChatGPT, Claude, and Gemini) ship with broken ToUnicode and copy out as gibberish.

Custom and CJK fonts

A niche corporate or designer font in the original will not exist on the recipient’s system. Word substitutes a fallback, line heights shift, alignment drifts. The text survives; the layout does not.

CJK is worse. Chinese, Japanese, and Korean PDFs almost always use Type0 (CID-keyed) fonts — a two-level mapping from byte code to CID to glyph. The converter passes the font name through (MS Gothic, SimSun, MS PMincho). If the recipient lacks the named font, substitution is rarely satisfactory: CJK font metrics differ enough that the layout falls apart.

Non-Latin coverage

Cyrillic, Greek, and Asian scripts depend on the substituted font containing the right glyphs. A Cyrillic PDF substituted to Arial works because Arial covers Cyrillic. A substitution that lands on a font with patchy Cyrillic coverage shows tofu (□) where coverage is missing.
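A coverage check before committing to a substitute can be sketched like this. It assumes the converter already has the set of code points the candidate font supports. How that set is obtained (for example, from the font's character map) is outside the sketch, and the function name is hypothetical.

```python
def missing_characters(text, coverage):
    """Return characters in text that the font cannot draw, in first-seen order."""
    missing = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace never renders as tofu
        if ord(ch) not in coverage and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical font that covers printable ASCII only.
ascii_only = set(range(0x20, 0x7F))
print(missing_characters("Привет, world", ascii_only))
```

An empty result means the substitute can draw every character in the run; anything else is a reason to try the next candidate font.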

Character spacing

PDF can specify per-character spacing adjustments — positive or negative — to fit a line to a target width. Word supports character spacing more coarsely. Converters either ignore the small adjustments (fine for body text) or transfer them proportionally when the spacing is large enough to be intentional, like letter-spaced emphasis.
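The ignore-or-transfer decision can be sketched as a threshold relative to the font size. The 5% cutoff and the function name are assumptions chosen for illustration, and the input is assumed to be already converted to points. WordprocessingML stores run-level character spacing in twentieths of a point, hence the multiplication by 20.

```python
def word_char_spacing(char_space_pt, font_size_pt, min_ratio=0.05):
    """Return Word character spacing in twentieths of a point, or None to drop it."""
    if font_size_pt <= 0:
        return None
    # Small adjustments are line-fitting noise; ignore them.
    if abs(char_space_pt) / font_size_pt < min_ratio:
        return None
    # Large adjustments look intentional; carry them over.
    return round(char_space_pt * 20)


print(word_char_spacing(0.1, 10.0))  # below the threshold, dropped
print(word_char_spacing(2.0, 10.0))  # letter-spaced text, transferred
```

Scaling the threshold by font size keeps the same rule usable for footnotes and headings alike.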

What never makes it through

If the font in the Word output differs from the PDF, that is the default outcome, not a bug. Full visual fidelity requires embedding plus a license-permitted font, a rare combination. When exact appearance matters, leave the PDF as a PDF.