PDF → Word — Fonts: what survives the trip
A PDF usually embeds the actual font program inside the file, so every glyph the document needs ships with it. Word stores a font name and trusts the recipient’s machine to find a matching font at render time. Every font problem in PDF→Word conversion comes from this mismatch.
What the PDF carries
Each text run in a PDF references a font object that holds:
- A name (TimesNewRoman, or ABCDEF+TimesNewRoman for a subset).
- A type (Type 1, TrueType, OpenType, CFF, Type 3).
- Optionally, the full font program — or a subset containing only the glyphs the document uses.
- Glyph metrics (the width of each character).
- An encoding table mapping byte codes in the content stream to glyphs.
- Flags for bold, italic, serif, monospace.
- Sometimes, no embedded program at all — just a reference to one of the standard 14 fonts (Helvetica, Times, Courier, etc.) the viewer is required to know.
Word needs different information: a font name the recipient’s system can resolve, plus separate bold / italic / size attributes outside the font reference.
The mapping pipeline
Step 1. Clean the name. A name like
ABCDEF+TimesNewRoman carries a subset prefix:
when a PDF embeds only the glyphs it actually uses (a subset of the
full font), the spec requires the embedded font name to start with a
random 6-letter uppercase tag plus a +. The tag prevents
name collisions if two documents subset the same font differently and
later get merged. Per ISO 32000, the tag is exactly six uppercase Latin letters.
Strip the prefix to get TimesNewRoman. Then normalize: drop
spaces (Times New Roman → TimesNewRoman),
normalize case, and split off style suffixes. TimesNewRoman-Bold,
Times-Italic, ArialMT-BoldItalic all decompose
into a base name plus bold/italic attributes.
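The name-cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not any particular converter's code: the function name `clean_font_name` and the suffix table are assumptions, and real converters recognize many more style tokens. Case normalization is left to the lookup stage.

```python
import re

# Subset tag: exactly six uppercase Latin letters followed by "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Hypothetical style-suffix table: token -> (bold, italic).
STYLE_SUFFIXES = {
    "bolditalic": (True, True),
    "boldoblique": (True, True),
    "bold": (True, False),
    "italic": (False, True),
    "oblique": (False, True),
    "regular": (False, False),
    "roman": (False, False),
}


def clean_font_name(pdf_name: str) -> tuple[str, bool, bool]:
    """Return (base_name, bold, italic) for a PDF font name."""
    # Strip the subset tag, then drop spaces.
    name = SUBSET_PREFIX.sub("", pdf_name).replace(" ", "")
    # Split on the last hyphen and check whether the tail is a known style.
    base, sep, style = name.rpartition("-")
    if sep and style.lower() in STYLE_SUFFIXES:
        bold, italic = STYLE_SUFFIXES[style.lower()]
        return base, bold, italic
    # No recognizable style suffix: keep the whole name as the base.
    return name, False, False
```

For example, `clean_font_name("ABCDEF+TimesNewRoman-Bold")` yields the base name `TimesNewRoman` with the bold attribute set.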
Step 2. Match against a known-fonts dictionary. The converter keeps a table of common PDF font names and their system equivalents:
- TimesNewRoman → Times New Roman (Windows, macOS, most Linux).
- Helvetica → Helvetica on macOS, Arial on Windows. Note: the Helvetica → Arial substitution on Windows happens at the GDI/DirectWrite layer, not in the converter; Windows ships with the alias built in.
- Courier → Courier New (Windows) or Courier (macOS).
- Calibri, Cambria: Microsoft’s modern fonts, present on every Office install.
If the cleaned name matches an entry, use the system mapping.
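A dictionary lookup of this kind might look like the sketch below. The table contents mirror the examples in the text; the structure (`KNOWN_FONTS`, a per-platform override, the `lookup_known_font` name) is an assumption made for illustration.

```python
from typing import Optional

# Hypothetical known-fonts table, keyed by lowercased, space-free PDF name.
# "default" is used unless the target platform has its own entry.
KNOWN_FONTS = {
    "timesnewroman": {"default": "Times New Roman"},
    # Written through unchanged; Windows aliases Helvetica to Arial itself.
    "helvetica": {"default": "Helvetica"},
    "courier": {"default": "Courier New", "macos": "Courier"},
    "calibri": {"default": "Calibri"},
    "cambria": {"default": "Cambria"},
}


def lookup_known_font(base_name: str, platform: str = "windows") -> Optional[str]:
    """Return the system font name for a cleaned PDF name, or None if unknown."""
    entry = KNOWN_FONTS.get(base_name.replace(" ", "").lower())
    if entry is None:
        return None  # caller falls through to the family fallback
    return entry.get(platform, entry["default"])
```

Returning `None` for an unknown name keeps the decision about fallback fonts out of the table and in the next step.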
Step 3. Family fallback. For unknown fonts, fall back on the FontDescriptor flags:
- Serif → Times New Roman.
- Sans-serif → Arial.
- Monospace → Courier New.
- Cursive → whatever script font happens to be installed; there is no good universal choice.
These substitutions happen inside the converter (the Helvetica → Arial alias from step 2 was the OS doing the work). The recipient sees the text in a generic family-matched font: visually different, legible.
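The family fallback can be sketched directly from the FontDescriptor Flags bit field. The bit positions below (FixedPitch = bit 1, Serif = bit 2, Script = bit 4, counting from 1) follow the PDF specification; the function name and the choice to return `None` for script fonts are assumptions for illustration.

```python
from typing import Optional

# FontDescriptor.Flags bits (the spec numbers bits from 1).
FIXED_PITCH = 1 << 0  # bit 1
SERIF = 1 << 1        # bit 2
SCRIPT = 1 << 3       # bit 4


def fallback_family(flags: int) -> Optional[str]:
    """Pick a generic substitute from the descriptor flags."""
    # Check fixed pitch first: Courier-like fonts often set Serif as well.
    if flags & FIXED_PITCH:
        return "Courier New"
    if flags & SCRIPT:
        return None  # no good universal script font; caller must decide
    if flags & SERIF:
        return "Times New Roman"
    return "Arial"  # no Serif flag set: treat as sans-serif
```

The order of the checks matters: testing Serif before FixedPitch would send a monospaced slab-serif font to Times New Roman.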
Step 4. Embedding. .docx supports font
embedding via the “Embed fonts in the file” property. It is rarely used
because most commercial fonts forbid embedding in editable documents
under their licenses. When the source PDF embeds a font that is
legally embeddable — Source Sans Pro, Open Sans, the open Google Fonts
library — the converter can extract it and embed it into Word, and the
recipient sees the exact original face without installing anything. In
practice, embedding is off by default and most converted documents
render with substituted fonts.
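One technical signal a converter could consult before embedding is the `fsType` field of the font's OS/2 table, which records the embedding permissions the vendor set. The sketch below assumes the font program actually has an OS/2 table (many Type 1 and bare CFF programs do not) and that the integer has already been read; the function name is hypothetical. The flag is only a hint, and the license text remains the authority.

```python
# OS/2 fsType bits from the OpenType specification.
RESTRICTED = 0x0002         # no embedding without explicit permission
PREVIEW_AND_PRINT = 0x0004  # read-only documents only
EDITABLE = 0x0008           # may be embedded in editable documents
BITMAP_ONLY = 0x0200        # only bitmaps may be embedded


def may_embed_in_editable(fs_type: int) -> bool:
    """Rough check: do the fsType bits allow embedding in an editable .docx?"""
    if fs_type & BITMAP_ONLY:
        return False  # outlines may not be embedded at all
    usage = fs_type & (RESTRICTED | PREVIEW_AND_PRINT | EDITABLE)
    if usage == 0:
        return True  # "installable embedding": no restriction bits set
    # If several bits are set, the least restrictive one (Editable) applies.
    return bool(usage & EDITABLE)
```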
Attributes carried alongside the font
For every run, the converter also transfers:
- Size in points. PDF and Word both measure in pt; this is a direct copy.
- Color. PDF can be DeviceRGB, DeviceGray, CMYK, Lab, or ICC-tagged. The converter normalizes to RGB (Word’s only color model) using color management when an ICC profile is present.
- Bold via FontDescriptor.FontWeight. The spec defines weights 100–900 in steps of 100; normal = 400, bold = 700. Semibold (600) is borderline; different converters draw the line differently. Output: <w:b/>.
- Italic via FontDescriptor.ItalicAngle ≠ 0 or bit 7 (Italic) of FontDescriptor.Flags. Output: <w:i/>.
- Underline is not a font property in PDF; it is a separate line drawn beneath the glyphs. The converter has to associate the line with the run to recognize it as an underline. If it fails, the line stays in the document as an unanchored vector.
- Strikethrough — same problem, same fix.
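The size, color, bold, and italic transfers can be sketched as a function that emits a WordprocessingML run-properties fragment. All names here are illustrative assumptions. The CMYK conversion is the naive formula with no color management, and the output omits the namespace declaration a real document part would carry. Note that although both formats measure in points, the `w:sz` attribute stores half-points.

```python
from typing import Optional, Tuple
from xml.sax.saxutils import quoteattr

ITALIC_FLAG = 1 << 6  # bit 7 of FontDescriptor.Flags, counting from 1


def cmyk_to_rgb(c: float, m: float, y: float, k: float) -> Tuple[int, ...]:
    """Naive CMYK (0..1) to 8-bit RGB; ignores ICC profiles entirely."""
    return tuple(round(255 * (1 - x) * (1 - k)) for x in (c, m, y))


def run_properties_xml(font_name: str, size_pt: float,
                       font_weight: Optional[int] = None,
                       italic_angle: float = 0.0, flags: int = 0,
                       rgb: Tuple[int, int, int] = (0, 0, 0),
                       bold_threshold: int = 700) -> str:
    """Build a <w:rPr> fragment from PDF font descriptor values."""
    # bold_threshold is where this sketch draws the semibold line.
    bold = font_weight is not None and font_weight >= bold_threshold
    italic = italic_angle != 0 or bool(flags & ITALIC_FLAG)
    name = quoteattr(font_name)  # escapes and adds the surrounding quotes
    parts = [f"<w:rFonts w:ascii={name} w:hAnsi={name}/>"]
    if bold:
        parts.append("<w:b/>")
    if italic:
        parts.append("<w:i/>")
    r, g, b = rgb
    parts.append(f'<w:color w:val="{r:02X}{g:02X}{b:02X}"/>')
    parts.append(f'<w:sz w:val="{round(size_pt * 2)}"/>')  # half-points
    return "<w:rPr>" + "".join(parts) + "</w:rPr>"
```

Passing `bold_threshold=600` makes semibold faces come out bold, which is the kind of per-converter choice the text describes.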
Subset fonts and ToUnicode
Most PDFs embed a subset of the font containing only the glyphs the document uses. Subsets do not use standard Unicode encodings; glyph #27 in the subset is whatever character the original needed at that slot. To extract real text the converter must:
- Read the ToUnicode map in the font object.
- If it is missing, guess from the glyph name via the Adobe Glyph List.
- If that fails, render the glyph and OCR it. Few converters bother.
Without a ToUnicode map, and with no fallback that succeeds, you get garbage in the Word file. Most modern PDFs ship correct ToUnicode tables. Archival PDFs, especially academic journals from the 1990s and 2000s, often do not. Some PDFs generated by AI tools in 2025–2026 (specific exporters in ChatGPT, Claude, and Gemini) ship with broken ToUnicode and copy out as gibberish.
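The first two stages of that fallback chain can be sketched as follows. The inputs are assumed to be already parsed: `to_unicode` maps character codes to strings, and `glyph_names` maps codes to glyph names taken from the font's encoding. The `AGL_EXCERPT` table is a tiny stand-in for the real Adobe Glyph List, and the OCR stage is left out; an undecodable code becomes U+FFFD.

```python
import re
from typing import Mapping, Optional

# A few entries standing in for the full Adobe Glyph List.
AGL_EXCERPT = {"A": "A", "space": " ", "eacute": "\u00E9", "fi": "\uFB01"}

# Glyph names of the form "uniXXXX" carry the code point directly.
UNI_NAME = re.compile(r"uni([0-9A-F]{4})")


def glyph_name_to_text(name: str) -> Optional[str]:
    """Map a glyph name to text, or None if the name is not recognized."""
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    match = UNI_NAME.fullmatch(name)
    if match:
        return chr(int(match.group(1), 16))
    return None


def decode_code(code: int, to_unicode: Mapping[int, str],
                glyph_names: Mapping[int, str]) -> str:
    """Decode one character code: ToUnicode first, then the glyph name."""
    if code in to_unicode:
        return to_unicode[code]
    name = glyph_names.get(code)
    if name is not None:
        text = glyph_name_to_text(name)
        if text is not None:
            return text
    return "\uFFFD"  # replacement character; an OCR pass could go here
```

A subset font whose glyphs are named `g27`, `g28`, and so on defeats the second stage, which is why such files extract as gibberish.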
Custom and CJK fonts
A niche corporate or designer font in the original will not exist on the recipient’s system. Word substitutes a fallback, line heights shift, alignment drifts. The text survives; the layout does not.
CJK is worse. Chinese, Japanese, and Korean PDFs almost always use
Type0 (CID-keyed) fonts — a two-level mapping from byte code to CID to
glyph. The converter passes the font name through
(MS Gothic, SimSun, MS PMincho).
If the recipient lacks the named font, substitution is rarely
satisfactory: CJK font metrics differ enough that the layout falls
apart.
Non-Latin coverage
Cyrillic, Greek, and Asian scripts depend on the substituted font
containing the right glyphs. A Cyrillic PDF substituted to Arial works —
Arial covers Cyrillic. A substitution that lands on a font with patchy
Cyrillic coverage shows tofu (☐) where coverage is
missing.
Character spacing
PDF can specify per-character spacing adjustments — positive or negative — to fit a line to a target width. Word supports character spacing more coarsely. Converters either ignore the small adjustments (fine for body text) or transfer them proportionally when the spacing is large enough to be intentional, like letter-spaced emphasis.
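The "ignore small, transfer large" policy can be sketched as a threshold on the PDF character-spacing value. Word stores run spacing in `w:spacing` as twentieths of a point. The sketch assumes the spacing has already been converted to points (that is, the text matrix applies no extra scaling), and the 0.25 pt cutoff is an arbitrary illustrative value, not one taken from any converter.

```python
def char_spacing_xml(tc_pt: float, min_abs_pt: float = 0.25) -> str:
    """Return a <w:spacing> element for a PDF character spacing, or ""."""
    # Small adjustments are treated as line-fitting noise and dropped.
    if abs(tc_pt) < min_abs_pt:
        return ""
    # Word expresses character spacing in twentieths of a point (signed).
    return f'<w:spacing w:val="{round(tc_pt * 20)}"/>'
```

For example, a deliberate 1.5 pt letter-spacing becomes `w:val="30"`, while a 0.05 pt fitting adjustment produces no element at all.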
What never makes it through
- Lining vs old-style figures. PDF can specify the figure variant; Word does not distinguish.
- Ligatures (fi, fl, ffi). Either expanded back to component letters or kept as Unicode ligatures (U+FB01), depending on settings.
- Kerning. Word applies its own kerning algorithm; PDF kerning is ignored.
- OpenType features — small caps, swashes, alternates — mostly lost.
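The ligature setting mentioned above can be sketched as a small translation table over the Latin ligature code points. The function name and the `keep` flag are assumptions for illustration; a targeted table is used instead of full Unicode compatibility normalization so that no other characters are altered.

```python
# Latin ligatures from the Alphabetic Presentation Forms block.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str, keep: bool = False) -> str:
    """Expand ligature code points to plain letters unless keep is set."""
    return text if keep else text.translate(_LIGATURE_TABLE)
```

Expanding is usually the better default for editable output, since search and spell-checking in Word operate on the component letters.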
If the font in the Word output differs from the PDF, that is the default outcome, not a bug. Full visual fidelity requires embedding plus a license-permitted font, a rare combination. When exact appearance matters, leave the PDF as a PDF.