PDF → Word — Why the round trip is so unreliable

PDF describes how a page looks; Word describes what a document is.

[Figure: two panels. "PDF": a content stream such as cm 1 0 0 1 245.7 612.3 (position) and Tj "H" (draw the letter), followed by the same pair for "e" at 256.4 and "l" at 263.1. Coordinates plus drawing instructions, no concept of paragraph or table: "how the page should look". "Word (.docx)": <w:document> containing <w:body>, which contains <w:p> (paragraph) with <w:r>"Hello"</w:r> and <w:tbl> (table) with <w:tr>...</w:tr>. A hierarchy with no coordinates; Word decides the layout itself: "what the document is made of".]

Run the same PDF through three converters and you will get three broken Word files, each broken in a different way. One assembles tables but scatters the columns. Another keeps columns but folds every subheading into the body text. The third turns every visual line into its own one-line paragraph.

The two formats describe different things

PDF is a presentation format. Inside the file is a stream of drawing instructions: “place the text cursor at (245.7, 612.3) and draw the letter P in TimesNewRoman 12pt”, “draw a rectangle from (50, 500) to (550, 502) filled in black”, “render image #3 with this transformation matrix”. There is no notion of paragraph, table, or list — only graphical objects with coordinates.

Word is a structural format. Inside .docx is a hierarchy: a document contains sections, sections contain paragraphs and tables, paragraphs contain runs of identically formatted text, tables contain rows and cells, lists carry levels and numbering. There are no coordinates. Word decides what lands on which page based on the fonts and margins available at render time.
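The contrast can be made concrete with a minimal sketch of the two data models. The class names below (Glyph, Run, Paragraph, Document) are hypothetical and exist only for illustration; they are not the API of any PDF or Word library, and the font name and coordinates are sample values.

```python
from dataclasses import dataclass, field


@dataclass
class Glyph:
    """PDF side: one drawn character. Position and appearance, nothing else."""
    x: float
    y: float
    char: str
    font: str
    size: float


@dataclass
class Run:
    """Word side: a stretch of identically formatted text."""
    text: str
    font: str
    size: float


@dataclass
class Paragraph:
    runs: list[Run] = field(default_factory=list)


@dataclass
class Document:
    paragraphs: list[Paragraph] = field(default_factory=list)


# What a PDF parser hands back: a flat list. Nothing in the records says
# that these glyphs belong to the same word, line, or paragraph.
pdf_page = [
    Glyph(245.7, 612.3, "H", "TimesNewRoman", 12),
    Glyph(256.4, 612.3, "e", "TimesNewRoman", 12),
    Glyph(263.1, 612.3, "l", "TimesNewRoman", 12),
]

# What a .docx stores: nesting, and no coordinates anywhere.
word_doc = Document([Paragraph([Run("Hello", "TimesNewRoman", 12)])])

glyph_fields = set(Glyph.__dataclass_fields__)
run_fields = set(Run.__dataclass_fields__)
print(glyph_fields - run_fields)  # the fields that only the PDF side has
```

A converter has to turn the first shape into the second, and the fields it needs on the Word side (which paragraph, which run) are exactly the ones the PDF side never recorded.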

From the coordinates of a single letter you cannot derive which paragraph it belongs to. From a horizontal line you cannot tell whether it is a table border, an underline beneath a heading, or a decorative rule. Every structural decision the converter makes is a guess from visual placement.

The exception is tagged PDF: files with an internal structure tree carrying tags like <H1>, <P>, <Table>, <Caption>, usually added for accessibility (PDF/UA, or the "a" conformance levels of PDF/A such as PDF/A-1a and PDF/A-2a). When the tags are present and accurate, the converter has nothing to guess. Tagged files make up a small minority of real-world traffic. A scanned report or a PDF saved from a browser carries no tags at all.

Eleven steps, two of them mechanical

Conversion decomposes into eleven discrete operations:

  1. Parse the PDF. Extract every text run, image, and graphics path along with its coordinates. The PDF specification is published; libraries exist; this is bookkeeping.
  2. Reconstruct reading order. Sort runs the way a reader’s eye would scan — top to bottom, left to right, with corrections for columns, callouts, footnotes, and rotated text.
  3. Group runs into lines. Runs whose letters share a baseline, that is, sit at the same vertical position on the page within a small tolerance, belong to the same line of text.
  4. Group lines into paragraphs. Apply proximity, shared indentation, and gap-size heuristics.
  5. Find tables. Detect grids of intersecting separators, bind text to cells.
  6. Find lists. Cluster paragraphs that begin with bullets or numbering.
  7. Find headers, footers, and footnotes. Identify blocks that repeat across pages or sit beneath a separator at page bottom.
  8. Find columns. Detect vertical white gaps that span the page; read each column to its end before moving on.
  9. Map fonts. Translate PDF font references to system fonts the recipient is likely to have.
  10. Anchor images. Decide which paragraph each picture belongs to and how it sits within the flow.
  11. Pack the .docx. Assemble a ZIP container with the right XML structure.
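As an illustration of how simple these steps look on paper, here is a minimal sketch of step 3 in Python. The TextRun type, the group_into_lines function, and the 2-point baseline tolerance are all assumptions made for this example, not taken from any particular converter.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    x: float   # left edge, in points
    y: float   # baseline, in points (PDF y grows upward from the page bottom)
    text: str


def group_into_lines(runs, baseline_tol=2.0):
    """Cluster runs whose baselines lie within baseline_tol points."""
    lines = []
    # Descending y puts the top of the page first.
    for run in sorted(runs, key=lambda r: -r.y):
        # Compare against the first (highest) run of the current line.
        if lines and abs(lines[-1][0].y - run.y) <= baseline_tol:
            lines[-1].append(run)
        else:
            lines.append([run])
    # Within each line, read left to right.
    return [sorted(line, key=lambda r: r.x) for line in lines]


def line_text(line):
    return " ".join(r.text for r in line)


# Runs arrive in drawing order, which need not be reading order.
runs = [
    TextRun(150.0, 700.0, "trip"),
    TextRun(72.0, 686.0, "is"),
    TextRun(72.0, 700.3, "The"),
    TextRun(100.0, 699.8, "round"),
    TextRun(90.0, 686.1, "unreliable"),
]

lines = [line_text(line) for line in group_into_lines(runs)]
print(lines)  # ['The round trip', 'is unreliable']
```

The tolerance is already a guess. At 2 points this sketch would split a superscript off its line, and on a two-column page it would glue together lines from both columns whenever their baselines happen to match.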

Steps 1 and 11 are mechanical. Steps 2 through 10 are heuristics that work on some documents and fail on others, and their errors compound: a wrong reading order in step 2 corrupts every grouping built on top of it.
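To show how mechanical step 11 is, the following sketch packs a list of plain-text paragraphs into a .docx container using only Python's standard library. The function names document_xml and pack_docx are made up for this example. The three parts written here are assumed to be the bare minimum of the package; a real converter would also write styles, numbering, and font information.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    XML_DECL
    + '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

# Points the package at its main document part.
RELS = (
    XML_DECL
    + '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)


def document_xml(paragraphs):
    """One <w:p> per paragraph, each holding a single run of escaped text."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    return f'{XML_DECL}<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'


def pack_docx(paragraphs):
    """Return the bytes of a .docx (a ZIP archive) containing the paragraphs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", document_xml(paragraphs))
    return buf.getvalue()


docx_bytes = pack_docx(["Hello", "Tom & Jerry"])
```

Writing docx_bytes to a file with a .docx extension gives a document with two paragraphs. Nothing in this step involves judgment, which is the point: all the hard decisions were made before the paragraph list existed.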

Why the same converter wins one document and loses the next

Every heuristic in steps 2 through 10 boils down to a threshold. To decide whether two adjacent paragraphs should merge, the converter compares their vertical gap against a value T. With T = 1.5 line heights, one document comes out clean and the next collapses two sections into one. No T works everywhere — different designers use different leading.
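The threshold problem can be reproduced in a few lines. This is a sketch under stated assumptions: split_paragraphs is a hypothetical function, line height is taken as a fixed 12 points, and the two sample documents are invented to show the conflict.

```python
def split_paragraphs(baselines, line_height, T=1.5):
    """Split a top-to-bottom list of line baselines into paragraphs.

    A new paragraph starts when the gap to the previous line exceeds
    T * line_height. Returns paragraphs as lists of line indices.
    """
    if not baselines:
        return []
    paragraphs = [[0]]
    for i in range(1, len(baselines)):
        gap = baselines[i - 1] - baselines[i]  # PDF y grows upward
        if gap > T * line_height:
            paragraphs.append([i])
        else:
            paragraphs[-1].append(i)
    return paragraphs


# Both documents hold two paragraphs: lines 0-2 and lines 3-4.
# Document A: loose leading (18 pt between lines), 30 pt between paragraphs.
doc_a = [700.0, 682.0, 664.0, 634.0, 616.0]
# Document B: tight leading (13 pt between lines), only 17 pt between paragraphs.
doc_b = [700.0, 687.0, 674.0, 657.0, 644.0]

print(split_paragraphs(doc_a, 12, T=1.5))  # [[0, 1, 2], [3, 4]]  correct
print(split_paragraphs(doc_b, 12, T=1.5))  # [[0, 1, 2, 3, 4]]   merged
print(split_paragraphs(doc_a, 12, T=1.3))  # [[0], [1], [2], [3], [4]]  shattered
print(split_paragraphs(doc_b, 12, T=1.3))  # [[0, 1, 2], [3, 4]]  correct
```

Document A needs T of at least 1.5 to keep its lines together, while document B needs T below about 1.42 to see its paragraph break. No single value satisfies both, and lowering T to rescue B produces the one-line-per-paragraph failure on A.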

Tables work the same way. Is this line a table border or decoration? Are these two near-parallel verticals one column edge with sub-pixel jitter, or two narrow columns? Tighten the thresholds and you miss real tables; loosen them and you reconstruct decorative frames as one-cell tables. Every converter chooses a different point on this trade-off curve, which is why the failure profiles look like fingerprints.
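The column-edge question can be sketched the same way. The column_edges function and the sample x-positions below are hypothetical; the sketch only shows how one tolerance value changes the number of columns a converter would report.

```python
def column_edges(xs, tol):
    """Merge x-positions of vertical rules that lie within tol points.

    Each rule is compared with the last one already in the current group,
    so a chain of close rules collapses into a single edge. Returns the
    mean x of each group.
    """
    groups = []
    for x in sorted(xs):
        if groups and x - groups[-1][-1] <= tol:
            groups[-1].append(x)
        else:
            groups.append([x])
    return [sum(g) / len(g) for g in groups]


# Vertical rules found on a page, in points. Are 200.0 and 203.0 one edge
# drawn twice, or the two sides of a very narrow column?
rules = [72.0, 72.4, 200.0, 203.0, 330.0]

strict = column_edges(rules, tol=0.5)
loose = column_edges(rules, tol=4.0)

print(len(strict) - 1)  # 3 columns, one of them 3 pt wide
print(len(loose) - 1)   # 2 columns
```

Both answers are defensible from the geometry alone. Which one is right depends on what the author intended, and that information is not in the file.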