
PDF → Word — Tables — the hardest part

[Figure: Table detection, from line intersections to grid to cells. Stages shown: lines in the PDF; where lines cross (cell corners); interval tree / sweep-line; grid of 3 columns × 3 rows = 9 cells; text bound to cells by coordinates.]

A table in a PDF is a stack of horizontal and vertical lines, plus text fragments at coordinates that happen to fall inside the rectangles those lines suggest. The PDF file contains no notion of row, column, or cell. Reconstructing structure from geometry is layout reverse-engineering, and quality across tools varies by an order of magnitude.

What lives inside a “table” region

A table region contains:

  1. Text runs — cell contents at coordinates.
  2. Graphical paths — separator lines, mostly horizontal and vertical, occasionally diagonal.
  3. Sometimes — fills — colored rectangles for zebra striping or cell highlights.

There is no link from a line to the text near it, and no explicit grid.

The converter’s job is eight steps:

  1. Filter out lines that are decoration rather than table separators.
  2. Sanity-check that the surviving lines could form a table at all.
  3. Compute their intersections, the candidate cell corners.
  4. Build a grid from the corners.
  5. Bind each text run to the cell it falls inside.
  6. Detect merged cells.
  7. Validate that the result really is a table.
  8. Serialize it as <w:tbl> in the Word XML.

Each step has failure modes that compound.

Step 1. Filter line candidates

The same drawing primitives produce underlines beneath headings, signature rules in forms, illustration frames, chart gridlines, and parts of vector logos. The converter filters by:

Lines that survive filtering become candidates.

Step 2. Sanity-check the candidate set

Two lines do not make a table. Three lines, one horizontal and two vertical, can at most suggest a single-cell box, which is the absolute minimum. A four-line frame that covers 50–90% of the page area is probably a page border or a large form field, not a table; ignore it.
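A minimal sketch of that sanity check, under stated assumptions: the `Line` type, the `plausible_table` name, and the `frame_min` parameter are hypothetical, and the check treats any bare four-line frame at or above the 50% lower bound as a page border.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def horizontal(self) -> bool:
        # A line counts as horizontal when it extends further in x than in y.
        return abs(self.y1 - self.y0) < abs(self.x1 - self.x0)


def plausible_table(lines, page_w, page_h, frame_min=0.5):
    """Return True if the candidate lines could plausibly form a table."""
    h = [ln for ln in lines if ln.horizontal]
    v = [ln for ln in lines if not ln.horizontal]
    # Fewer than three lines, or lines in one direction only, cannot box a cell.
    if len(lines) < 3 or not h or not v:
        return False
    # A bare four-line frame covering a large share of the page is more
    # likely a page border or form field than a table (assumed threshold).
    if len(h) == 2 and len(v) == 2:
        xs = [c for ln in lines for c in (ln.x0, ln.x1)]
        ys = [c for ln in lines for c in (ln.y0, ln.y1)]
        area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        if area >= frame_min * page_w * page_h:
            return False
    return True
```

A real converter would probably combine this with other signals, such as whether any text falls inside the frame.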

Step 3. Find intersections efficiently

For each horizontal-vertical pair the converter checks for an intersection, with a small tolerance because PDF lines often miss meeting by a fraction of a point. Brute force scales badly: testing every line against every other on a 100-line page is on the order of 10,000 pair checks, and a 50-page document is half a million. The standard fix is a geometric index (interval tree, sweep-line, or KD-tree) so that for each horizontal line only the verticals that could possibly intersect it are considered.

The output is a cloud of candidate corners scattered across the page.
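To make the pruning idea concrete, here is a small sketch that uses a sorted list plus binary search rather than a full interval tree or sweep-line. It is a simpler stand-in, not the structure any particular converter uses. The tuple layouts, the `find_corners` name, and the 1 pt default tolerance are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


def find_corners(horizontals, verticals, tol=1.0):
    """horizontals: (y, x_start, x_end); verticals: (x, y_start, y_end).

    Returns the sorted (x, y) points where a horizontal and a vertical
    meet, allowing each line to fall short by up to `tol` points.
    """
    verticals = sorted(verticals)  # sorted by x
    xs = [v[0] for v in verticals]
    corners = set()
    for y, x_start, x_end in horizontals:
        # Binary search narrows the scan to verticals whose x lies inside
        # the horizontal's extent, widened by the tolerance.
        lo = bisect_left(xs, x_start - tol)
        hi = bisect_right(xs, x_end + tol)
        for x, y_start, y_end in verticals[lo:hi]:
            if y_start - tol <= y <= y_end + tol:
                corners.add((x, y))
    return sorted(corners)
```

This only prunes along one axis; an interval tree or sweep-line would also prune on the vertical extent, which matters on pages with many short lines.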

Step 4. Reconstruct the grid

From the cloud:

  1. Cluster intersection points by X coordinate; nearby Xs collapse into one column boundary.
  2. Cluster by Y; nearby Ys collapse into one row boundary.
  3. The clusters define an N × M grid.

The cluster-merge tolerance is the dial. Two verticals 3 pt apart: one column edge with rendering jitter, or two narrow columns? Most converters merge within 2–4 pt. Too tight and jitter splits a single edge into several spurious boundaries, producing phantom sliver columns; too loose and real narrow columns disappear.
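The clustering step can be sketched as a one-dimensional pass over sorted coordinates. The function names and the 3 pt default are illustrative assumptions, and representing each cluster by its mean is one choice among several.

```python
def cluster_1d(values, tol=3.0):
    """Collapse coordinates that lie within `tol` of their neighbor."""
    clusters = []
    for v in sorted(values):
        # Compare against the last member of the current cluster.
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    # Represent each cluster by its mean coordinate.
    return [sum(c) / len(c) for c in clusters]


def build_grid(corners, tol=3.0):
    """corners: iterable of (x, y). Returns (column_edges, row_edges)."""
    corners = list(corners)
    cols = cluster_1d([x for x, _ in corners], tol)
    rows = cluster_1d([y for _, y in corners], tol)
    return cols, rows
```

Because each value is compared with its nearest neighbor, a long chain of closely spaced points can drift into one wide cluster. Capping the cluster width would be one way to guard against that.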

Step 5. Bind text to cells

For each text fragment, find the grid cell whose rectangle contains its lower-left corner. Multi-line cells need their row boundary detected correctly — if a row separator is missed, two visual rows become one cell with two stacked lines.
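A sketch of the lookup, assuming the grid is given as sorted lists of column and row boundaries and each fragment as `(x, y, text)` with `(x, y)` its lower-left corner. The names are hypothetical. Row indices here follow the coordinate order of `rows`, so with PDF's bottom-up y axis row 0 is the bottom row.

```python
from bisect import bisect_right


def locate(boundaries, value):
    """Index i such that boundaries[i] <= value < boundaries[i + 1], else None."""
    i = bisect_right(boundaries, value) - 1
    return i if 0 <= i < len(boundaries) - 1 else None


def bind_text(fragments, cols, rows):
    """Map (row, col) to the list of texts whose lower-left corner falls
    inside that cell. Fragments outside the grid are skipped."""
    cells = {}
    for x, y, text in fragments:
        c, r = locate(cols, x), locate(rows, y)
        if c is not None and r is not None:
            cells.setdefault((r, c), []).append(text)
    return cells
```

Usage, with a 3 × 3 grid:

```python
cols = [0, 100, 200, 300]
rows = [0, 50, 100, 150]
cells = bind_text([(10, 110, "Name"), (110, 60, "42")], cols, rows)
```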

Cell-internal alignment is read from the position of runs relative to the cell rectangle. If all runs sit flush against the right edge, the cell is right-aligned. The Word output gets the corresponding paragraph alignment property.
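One way to read alignment from geometry is to compare the gap on each side of every run. This is a rough heuristic sketch, not a description of any particular converter; the function name, the `(x_start, x_end)` run format, and the 2 pt tolerance are assumptions.

```python
def guess_alignment(runs, cell_left, cell_right, tol=2.0):
    """runs: (x_start, x_end) for each text line in the cell.

    Returns "left", "center", or "right".
    """
    gaps = [(x0 - cell_left, cell_right - x1) for x0, x1 in runs]
    if not gaps:
        return "left"
    # Equal gaps on both sides of every run suggest centering.
    if all(abs(left - right) <= tol for left, right in gaps):
        return "center"
    rights = [right for _, right in gaps]
    # A constant right gap that is smaller than every left gap suggests
    # the runs are flush right.
    if max(rights) - min(rights) <= tol and all(r < l for l, r in gaps):
        return "right"
    return "left"
```

A cell with a single short run is inherently ambiguous, so a converter might fall back to the alignment of other cells in the same column.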

Step 6. Detect merged cells

A merged cell is a region where the expected internal separator is missing. For example, the grid has three columns but no vertical lines run through the top row: the header is merged across all three. The detection rule: for every potential cell, check whether its internal separators exist. If they do not, merge with the neighbors out to the next existing borders.
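For horizontal merges, the rule can be sketched per row: find which interior column boundaries actually have a vertical line crossing the row, then close a cell only at those boundaries. The names, the `(x, y_start, y_end)` tuple layout, and the tolerance are illustrative assumptions. Vertical merges would follow the same pattern with rows and columns swapped.

```python
def separators_in_row(verticals, cols, row_bottom, row_top, tol=1.0):
    """Indices of interior column boundaries that have a vertical line
    spanning the full height of the row."""
    present = set()
    for x, y_start, y_end in verticals:
        if y_start - tol <= row_bottom and y_end + tol >= row_top:
            for i in range(1, len(cols) - 1):
                if abs(cols[i] - x) <= tol:
                    present.add(i)
    return present


def horizontal_spans(n_cols, present):
    """Return (start_col, span) for each cell in a row."""
    spans, start = [], 0
    for b in range(1, n_cols + 1):
        # The right table edge (b == n_cols) always closes the last cell.
        if b == n_cols or b in present:
            spans.append((start, b - start))
            start = b
    return spans
```

For the three-column header case, no interior separator is present in the top row, so `horizontal_spans(3, set())` yields a single cell spanning all three columns.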

In Word, a horizontal merge becomes <w:gridSpan w:val="3"/>. A vertical merge uses <w:vMerge/>: the first cell of the merge gets <w:vMerge w:val="restart"/>, the continuing cells get an empty <w:vMerge/>.

Step 7. Validate before writing

Before emitting the table, check three things:

Failed candidates revert to ordinary paragraphs.

Word’s hard limit. Word allows at most 63 columns in one table. A wider PDF table has to be split — usually along logical column groups. The result is rarely pretty.
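As a baseline, the split can be computed as fixed-width column ranges. This naive sketch ignores logical column groups, which a real converter would want to respect; the function and constant names are hypothetical.

```python
MAX_WORD_COLUMNS = 63  # Word's per-table column limit


def split_columns(n_cols, limit=MAX_WORD_COLUMNS):
    """Return half-open (start, end) column ranges, each at most `limit` wide."""
    return [(s, min(s + limit, n_cols)) for s in range(0, n_cols, limit)]
```

A 150-column table would become three tables covering columns 0–62, 63–125, and 126–149.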

Step 8. Serialize

The Word XML uses <w:tbl> containing column widths in DXA (twentieths of a point), row heights, <w:tblBorders>, background fills, and per-cell content alignment. Each cell is <w:tc> containing <w:p> paragraphs.
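A minimal sketch of the serialization using Python's standard `xml.etree.ElementTree`. It emits only the table skeleton (grid widths, rows, cells, and horizontal spans) and leaves out borders, fills, row heights, and vertical merges. The helper names and the `(text, span)` row format are assumptions for illustration, and the result is a fragment, not a complete .docx part.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def pt_to_dxa(points):
    """Convert points to DXA (twentieths of a point)."""
    return round(points * 20)


def build_table(col_widths_pt, rows):
    """rows: list of rows, each a list of (text, span) tuples."""
    tbl = ET.Element(w("tbl"))
    ET.SubElement(tbl, w("tblPr"))  # table properties left empty here
    grid = ET.SubElement(tbl, w("tblGrid"))
    for width in col_widths_pt:
        ET.SubElement(grid, w("gridCol"), {w("w"): str(pt_to_dxa(width))})
    for row in rows:
        tr = ET.SubElement(tbl, w("tr"))
        for text, span in row:
            tc = ET.SubElement(tr, w("tc"))
            if span > 1:
                # A horizontal merge is expressed as a grid span on the cell.
                tc_pr = ET.SubElement(tc, w("tcPr"))
                ET.SubElement(tc_pr, w("gridSpan"), {w("val"): str(span)})
            p = ET.SubElement(tc, w("p"))
            r = ET.SubElement(p, w("r"))
            ET.SubElement(r, w("t")).text = text
    return tbl
```

Usage:

```python
tbl = build_table([100, 100, 100],
                  [[("Header", 3)], [("a", 1), ("b", 1), ("c", 1)]])
xml = ET.tostring(tbl, encoding="unicode")
```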

Where good algorithms break

What the numbers look like

On simple tabular PDFs — standard invoices, reports with explicit borders — competent converters reach roughly 90% structural accuracy. On hard cases — magazine grids, borderless tables, tables that span page breaks — accuracy collapses to 30–50%. These are practitioner estimates, not measured benchmarks.