PDF → Word — Tables — the hardest part
A table in a PDF is a stack of horizontal and vertical lines, plus text fragments at coordinates that happen to fall inside the rectangles those lines suggest. The PDF file contains no notion of row, column, or cell. Reconstructing structure from geometry is layout reverse-engineering, and quality across tools varies by an order of magnitude.
What lives inside a “table” region
A table region contains:
- Text runs — cell contents at coordinates.
- Graphical paths — separator lines, mostly horizontal and vertical, occasionally diagonal.
- Sometimes — fills — colored rectangles for zebra striping or cell highlights.
There is no link from a line to the text near it, and no explicit grid.
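As a rough illustration of that raw material, the sketch below models a table region as three unrelated lists. The class and field names (`TextRun`, `PathSegment`, `Fill`, `TableRegion`) are hypothetical and do not come from any particular PDF library; coordinates are assumed to be in points with the origin at the bottom-left of the page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRun:
    """A text fragment and its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


@dataclass(frozen=True)
class PathSegment:
    """One stroked line segment (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    width: float  # stroke width in points


@dataclass(frozen=True)
class Fill:
    """A filled rectangle, e.g. a zebra stripe (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class TableRegion:
    """Everything the converter gets: three flat lists and nothing else."""
    runs: list = field(default_factory=list)
    paths: list = field(default_factory=list)
    fills: list = field(default_factory=list)
```

Note what is absent: no row, column, or cell field, and nothing that ties a `TextRun` to a `PathSegment`. Everything else has to be inferred.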
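The converter’s job breaks down into seven tasks; the walkthrough below adds a sanity check after the first one, for eight steps in all: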
- Identify which lines are table separators (versus decoration).
- Compute their intersections — candidate cell corners.
- Build a grid from the corners.
- Bind each text run to the cell it falls inside.
- Detect merged cells.
- Validate that the result really is a table.
- Serialize it as <w:tbl> in the Word XML.
Each step has failure modes that compound.
Step 1. Filter line candidates
The same drawing primitives produce underlines beneath headings, signature rules in forms, illustration frames, chart gridlines, and parts of vector logos. The converter filters by:
- Thickness. Real table lines are thin — 0.25 to 1 pt. Anything above ~2 pt is almost always decoration. The cutoff is tool-specific; there is no universal rule.
- Color. Decorative lines are often colored; table lines are nearly always black or dark gray. Colored lines get discarded or sent through special handling.
- Orientation. Only horizontal and vertical lines are candidates. Diagonals are not table edges.
- Length. Sub-2pt segments are usually rendering artifacts.
Lines that survive filtering become candidates.
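The four filters can be sketched as a single predicate. This is a minimal sketch, not any tool's actual rule set: the `Segment` type and every threshold constant are assumptions chosen to match the ranges quoted above, and a real converter would tune them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A stroked line segment (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    width: float   # stroke width in points
    rgb: tuple     # (r, g, b), each channel in 0..1


# All thresholds below are illustrative assumptions.
MAX_WIDTH_PT = 2.0        # thicker strokes are treated as decoration
MIN_LENGTH_PT = 2.0       # shorter strokes are treated as artifacts
AXIS_TOLERANCE_PT = 0.5   # allowed drift for "horizontal" or "vertical"
MAX_CHANNEL_SPREAD = 0.1  # how far from neutral gray a color may be
MAX_BRIGHTNESS = 0.5      # black or dark gray only


def is_table_candidate(seg: Segment) -> bool:
    dx = abs(seg.x1 - seg.x0)
    dy = abs(seg.y1 - seg.y0)

    # Orientation: keep only near-horizontal or near-vertical strokes.
    if dx > AXIS_TOLERANCE_PT and dy > AXIS_TOLERANCE_PT:
        return False
    # Length: drop tiny fragments.
    if max(dx, dy) < MIN_LENGTH_PT:
        return False
    # Thickness: drop heavy rules.
    if seg.width > MAX_WIDTH_PT:
        return False
    # Color: drop anything that is not black or dark gray.
    r, g, b = seg.rgb
    if max(r, g, b) - min(r, g, b) > MAX_CHANNEL_SPREAD:
        return False
    if max(r, g, b) > MAX_BRIGHTNESS:
        return False
    return True
```

This version discards colored lines outright; the "special handling" path mentioned above would replace the color check with a softer rule.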
Step 2. Sanity-check the candidate set
Two lines do not make a table. Three lines (one horizontal and two verticals) can at most suggest a single-cell box, the absolute minimum. A frame of four lines that covers 50–90% of the page area is probably a page border or a large form field, not a table; ignore it.
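A minimal sketch of this check, under some assumptions: horizontals are given as `(y, x_start, x_end)` tuples and verticals as `(x, y_start, y_end)`, the function name is hypothetical, and the 50–90% band is taken literally from the text.

```python
def passes_sanity_check(horizontals, verticals, page_width, page_height):
    """Return False for candidate sets that cannot plausibly be a table.

    horizontals: list of (y, x_start, x_end)
    verticals:   list of (x, y_start, y_end)
    """
    # Fewer than three lines cannot even suggest a single-cell box.
    if len(horizontals) + len(verticals) < 3:
        return False
    # Without both orientations there are no intersections to work with.
    if not horizontals or not verticals:
        return False

    # A lone four-line frame covering much of the page is treated as a
    # page border or large form field.
    if len(horizontals) == 2 and len(verticals) == 2:
        xs = [x for x, _, _ in verticals]
        ys = [y for y, _, _ in horizontals]
        frame_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        coverage = frame_area / (page_width * page_height)
        if 0.5 <= coverage <= 0.9:
            return False
    return True
```

For a US Letter page (612 × 792 pt), a frame inset 50 pt from each edge covers about 73% of the page and is rejected, while a 100 × 50 pt box passes.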
Step 3. Find intersections efficiently
For each horizontal-vertical pair the converter checks for an intersection, with a small tolerance because PDF lines often miss meeting by a fraction of a point. Brute force scales badly: a page with 100 horizontals and 100 verticals is 10,000 pairs, and 50 such pages is half a million. The standard fix is a geometric index (interval tree, sweep-line, or KD-tree) so that for each horizontal line only the verticals that could possibly intersect it are considered.
The output is a cloud of candidate corners scattered across the page.
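The sketch below uses the simplest possible index, a sorted list of vertical x-positions searched with `bisect`, as a stand-in for the interval tree or sweep-line a production converter might use. It prunes on x only and then checks the y-range directly. The tuple layouts and the 1 pt default tolerance are assumptions.

```python
from bisect import bisect_left, bisect_right


def find_corners(horizontals, verticals, tol=1.0):
    """Return candidate corners as (x, y) tuples.

    horizontals: list of (y, x_start, x_end) with x_start <= x_end
    verticals:   list of (x, y_start, y_end) with y_start <= y_end
    tol:         how far two lines may miss each other and still count
    """
    verticals = sorted(verticals)        # sort by x
    xs = [x for x, _, _ in verticals]    # parallel list for bisect
    corners = []
    for y, x_start, x_end in horizontals:
        # Only verticals whose x falls within the horizontal's extent
        # (plus tolerance) can possibly intersect it.
        lo = bisect_left(xs, x_start - tol)
        hi = bisect_right(xs, x_end + tol)
        for x, y_start, y_end in verticals[lo:hi]:
            if y_start - tol <= y <= y_end + tol:
                corners.append((x, y))
    return corners
```

Each horizontal now costs a binary search plus the verticals actually inside its x-range, instead of a scan over every vertical on the page.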
Step 4. Reconstruct the grid
From the cloud:
- Cluster intersection points by X coordinate; nearby Xs collapse into one column boundary.
- Cluster by Y; nearby Ys collapse into one row boundary.
- The clusters define an N × M grid.
The cluster-merge tolerance is the dial. Two verticals 3 pt apart: one column edge with rendering jitter, or two narrow columns? Most converters merge within 2–4 pt. Too tight and narrow columns split into several; too loose and real narrow columns disappear.
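A minimal sketch of the clustering step, assuming a simple gap-based rule: sort the coordinates and start a new cluster whenever the gap to the previous value exceeds the tolerance. The function name and the 3 pt default are illustrative, and real converters may cluster differently.

```python
def cluster_coordinates(values, tol=3.0):
    """Collapse nearby coordinates into boundaries (cluster means).

    A new cluster starts when the gap to the previous sorted value
    exceeds tol. Note that this chains: a long run of values each
    slightly under tol apart ends up in one cluster.
    """
    boundaries = []
    group = []
    for v in sorted(values):
        if group and v - group[-1] > tol:
            boundaries.append(sum(group) / len(group))
            group = []
        group.append(v)
    if group:
        boundaries.append(sum(group) / len(group))
    return boundaries


xs = [100.0, 101.0, 99.0, 200.0, 203.5]
print(cluster_coordinates(xs, tol=3.0))  # [100.0, 200.0, 203.5]
print(cluster_coordinates(xs, tol=4.0))  # [100.0, 201.75]
```

The two calls show the dial in action: at 3 pt the lines at 200 and 203.5 stay separate and form a 3.5 pt column, while at 4 pt they merge into one edge.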
Step 5. Bind text to cells
For each text fragment, find the grid cell whose rectangle contains its lower-left corner. Multi-line cells need their row boundary detected correctly — if a row separator is missed, two visual rows become one cell with two stacked lines.
Cell-internal alignment is read from the position of runs relative to the cell rectangle. All runs flushed right? The cell is right-aligned. The Word output gets the corresponding paragraph alignment property.
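Both parts of this step can be sketched with a binary search over the grid boundaries and a comparison of gaps inside the cell. The assumptions here: boundaries are sorted ascending, PDF y grows upward so the top row gets index 0 after flipping, and the function names and tolerance are hypothetical.

```python
from bisect import bisect_right


def locate_cell(x, y, col_edges, row_edges):
    """Return (row, col) for the point (x, y), or None if outside the grid.

    col_edges, row_edges: ascending boundary coordinates. PDF y grows
    upward, so the row index is flipped to make row 0 the top row.
    """
    col = bisect_right(col_edges, x) - 1
    band = bisect_right(row_edges, y) - 1
    n_cols = len(col_edges) - 1
    n_rows = len(row_edges) - 1
    if 0 <= col < n_cols and 0 <= band < n_rows:
        return (n_rows - 1 - band, col)
    return None


def guess_alignment(runs, cell_left, cell_right, tol=1.5):
    """Guess 'left', 'center', or 'right' from runs given as (x0, x1)."""
    if not runs:
        return "left"
    left_gaps = [x0 - cell_left for x0, _ in runs]
    right_gaps = [cell_right - x1 for _, x1 in runs]
    if all(abs(l - r) <= tol for l, r in zip(left_gaps, right_gaps)):
        return "center"
    left_spread = max(left_gaps) - min(left_gaps)
    right_spread = max(right_gaps) - min(right_gaps)
    if right_spread <= tol < left_spread:
        return "right"   # right edges line up, left edges do not
    if left_spread <= tol < right_spread:
        return "left"
    # Ambiguous (for example a single run): use the nearer cell edge.
    return "right" if max(right_gaps) < min(left_gaps) else "left"
```

`locate_cell` would be called with a run's lower-left corner, as described above. A point on the outer right or top boundary falls outside the grid in this sketch.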
Step 6. Detect merged cells
A merged cell is a region where the expected internal separator is missing. Top row has three columns, but no vertical lines run inside that row — the header is merged across all three. The detection rule: for every potential cell, check whether its internal separators exist. If they do not, merge with the neighbors out to the next existing borders.
In Word, a horizontal merge becomes <w:gridSpan w:val="3"/>. A vertical merge uses <w:vMerge/>: the first cell of the merge gets <w:vMerge w:val="restart"/>, the continuing cells get an empty <w:vMerge/>.
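The detection rule for horizontal merges might look like the sketch below: walk the interior column boundaries of one row and extend the current cell whenever no vertical line covers that boundary. The function names, tuple layout, and tolerance are assumptions; vertical merges would follow the same pattern with rows and horizontal lines swapped in.

```python
def has_vertical_separator(x, y_bottom, y_top, verticals, tol=1.0):
    """True if some vertical (vx, vy0, vy1) sits at x and spans the row."""
    return any(
        abs(vx - x) <= tol and vy0 <= y_bottom + tol and vy1 >= y_top - tol
        for vx, vy0, vy1 in verticals
    )


def horizontal_spans(col_edges, y_bottom, y_top, verticals, tol=1.0):
    """Return the column span of each cell in one row, left to right.

    col_edges: ascending column boundaries, outer edges included.
    A missing interior separator extends the current cell by one column.
    """
    spans = [1]
    for x in col_edges[1:-1]:
        if has_vertical_separator(x, y_bottom, y_top, verticals, tol):
            spans.append(1)
        else:
            spans[-1] += 1
    return spans
```

For a three-column grid whose interior verticals stop below the header row, the header yields `[3]` and a body row yields `[1, 1, 1]`. A span of 3 is what would be written as the `w:val` of `<w:gridSpan>`.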
Step 7. Validate before writing
Before emitting the table, check three things:
- Grid regularity. Every row should have the same column count after accounting for merges.
- Fill rate. If 80% or more of the cells are empty, the line detection probably picked up decoration as a grid; throw the candidate away.
- Reasonable dimensions. A 1000×1000 grid on one page is an artifact, not a real table.
Failed candidates revert to ordinary paragraphs.
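The three checks can be sketched as one function. Assumptions: each row is a list of `(text, span)` pairs where `span` is the cell's column span, the 80% threshold comes from the text above, and the dimension limits (`max_rows`, `max_cols`) are placeholder values rather than known cutoffs from any tool.

```python
def validate_table(rows, max_empty_ratio=0.8, max_rows=500, max_cols=100):
    """Return True if the reconstructed grid looks like a real table.

    rows: list of rows, each a list of (text, span) pairs.
    """
    if not rows:
        return False

    # Grid regularity: every row must cover the same number of columns
    # once spans are counted.
    widths = {sum(span for _, span in row) for row in rows}
    if len(widths) != 1:
        return False
    n_cols = widths.pop()
    if n_cols == 0:
        return False

    # Reasonable dimensions (the limits here are assumptions).
    if len(rows) > max_rows or n_cols > max_cols:
        return False

    # Fill rate: a mostly empty grid is probably decoration.
    cells = [text for row in rows for text, _ in row]
    empty = sum(1 for text in cells if not text.strip())
    if empty / len(cells) >= max_empty_ratio:
        return False
    return True
```

A caller would emit the table when this returns `True` and fall back to ordinary paragraphs otherwise.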
Word’s hard limit. Word allows at most 63 columns in one table. A wider PDF table has to be split — usually along logical column groups. The result is rarely pretty.
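One way to split along column groups is greedy packing: keep adding whole groups to the current piece until the next one would exceed 63 columns. This is a sketch under assumptions, not a known converter's strategy; how the groups are identified is left open, and the function name is hypothetical.

```python
WORD_MAX_COLUMNS = 63


def split_column_groups(group_sizes, limit=WORD_MAX_COLUMNS):
    """Pack logical column groups into pieces of at most `limit` columns.

    group_sizes: number of columns in each logical group, left to right.
    Returns (start, end) column ranges, end exclusive.
    """
    ranges = []
    start = 0   # first column of the piece being built
    width = 0   # columns in the piece so far
    for size in group_sizes:
        # A group wider than the limit cannot stay whole; cut it hard.
        while size > limit:
            if width:
                ranges.append((start, start + width))
                start += width
                width = 0
            ranges.append((start, start + limit))
            start += limit
            size -= limit
        # Close the current piece if this group would overflow it.
        if width + size > limit:
            ranges.append((start, start + width))
            start += width
            width = 0
        width += size
    if width:
        ranges.append((start, start + width))
    return ranges
```

Three groups of 30 columns become two tables, columns 0–59 and 60–89, so no group is cut in half.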
Step 8. Serialize
The Word XML uses <w:tbl> containing column widths in DXA (twentieths of a point), row heights, <w:tblBorders>, background fills, and per-cell content alignment. Each cell is <w:tc> containing <w:p> paragraphs.
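A minimal serialization sketch using Python's standard `xml.etree.ElementTree`. It emits only the skeleton: an empty `<w:tblPr>`, a `<w:tblGrid>` with column widths in DXA, and rows of cells holding one paragraph each. Borders, fills, row heights, alignment, and merges are omitted, and the helper names (`w`, `points_to_dxa`, `build_table`) are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def points_to_dxa(points):
    """Convert points to DXA (twentieths of a point)."""
    return round(points * 20)


def build_table(col_widths_pt, rows):
    """Build a bare <w:tbl> element.

    col_widths_pt: column widths in points.
    rows: list of rows, each a list of cell strings.
    """
    tbl = ET.Element(w("tbl"))
    ET.SubElement(tbl, w("tblPr"))  # table properties left empty here
    grid = ET.SubElement(tbl, w("tblGrid"))
    for width in col_widths_pt:
        ET.SubElement(grid, w("gridCol"), {w("w"): str(points_to_dxa(width))})
    for row in rows:
        tr = ET.SubElement(tbl, w("tr"))
        for text in row:
            tc = ET.SubElement(tr, w("tc"))
            p = ET.SubElement(tc, w("p"))
            r = ET.SubElement(p, w("r"))
            t = ET.SubElement(r, w("t"))
            t.text = text
    return tbl


xml = ET.tostring(build_table([72, 36], [["Item", "Qty"]]), encoding="unicode")
```

A 72 pt (one inch) column comes out as `w:w="1440"`. The resulting element would still need to be placed inside a document body and packaged as a .docx, which is outside this sketch.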
Where good algorithms break
- Borderless tables. Data is aligned by columns but no separator lines are drawn. Line-based detection produces nothing; you need text-alignment-based detection, which is a separate, weaker algorithm. Covered in detail in the PDF→Excel series.
- Partial borders. Top and bottom horizontals only, no verticals. The intersection cloud is degenerate and the row binding is fragile.
- Nested tables. A table inside a cell. The intersection algorithm tries to merge the inner grid into the outer one and produces nonsense.
- Rotated headers. 90°-rotated column headers are common in financial reports. The text fails to bind to the cells because its bounding box does not lie flat.
- Large text-heavy cells. When a single cell contains a paragraph of body text, the converter can lose track of where the cell ends and the next one begins.
What the numbers look like
On simple tabular PDFs — standard invoices, reports with explicit borders — competent converters reach roughly 90% structural accuracy. On hard cases — magazine grids, borderless tables, tables that span page breaks — accuracy collapses to 30–50%. These are practitioner estimates, not measured benchmarks.