all2md.diff

Document comparison and diff functionality.

This module provides tools for comparing documents in different formats and generating unified diffs, similar to the Unix diff command but supporting any document format (PDF, DOCX, HTML, etc.).

Key Features

  • Cross-format document comparison (PDF vs DOCX, etc.)

  • Text-based comparison using Python’s difflib (guaranteed symmetric)

  • Multiple output formats (unified diff, HTML visual, JSON)

  • Optional whitespace normalization

  • Produces unified diff output in the style of Unix diff, for any supported document format

Examples

Compare two documents and get a unified diff:

>>> from all2md.diff import compare_files
>>> diff_lines = compare_files("report_v1.pdf", "report_v2.pdf")
>>> for line in diff_lines:
...     print(line)

Compare with whitespace normalization:

>>> diff_lines = compare_files("doc1.docx", "doc2.docx", ignore_whitespace=True)

Render as HTML:

>>> from all2md.diff.renderers import HtmlDiffRenderer
>>> diff_lines = compare_files("doc1.pdf", "doc2.pdf")
>>> renderer = HtmlDiffRenderer()
>>> html = renderer.render(diff_lines)

class all2md.diff.DiffResult

Bases: object

Bundle diff sequences for multiple renderers.

Instances behave like an iterator over unified diff lines so existing callers can continue to iterate directly, while renderers that need richer structure (HTML/JSON) can introspect operations or raw sequences.

Store the precomputed diff sequences and metadata.

Parameters:
  • old_lines (list of str) – Extracted text lines from the original document.

  • new_lines (list of str) – Extracted text lines from the updated document.

  • old_label (str) – Label that should appear in the diff header for the original file.

  • new_label (str) – Label that should appear in the diff header for the updated file.

  • context_lines (int) – Number of context lines to include when rendering unified diffs.

  • granularity (Granularity) – Tokenisation level used to build old_lines and new_lines.

__init__(old_lines: list[str], new_lines: list[str], *, old_label: str, new_label: str, context_lines: int, granularity: Literal['block', 'sentence', 'word']) None

iter_unified_diff(context_lines: int | None = None) Iterator[str]

Yield unified diff lines using the cached sequences.

iter_operations() Iterator[DiffOp]

Yield SequenceMatcher operations for structured renderers.
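
The dual interface described for DiffResult can be sketched with the standard library alone. The class below is a hypothetical stand-in: `MiniDiffResult` is not part of all2md, the real `DiffOp` type is not documented here, and the fallback behaviour of `context_lines=None` is an assumption. It only illustrates how one object can be iterated as unified diff lines while also exposing `difflib.SequenceMatcher` opcodes to structured renderers.

```python
import difflib
from typing import Iterator


class MiniDiffResult:
    """Hypothetical stand-in for DiffResult, built only on difflib."""

    def __init__(self, old_lines, new_lines, *, old_label="old",
                 new_label="new", context_lines=3):
        self.old_lines = list(old_lines)
        self.new_lines = list(new_lines)
        self.old_label = old_label
        self.new_label = new_label
        self.context_lines = context_lines

    def iter_unified_diff(self, context_lines=None) -> Iterator[str]:
        # Assumption: None falls back to the value stored at construction.
        n = self.context_lines if context_lines is None else context_lines
        return difflib.unified_diff(
            self.old_lines, self.new_lines,
            fromfile=self.old_label, tofile=self.new_label,
            n=n, lineterm="",
        )

    def iter_operations(self):
        # Each opcode is a tuple (tag, i1, i2, j1, j2), where tag is one of
        # 'equal', 'replace', 'delete' or 'insert'.
        matcher = difflib.SequenceMatcher(
            a=self.old_lines, b=self.new_lines, autojunk=False
        )
        return iter(matcher.get_opcodes())

    def __iter__(self):
        # Iterating the object directly yields unified diff lines.
        return self.iter_unified_diff()


result = MiniDiffResult(
    ["alpha", "beta", "gamma"],
    ["alpha", "BETA", "gamma"],
    old_label="v1", new_label="v2",
)
unified = list(result)
operations = list(result.iter_operations())
```

Keeping the raw sequences on the object means each view is computed on demand, so a plain-text caller pays nothing for the structured opcodes it never requests.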

all2md.diff.compare_documents(old_doc: Document, new_doc: Document, old_label: str = 'old', new_label: str = 'new', context_lines: int = 3, ignore_whitespace: bool = False, granularity: Literal['block', 'sentence', 'word'] = 'block') DiffResult

Compare two document ASTs and generate a unified diff.

This function extracts plain text lines from both documents and uses Python’s difflib.unified_diff() to generate a standard unified diff. The result is guaranteed to be symmetric: comparing A to B produces the exact opposite of comparing B to A (with +/- swapped).
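
As a rough illustration of the symmetry property, the sketch below calls `difflib.unified_diff()` directly on two small, made-up lists of lines, with no all2md code involved. The helper `body` is hypothetical. For this input the reverse diff contains the same changed lines with + and - swapped; the order inside a changed hunk differs, because removals are always listed before additions.

```python
import difflib

old = ["Title", "First paragraph.", "Closing remarks."]
new = ["Title", "First paragraph, revised.", "Closing remarks."]


def body(a, b):
    """Return unified diff lines without the ---/+++ and @@ header lines."""
    lines = list(
        difflib.unified_diff(a, b, fromfile="old", tofile="new", n=3, lineterm="")
    )
    # The first two lines are the file headers; hunk headers start with "@@".
    return [ln for ln in lines[2:] if not ln.startswith("@@")]


forward = body(old, new)
backward = body(new, old)

# Flip the leading marker on each changed line of the reverse diff.
swap = {"+": "-", "-": "+"}
flipped = [swap.get(ln[0], ln[0]) + ln[1:] for ln in backward]
```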

Parameters:
  • old_doc (Document) – Original document AST

  • new_doc (Document) – New document AST

  • old_label (str, default = "old") – Label for old version in diff header

  • new_label (str, default = "new") – Label for new version in diff header

  • context_lines (int, default = 3) – Number of context lines to show around changes

  • ignore_whitespace (bool, default = False) – If True, normalize whitespace before comparison

  • granularity ({'block', 'sentence', 'word'}, default = 'block') – Tokenisation level used when extracting text from each document.

Returns:

Diff result encapsulating sequences and render helpers

Return type:

DiffResult
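
The ignore_whitespace and granularity options can be pictured with the following minimal sketch. Both helpers (`normalize_whitespace` and `tokenize`) are hypothetical and use deliberately naive rules; how all2md actually normalizes whitespace or splits sentences is not specified here.

```python
import re


def normalize_whitespace(line: str) -> str:
    # Collapse runs of spaces, tabs and newlines to one space; trim both ends.
    return " ".join(line.split())


def tokenize(block: str, granularity: str = "block") -> list[str]:
    """Split one text block into the units that will be compared."""
    if granularity == "block":
        return [block]
    if granularity == "sentence":
        # Naive rule: split on whitespace that follows '.', '!' or '?'.
        return [s for s in re.split(r"(?<=[.!?])\s+", block) if s]
    if granularity == "word":
        return block.split()
    raise ValueError(f"unknown granularity: {granularity!r}")


text = "The  report was\tupdated. Figures changed!"
clean = normalize_whitespace(text)
sentences = tokenize(clean, "sentence")
words = tokenize(clean, "word")
```

Finer granularity yields more, shorter units, so a one-word edit shows up as a small change instead of a whole replaced block.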

all2md.diff.compare_files(old_path: str | Path, new_path: str | Path, old_label: str | None = None, new_label: str | None = None, context_lines: int = 3, ignore_whitespace: bool = False, granularity: Literal['block', 'sentence', 'word'] = 'block') DiffResult

Compare two document files and generate a unified diff.

This is a convenience wrapper that loads documents from files, converts them to AST, and compares them using compare_documents().

Parameters:
  • old_path (str or Path) – Path to original document file

  • new_path (str or Path) – Path to new document file

  • old_label (str, optional) – Label for old version (defaults to filename)

  • new_label (str, optional) – Label for new version (defaults to filename)

  • context_lines (int, default = 3) – Number of context lines to show around changes

  • ignore_whitespace (bool, default = False) – If True, normalize whitespace before comparison

  • granularity ({'block', 'sentence', 'word'}, default = 'block') – Tokenisation level used when extracting text from each document.

Returns:

Diff result encapsulating sequences and render helpers

Return type:

DiffResult
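
The label defaults can be sketched as follows. `resolve_label` is a hypothetical helper, and it assumes that "filename" means the final path component instead of the full path.

```python
from pathlib import Path


def resolve_label(path, label=None):
    """Hypothetical helper: use the explicit label, else the file name."""
    # Assumption: the default label is the final path component only.
    return label if label is not None else Path(path).name


default_label = resolve_label("reports/report_v1.pdf")
explicit_label = resolve_label("reports/report_v1.pdf", "draft")
```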

For diff module documentation, see Advanced Modules.