all2md

all2md - A Python document conversion library for bidirectional transformation.

all2md provides a comprehensive solution for converting between various file formats and Markdown. It supports PDF, Word (DOCX), PowerPoint (PPTX), HTML, email (EML), Excel (XLSX), Jupyter Notebooks (IPYNB), EPUB e-books, images, and 200+ text file formats with intelligent content extraction and formatting preservation.

The library uses a modular architecture where the main to_markdown() function automatically detects file types and routes to appropriate specialized parsers. Each converter module handles specific format requirements while maintaining consistent Markdown output with support for tables, images, and complex formatting.

Key Features

Advanced PDF parsing with table detection using PyMuPDF
Word document processing with formatting preservation
PowerPoint slide-by-slide extraction
HTML processing with configurable conversion options
Email chain parsing with attachment handling
Base64 image embedding support
Support for 200+ plaintext file formats
AST-based transformation pipeline for document manipulation
Plugin system for custom transforms via entry points

Supported Formats

Documents: PDF, DOCX, PPTX, HTML, EML, EPUB
Notebooks: IPYNB (Jupyter Notebooks)
Spreadsheets: XLSX, CSV, TSV
Images: PNG, JPEG, GIF (embedded as base64)
Text: 200+ formats including code files, configs, markup

Requirements

Python 3.10+
Optional dependencies loaded per format (PyMuPDF, python-docx, etc.)

Examples

Basic usage for file conversion:

>>> from all2md import to_markdown
>>> markdown_content = to_markdown('document.pdf')
>>> print(markdown_content)

Using AST transforms to manipulate documents:

>>> from all2md import to_markdown
>>> from all2md.transforms import RemoveImagesTransform, HeadingOffsetTransform
>>>
>>> # Apply transforms during conversion
>>> markdown = to_markdown(
...     'document.pdf',
...     transforms=[
...         RemoveImagesTransform(),
...         HeadingOffsetTransform(offset=1)
...     ]
... )

Working with the AST directly:

>>> from all2md import to_ast
>>> from all2md.transforms import render
>>>
>>> # Convert to AST
>>> doc = to_ast('document.pdf')
>>>
>>> # Apply transforms and render
>>> markdown = render(doc, transforms=['remove-images', 'heading-offset'])

See also

all2md.transforms: AST transformation system
all2md.ast: AST node definitions and utilities

all2md.to_markdown(source: str | Path | IO[bytes] | bytes | Document, *, parser_options: BaseParserOptions | None = None, renderer_options: MarkdownRendererOptions | None = None, options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', flavor: str | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) → str

Convert document to Markdown format with enhanced format detection.

This is the main entry point for the all2md library. It can detect file formats from filenames, content analysis, or explicit format specification, then routes to the appropriate specialized converter for processing.
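The three detection routes named above can be pictured with a small standalone sketch. It is not all2md's detection code: the lookup tables, the `detect_format` helper, the priority order (explicit, then filename, then content), and the `"plaintext"` fallback are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical lookup tables; the library's real rules are not shown in this reference.
EXTENSION_FORMATS = {".pdf": "pdf", ".docx": "docx", ".html": "html", ".md": "markdown"}
MAGIC_PREFIXES = [(b"%PDF-", "pdf"), (b"PK\x03\x04", "zip")]


def detect_format(filename=None, head=b"", explicit="auto"):
    """Resolve a format name: explicit value first, then filename, then content."""
    if explicit != "auto":
        return explicit  # caller named the format, so nothing to detect
    if filename is not None:
        fmt = EXTENSION_FORMATS.get(Path(filename).suffix.lower())
        if fmt is not None:
            return fmt
    for prefix, fmt in MAGIC_PREFIXES:
        if head.startswith(prefix):  # content analysis on the leading bytes
            return fmt
    return "plaintext"  # assumed fallback for this sketch
```

Passing `explicit="html"` here plays the role that `source_format="html"` plays in the real call: it skips detection entirely.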

Parameters:

source (str, Path, IO[bytes|str], bytes, or Document) – Source document data, which can be a file path, a file-like object, raw bytes, or an AST Document object (for cases where you already have a parsed AST).
parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).
renderer_options (MarkdownRendererOptions, optional) – Pre-configured renderer options for Markdown rendering settings.
options (BaseParserOptions, optional) – Deprecated alias for parser_options; use parser_options instead. Cannot be used together with parser_options.
source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.
flavor (str, optional) – Markdown flavor/dialect to use for output. Options: “gfm”, “commonmark”, “multimarkdown”, “pandoc”, “kramdown”, “markdown_plus”. Shorthand for renderer_options=MarkdownRendererOptions(flavor=…).
transforms (list, optional) – List of AST transforms to apply before rendering. Can be transform names (strings) or NodeTransformer instances. Transforms are applied in order. See all2md.transforms for available transforms.
hooks (dict, optional) – Transform hooks to execute during processing. Maps hook names to callable functions that execute at specific points in the transform pipeline.
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour (network allowlists, size limits, etc.). Defaults to None, which disables remote fetching.
kwargs (Any) – Individual conversion options. Kwargs are intelligently split between parser and renderer based on field names. Parser-related kwargs override fields in parser_options, renderer-related kwargs override fields in renderer_options.
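One way to picture the field-name-based kwargs split is the sketch below. It is an illustration under stated assumptions, not the library's implementation: `DemoParserOptions`, `DemoRendererOptions`, their defaults, `split_kwargs`, and the choice to reject unknown names are all hypothetical.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class DemoParserOptions:  # hypothetical stand-in for a parser options class
    pages: Optional[list] = None
    attachment_mode: str = "skip"


@dataclass(frozen=True)
class DemoRendererOptions:  # hypothetical stand-in for a renderer options class
    emphasis_symbol: str = "*"
    flavor: str = "gfm"


def split_kwargs(kwargs, parser_options, renderer_options):
    """Route each kwarg to the options object that declares a field of that name."""
    parser_names = {f.name for f in fields(parser_options)}
    renderer_names = {f.name for f in fields(renderer_options)}
    unknown = set(kwargs) - parser_names - renderer_names
    if unknown:
        # Assumed behaviour for this sketch: refuse names neither side knows.
        raise TypeError(f"unknown options: {sorted(unknown)}")
    parser_kw = {k: v for k, v in kwargs.items() if k in parser_names}
    renderer_kw = {k: v for k, v in kwargs.items() if k in renderer_names}
    # kwargs override the corresponding fields of the pre-configured objects
    return replace(parser_options, **parser_kw), replace(renderer_options, **renderer_kw)
```

With this sketch, `split_kwargs({"pages": [0, 1], "emphasis_symbol": "_"}, DemoParserOptions(), DemoRendererOptions())` sends `pages` to the parser side and `emphasis_symbol` to the renderer side, leaving every other field at its pre-configured value.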

Returns:

Document content converted to Markdown format.

Return type:

str

Raises:

DependencyError – If required dependencies for a specific format are not installed.
ParsingError – If file processing fails due to corruption or format issues.

Examples

Basic conversion:

>>> markdown = to_markdown("document.pdf")

With parser options:

>>> pdf_opts = PdfOptions(pages=[0, 1, 2], attachment_mode="save")
>>> markdown = to_markdown("document.pdf", parser_options=pdf_opts)

With renderer options:

>>> md_opts = MarkdownRendererOptions(emphasis_symbol="_", flavor="commonmark")
>>> markdown = to_markdown("document.pdf", renderer_options=md_opts)

Using both parser and renderer options:

>>> markdown = to_markdown("doc.pdf",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="gfm"))

Using kwargs (automatically split):

>>> markdown = to_markdown("doc.pdf", pages=[0, 1], emphasis_symbol="_")

Using flavor shorthand:

>>> markdown = to_markdown("document.pdf", flavor="commonmark")

With transforms:

>>> markdown = to_markdown("doc.pdf", transforms=["remove-images"])

From AST Document:

>>> ast_doc = to_ast("document.pdf")
>>> # Apply custom processing to ast_doc...
>>> markdown = to_markdown(ast_doc)

all2md.to_ast(source: str | Path | IO[bytes] | bytes, *, parser_options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) → Document

Convert document to AST (Abstract Syntax Tree) format.

This function provides advanced users with direct access to the document AST, enabling custom processing, transformation, and analysis of document structure. The AST can be manipulated using utilities from all2md.ast.transforms and serialized to JSON using all2md.ast.serialization.

Parameters:

source (str, Path, IO[bytes], or bytes) – Source document data, which can be a file path, a file-like object, or raw bytes.
parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).
source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).
kwargs (Any) – Individual parser options that override settings in parser_options.
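A progress callback is simply a callable that accepts one event. The sketch below formats events into a status line. `event_type` and `message` are the fields documented above; the attribute names `current` and `total` are assumptions standing in for the documented "current/total counts", so the code reads them defensively.

```python
def format_progress(event):
    """Build a one-line status string from a progress event."""
    # `current` and `total` are assumed attribute names; getattr keeps the
    # sketch working if an event does not carry them.
    current = getattr(event, "current", None)
    total = getattr(event, "total", None)
    counts = f" {current}/{total}" if current is not None and total else ""
    return f"[{event.event_type}]{counts} {event.message}"


def log_progress(event):
    """Callback suitable for a progress_callback argument: print each event."""
    print(format_progress(event))
```

Under those assumptions it would be passed as `to_ast("document.pdf", progress_callback=log_progress)`.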

Returns:

AST Document node representing the document structure

Return type:

Document

Raises:

FormatError – If the format cannot be detected or is unsupported
DependencyError – If required dependencies for the format are not installed
ParsingError – If conversion fails

Examples

Get AST from a document:

>>> from all2md import to_ast
>>> ast_doc = to_ast("document.pdf")

Manipulate AST and convert to markdown:

>>> from all2md.ast import Image, transforms
>>> from all2md.renderers.markdown import MarkdownRenderer
>>> ast_doc = to_ast("document.pdf")
>>> filtered_doc = transforms.filter_nodes(ast_doc, lambda n: not isinstance(n, Image))
>>> renderer = MarkdownRenderer()
>>> markdown = renderer.render_to_string(filtered_doc)

Extract specific nodes:

>>> from all2md.ast import transforms, Heading
>>> ast_doc = to_ast("document.docx")
>>> headings = transforms.extract_nodes(ast_doc, Heading)

Serialize to JSON:

>>> from all2md.ast import serialization
>>> ast_doc = to_ast("document.html")
>>> json_str = serialization.ast_to_json(ast_doc, indent=2)

all2md.from_ast(ast_doc: Document, target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) → None | str | bytes

Render AST document to a target format.

Parameters:

ast_doc (Document) – AST Document node to render
target_format (DocumentFormat) – Target format name (e.g., “markdown”, “docx”, “pdf”)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content directly. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
renderer_options (BaseRendererOptions, optional) – Renderer options for the target format
transforms (list, optional) – AST transforms to apply before rendering
hooks (dict, optional) – Transform hooks to execute during processing
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path (populated by to_ast for file-based inputs) as the rendering template and clear its body before rendering. This preserves page setup, theme, headers/footers, and custom style definitions from the original document on a docx round-trip. Ignored if no source path is stashed or the caller already specified a template_path.
kwargs (Any) – Additional renderer options that override renderer_options

Returns:

None if output was specified (content written to output)
str if output=None and format is text-based (markdown, html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes
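The return contract above (content when output is None, otherwise None after writing) can be sketched with a small standalone helper. `deliver` is a hypothetical name and the UTF-8 handling is an assumption of this sketch; it only mirrors the documented behaviour and is not the library's code.

```python
import io
from pathlib import Path


def deliver(content, output=None):
    """Return content when output is None; otherwise write it and return None."""
    if output is None:
        return content  # str for text formats, bytes for binary formats
    if isinstance(output, (str, Path)):
        # Path-like destination: pick text or binary mode from the content type.
        if isinstance(content, str):
            Path(output).write_text(content, encoding="utf-8")
        else:
            Path(output).write_bytes(content)
        return None
    # File-like destination: match the content type to the stream type.
    if isinstance(output, io.TextIOBase):
        output.write(content if isinstance(content, str) else content.decode("utf-8"))
    else:
        output.write(content if isinstance(content, bytes) else content.encode("utf-8"))
    return None
```

The design point is that callers can treat a `None` return as "already written" and any other return value as the rendered document itself.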

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_ast(doc, "markdown", output=buffer)  # Returns None, buffer populated
>>> markdown_text = buffer.getvalue()

Examples

Render AST to string (text formats):

>>> ast_doc = to_ast("document.pdf")
>>> markdown_text = from_ast(ast_doc, "markdown")
>>> isinstance(markdown_text, str)
True

Render AST to bytes (binary formats):

>>> pdf_bytes = from_ast(ast_doc, "pdf")
>>> isinstance(pdf_bytes, bytes)
True

Render AST to file:

>>> from_ast(ast_doc, "markdown", output="output.md")

With renderer options:

>>> md_opts = MarkdownRendererOptions(flavor="commonmark")
>>> markdown_text = from_ast(ast_doc, "markdown", renderer_options=md_opts)

all2md.from_markdown(source: str | Path | IO[bytes] | IO[str], target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: MarkdownParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) → None | str | bytes

Convert Markdown content to another format.

Parameters:

source (str, Path, IO[bytes], or IO[str]) – Markdown source content as string, file path, or file-like object
target_format (DocumentFormat) – Target format name (e.g., “docx”, “pdf”, “html”)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
parser_options (MarkdownParserOptions, optional) – Options for parsing Markdown
renderer_options (BaseRendererOptions, optional) – Options for rendering to target format
transforms (list, optional) – AST transforms to apply
hooks (dict, optional) – Transform hooks to execute
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path as a rendering template and clear its body. Only useful when the markdown source was originally derived from a docx file whose path is still available; in that case pass template_path explicitly instead. See from_ast for details.
kwargs (Any) – Additional options split between parser and renderer

Returns:

None if output was specified (content written to output)
str if output=None and format is text-based (html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_markdown("# Title", "html", output=buffer)  # Returns None
>>> html_text = buffer.getvalue()

Examples

Convert markdown string to HTML:

>>> html_text = from_markdown("# Title\n\nContent", "html")
>>> isinstance(html_text, str)
True

Convert markdown to binary format:

>>> pdf_bytes = from_markdown("# Title", "pdf")
>>> isinstance(pdf_bytes, bytes)
True

Convert markdown file to DOCX file:

>>> from_markdown("input.md", "docx", output="output.docx")

With options:

>>> html_content = from_markdown("input.md", "html",
...     parser_options=MarkdownParserOptions(flavor="gfm"),
...     renderer_options=HtmlOptions(...))

all2md.convert(source: str | Path | IO[bytes] | IO[str] | bytes, output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: BaseParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', transforms: list | None = None, hooks: dict | None = None, renderer: str | type | object | None = None, flavor: str | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, preserve_formatting: bool = False, **kwargs: Any) → None | str | bytes

Convert between document formats.

Parameters:

source (str, Path, IO[bytes], IO[str], or bytes) – Source document (file path, file-like object, or content)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
parser_options (BaseParserOptions, optional) – Options for parsing source format
renderer_options (BaseRendererOptions, optional) – Options for rendering target format
source_format (DocumentFormat, default "auto") – Source format (auto-detected if “auto”)
target_format (DocumentFormat, default "auto") – Target format (inferred from output or defaults to “markdown”)
transforms (list, optional) – AST transforms to apply
hooks (dict, optional) – Transform hooks to execute
renderer (str, type, or object, optional) – Custom renderer (overrides target_format)
flavor (str, optional) – Markdown flavor shorthand for renderer_options
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).
preserve_formatting (bool, default False) – When True and the target is "docx" and the source is a docx file, the rendered output uses the source as its template and the source’s body is cleared before rendering. This makes a docx round-trip (e.g. convert("in.docx", "out.docx")) preserve page setup, theme, headers/footers, and custom paragraph styles instead of regenerating a generic-looking document.
kwargs (Any) – Additional options split between parser and renderer

Returns:

None if output was specified (content written to output)
str if output=None and format is text-based (markdown, html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> convert("doc.pdf", output=buffer, target_format="markdown")  # Returns None
>>> markdown_text = buffer.getvalue()

Examples

Convert PDF to markdown:

>>> markdown_text = convert("doc.pdf", target_format="markdown")
>>> isinstance(markdown_text, str)
True

Convert to binary format:

>>> pdf_bytes = convert("input.md", target_format="pdf")
>>> isinstance(pdf_bytes, bytes)
True

Convert with output file:

>>> convert("doc.pdf", "output.md",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="commonmark"))

Bidirectional with transforms:

>>> convert("input.docx", "output.md",
...     transforms=["remove-images", "heading-offset"])

all2md.chunk(source: str | Path | IO[bytes] | bytes, *, strategy: str = 'semantic', max_tokens: int = 512, overlap: int = 0, min_tokens: int = 0, include_preamble: bool = True, heading_merge: bool = True, max_heading_level: int | None = None, avoid_table_split: bool = False, avoid_code_split: bool = False, elide_data_uris: bool = True, drop_elements: list[str] | None = None, token_counter: str = 'auto', document_id: str | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', **converter_options: Any) → list[ProvenanceChunk]

Convert a document and split it into provenance-carrying chunks in one call.

The one-call equivalent of to_ast + all2md.chunking.chunk_ast: convert source (a path, bytes, or file-like object) to an AST, optionally strip node types, and return chunks that each carry their section heading and level and, where the source format records it, the originating page span. Suited to RAG / LLM pipelines.

Parameters:

source (str, Path, IO[bytes], or bytes) – Document to chunk (any supported format).
strategy (str) – Chunking strategy; see all2md.chunking.STRATEGIES (semantic default).
max_tokens (int) – Token budget per chunk.
overlap (int) – Window overlap between chunks.
min_tokens (int) – Floor below which chunks are dropped.
include_preamble (bool) – Emit content that appears before the first heading.
heading_merge (bool) – Prepend each heading to its section’s chunks.
max_heading_level (int, optional) – For fine strategies, only descend into sections at or above this level.
avoid_table_split (bool) – Keep each table whole (one atomic chunk).
avoid_code_split (bool) – Keep each fenced code block whole (one atomic chunk).
elide_data_uris (bool) – Replace long base64 data: URIs with a short placeholder (default True).
drop_elements (list of str, optional) – AST node types to strip before chunking (e.g. ["image", "table"]).
token_counter ({"auto", "tiktoken", "whitespace"}) – Token-counting backend.
document_id (str, optional) – Identifier woven into chunk ids; defaults to the file stem (or "document").
source_format (DocumentFormat, default "auto") – Explicit source format, or auto-detect.
converter_options (Any) – Extra options forwarded to to_ast() (e.g. attachment_mode="skip", pages=[1, 2]).
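To make the size controls concrete, here is a generic sliding-window sketch over whitespace tokens. It is not the library's chunking algorithm, which also respects document structure; `window_tokens` is a hypothetical helper, and reading `overlap` as the number of tokens shared by neighbouring windows is an assumption.

```python
def window_tokens(text, max_tokens=512, overlap=0):
    """Split text into windows of at most max_tokens whitespace tokens.

    Consecutive windows share `overlap` tokens (assumed meaning for this sketch).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return windows
```

A larger overlap repeats more context across chunk boundaries at the cost of more chunks per document.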

Returns:

Chunks in reading order, with prev/next ids linked. Call chunk.to_dict() for a JSON-serializable record.

Return type:

list of ProvenanceChunk

Examples

>>> import all2md
>>> chunks = all2md.chunk("report.pdf", strategy="semantic", max_tokens=512, overlap=64)
>>> chunks[0].section_heading, chunks[0].page, chunks[0].token_count
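Because each chunk exposes to_dict() for a JSON-serializable record, writing chunks as JSON Lines takes only a few lines. The helper name `write_chunks_jsonl` is hypothetical; the sketch relies on nothing beyond the documented `to_dict()` method.

```python
import io
import json


def write_chunks_jsonl(chunks, stream):
    """Write one JSON object per chunk to a text stream; return the count written."""
    count = 0
    for chunk in chunks:
        # to_dict() is documented to return a JSON-serializable record
        stream.write(json.dumps(chunk.to_dict(), ensure_ascii=False) + "\n")
        count += 1
    return count
```

The stream can be an open text file or an `io.StringIO` buffer, and chunks are written in the reading order the function returns them in.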

all2md.confidence_report(source: str | Path | IO[bytes] | bytes | Document, *, parser_options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) → ConfidenceReport

Convert a document and return its conversion confidence report (“quality card”).

A reference-free read on how much to trust a conversion, built from the sanity signals converters already compute (meaningful-text density, OCR reliance, rejected tables, dropped images) plus discrete degraded-content incidents. The single 0-100 score doubles as an optimizer fitness function.

Parameters:

source (str, Path, IO[bytes], bytes, or Document) – Document to inspect. A pre-parsed Document is read directly (its report was attached when it was first parsed via to_ast()).
parser_options (BaseParserOptions, optional) – Pre-configured parser options.
source_format (DocumentFormat, default "auto") – Explicit source format, or auto-detect.
progress_callback (ProgressCallback, optional) – Optional progress callback forwarded to parsing.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour. Defaults to None (disabled).
kwargs (Any) – Individual parser options forwarded to to_ast().

Returns:

The scored quality card. Formats that produce no scored signals and record no degraded events yield a score of 100 banded "not_assessed" – the converter ran no quality checks, so the 100 is not a clean bill of health.

Return type:

ConfidenceReport

Examples

>>> from all2md import confidence_report
>>> report = confidence_report("scan.pdf")
>>> report.score, report.band
(72, 'medium')
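Since a score of 100 banded "not_assessed" is not a clean bill of health, downstream code may want to branch on the band before trusting the number. A minimal sketch follows; the `triage` helper, its action labels, and the `low_threshold` default are hypothetical.

```python
def triage(score, band, low_threshold=50):
    """Map a (score, band) pair to a coarse action label."""
    if band == "not_assessed":
        return "unverified"  # no quality checks ran, so the score carries no signal
    if band == "low" or score < low_threshold:
        return "review"
    return "accept"
```

It would be called as `triage(report.score, report.band)` on a report returned by confidence_report().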

class all2md.ConfidenceReport

Bases: object

Structured “quality card” summarizing how much to trust a conversion.

Parameters:

score (int) – Overall confidence, 0 (untrustworthy) to 100 (no problems observed).
band ({"high", "medium", "low", "not_assessed"}) – Coarse bucket derived from score for quick human/CLI display. "not_assessed" marks reports from converters that ran no quality checks.
producer (str) – Label of the primary producing parser (e.g. "pdf").
signals (dict) – Continuous per-document metrics. Keys are producer-specific; common PDF keys include meaningful_chars, chars_per_page, page_count, ocr_page_fraction, tables_detected, tables_rejected, images_dropped and running_headings_demoted.
degraded_events (list of DegradedEvent) – Discrete lost/approximated-content incidents recorded during parsing.

score: int

band: Literal['high', 'medium', 'low', 'not_assessed']

producer: str

signals: dict[str, Any]

degraded_events: list[DegradedEvent]

to_dict() → dict[str, Any]: Return a JSON-safe dict suitable for Document.metadata['confidence'].

classmethod from_dict(data: dict[str, Any]) → ConfidenceReport: Reconstruct a ConfidenceReport from its to_dict() form.

__init__(score: int, band: Literal['high', 'medium', 'low', 'not_assessed'], producer: str, signals: dict[str, Any] = <factory>, degraded_events: list[DegradedEvent] = <factory>) → None

class all2md.DegradedEvent

Bases: object

A single incident where a converter knowingly lost or approximated content.

Parameters:

parser (str) – Short label of the producing parser (e.g. "pdf", "archive").
kind (str) – Machine-readable event category (e.g. "table_rejected", "unparsed_member", "readability_fallback", "ocr_failed").
count (int, default = 1) – How many times this event occurred. Repeated events of the same (parser, kind, detail, severity) are coalesced with their counts summed.
detail (str or None, default = None) – Optional human-readable qualifier (e.g. the rejection reason).
severity ({"info", "warn", "error"}, default = "warn") – How much the event should weigh on the score.

parser: str

kind: str

count: int = 1

detail: str | None = None

severity: Literal['info', 'warn', 'error'] = 'warn'

to_dict() → dict[str, Any]: Return a JSON-safe dict, omitting detail when unset.

classmethod from_dict(data: dict[str, Any]) → DegradedEvent: Reconstruct a DegradedEvent from its to_dict() form.

__init__(parser: str, kind: str, count: int = 1, detail: str | None = None, severity: Literal['info', 'warn', 'error'] = 'warn') → None
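The coalescing rule described for count (events sharing parser, kind, detail, and severity are merged with their counts summed) can be sketched with a local stand-in class. `Event` and `coalesce` are hypothetical names that mirror the documented fields; this is not the library's own code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:  # local stand-in mirroring the documented DegradedEvent fields
    parser: str
    kind: str
    count: int = 1
    detail: Optional[str] = None
    severity: str = "warn"


def coalesce(events):
    """Merge events with the same (parser, kind, detail, severity), summing counts."""
    totals = Counter()
    for e in events:
        totals[(e.parser, e.kind, e.detail, e.severity)] += e.count
    # Counter keeps first-seen key order, so merged events stay in encounter order.
    return [
        Event(parser=p, kind=k, count=c, detail=d, severity=s)
        for (p, k, d, s), c in totals.items()
    ]
```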

all2md.roundtrip_report(source: str | Path | IO[bytes] | bytes | Document, *, via: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'markdown', source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', parser_options: BaseParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) → RoundTripReport

Round-trip a document through via and score what survived.

Renders the parsed document to the via format, parses the result straight back, and compares the two ASTs structurally. Unlike confidence_report(), this has a ground truth to measure against – the source AST – so a clean document round-tripping through a lossless format scores exactly 100 and any drift is a real defect.

Parameters:

source (str, Path, IO[bytes], bytes, or Document) – Document to round-trip. A pre-parsed Document is used directly as the ground truth, in which case source_format is only a label.
via (DocumentFormat, default "markdown") – Intermediate format to round-trip through. Must have both a renderer and a parser – see roundtrippable_formats().
source_format (DocumentFormat, default "auto") – Explicit source format, or auto-detect.
parser_options (BaseParserOptions, optional) – Options for parsing the source. The intermediate is always parsed with that format’s defaults, since it is machine-generated.
renderer_options (BaseRendererOptions, optional) – Options for rendering to via.
progress_callback (ProgressCallback, optional) – Optional progress callback forwarded to the initial parse.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour. Defaults to None (disabled).
kwargs (Any) – Individual options, split between the source parser and the via renderer.

Returns:

The 0-100 fidelity score, per-dimension metrics, and the concrete structural differences found.

Return type:

RoundTripReport

Raises:

FormatError – If via cannot be both rendered to and parsed back from.

Examples

>>> from all2md import roundtrip_report
>>> report = roundtrip_report("report.docx")
>>> report.score, report.metrics["structure"]
(94, 91)

Check what a conversion to reStructuredText would cost:

>>> report = roundtrip_report("notes.md", via="rst")
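A report's per-dimension metrics can be scanned for the weakest areas. The sketch below uses only the documented `metrics` attribute; the `weak_dimensions` helper and its default threshold are hypothetical.

```python
def weak_dimensions(report, threshold=90):
    """Return (dimension, score) pairs below threshold, lowest score first."""
    weak = [(name, value) for name, value in report.metrics.items() if value < threshold]
    return sorted(weak, key=lambda item: item[1])
```

This gives a quick view of which dimensions drag the overall score down before reading through the individual deltas.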

all2md.roundtrippable_formats() → list[str]

Return the formats that can be both rendered to and parsed back from.

These are the formats accepted by roundtrip_report()’s via parameter: a round trip needs a renderer to get there and a parser to get back.

class all2md.RoundTripReport

Bases: object

Structural fidelity of a parse -> render(via) -> parse round trip.

Parameters:

score (int) – Overall fidelity, 0 (nothing survived) to 100 (structurally identical).
band ({"high", "medium", "low", "not_assessed"}) – Coarse bucket derived from score, using the same thresholds as ConfidenceReport.
source_format (str) – Format the original was parsed from (e.g. "docx").
via (str) – Format the document was round-tripped through (e.g. "markdown").
metrics (dict) – Per-dimension scores in 0-100, keyed by DIMENSION_WEIGHTS. Dimensions the source does not exercise are omitted entirely.
deltas (list of StructuralDelta) – Concrete differences found, most severe and most structural first.

score: int

band: Literal['high', 'medium', 'low', 'not_assessed']

source_format: str

via: str

metrics: dict[str, int]

deltas: list[StructuralDelta]

to_dict() → dict[str, Any]: Return a JSON-safe dict.

classmethod from_dict(data: dict[str, Any]) → RoundTripReport: Reconstruct a RoundTripReport from its to_dict() form.

__init__(score: int, band: Literal['high', 'medium', 'low', 'not_assessed'], source_format: str, via: str, metrics: dict[str, int] = <factory>, deltas: list[StructuralDelta] = <factory>) → None
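A report of this shape lends itself to a quality gate in a test suite. The sketch below relies only on the documented score and deltas attributes and works on any object exposing them; check_fidelity and its min_score threshold are hypothetical names for illustration, not part of all2md.

```python
from types import SimpleNamespace


def check_fidelity(report, min_score=90):
    """Return a list of failure messages; an empty list means the gate passed.

    `report` is any object with the documented `score` and `deltas`
    attributes. The 90-point default is an arbitrary assumption.
    """
    problems = []
    if report.score < min_score:
        problems.append(f"score {report.score} is below {min_score}")
    for delta in report.deltas:
        # Only "error" deltas fail the gate; "info" and "warn" are ignored here.
        if delta.severity == "error":
            label = delta.kind if delta.detail is None else f"{delta.kind} ({delta.detail})"
            problems.append(f"{label} x{delta.count}")
    return problems


# Stand-in objects shaped like RoundTripReport and StructuralDelta.
sample = SimpleNamespace(
    score=84,
    deltas=[
        SimpleNamespace(kind="block_lost", detail=None, count=2, severity="error"),
        SimpleNamespace(kind="inline_lost", detail="emphasis", count=1, severity="info"),
    ],
)
gate_result = check_fidelity(sample)
```

In real use the stand-in would be replaced by the value returned from roundtrip_report().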

class all2md.StructuralDelta

Bases: object

A single concrete difference between the original and the round trip.

Parameters:

kind (str) – Machine-readable category (e.g. "block_lost", "block_changed", "inline_lost", "table_changed", "reference_lost").
detail (str or None, default = None) – Human-readable qualifier, e.g. "heading(h2) -> paragraph".
count (int, default = 1) – How many times this delta occurred. Deltas sharing (kind, detail, severity) are coalesced with their counts summed.
severity ({"info", "warn", "error"}, default = "warn") – How serious the difference is. Purely descriptive: the score comes from the dimension metrics, not from summing delta penalties.

kind: str

detail: str | None = None

count: int = 1

severity: Literal['info', 'warn', 'error'] = 'warn'

to_dict() → dict[str, Any]: Return a JSON-safe dict, omitting detail when unset.

classmethod from_dict(data: dict[str, Any]) → StructuralDelta: Reconstruct a StructuralDelta from its to_dict() form.

__init__(kind: str, detail: str | None = None, count: int = 1, severity: Literal['info', 'warn', 'error'] = 'warn') → None
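The coalescing rule described for count can be illustrated with a small stand-alone sketch. Delta is a stand-in dataclass with the same four fields as StructuralDelta, and coalesce is a hypothetical helper; how all2md performs the merge internally is not shown in this reference.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    """Stand-in with the same four fields as StructuralDelta."""
    kind: str
    detail: Optional[str] = None
    count: int = 1
    severity: str = "warn"


def coalesce(deltas):
    """Merge deltas that share (kind, detail, severity), summing their counts."""
    totals = Counter()
    for d in deltas:
        totals[(d.kind, d.detail, d.severity)] += d.count
    # Counter keeps keys in first-seen order, so the output order is stable.
    return [
        Delta(kind, detail, count, severity)
        for (kind, detail, severity), count in totals.items()
    ]


merged = coalesce([
    Delta("block_changed", "heading(h2) -> paragraph"),
    Delta("inline_lost", severity="info"),
    Delta("block_changed", "heading(h2) -> paragraph", count=3),
])
```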

all2md.optimize_options(source: str | Path | IO[bytes] | bytes, *, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', parser_options: BaseParserOptions | None = None, rounds: int = 1, include_presets: bool = True, sample_pages: int | None = None, remote_input_options: RemoteInputOptions | None = None) → OptimizationReport

Search converter options for the settings that convert source best.

Converts the document many times under different settings and ranks them by a reference-free fidelity objective (see all2md.optimize), so this works on the documents that need it most: the ones with no known-good output to compare against. Returns the winning options as a diff from the defaults, ready to drop into an .all2md.toml.

This is not cheap — it is tens of conversions. Use sample_pages to tune on a slice of a long document, and enable the conversion cache (all2md.conversion_cache.use_conversion_cache()) to skip re-converting option sets already tried.

Parameters:

source (str or Path or file-like or bytes) – The document to tune against.
source_format (str, default "auto") – Override format detection.
parser_options (BaseParserOptions, optional) – Starting point for the search. Options outside the searched knobs are held fixed at whatever this specifies, so it doubles as a way to pin settings the optimizer must not touch.
rounds (int, default 1) – Coordinate-descent passes over the knobs. More rounds can recover knobs that only pay off in combination, at proportionally more conversions.
include_presets (bool, default True) – Score the named presets (quality, complete, …) before refining.
sample_pages (int, optional) – Tune against only the first N pages, so a 400-page document does not have to be reconverted in full for every candidate. Paginated formats only. Use at least 2 (ideally 3+): running headers and footers are recognized by the fact that they repeat, so a single-page sample cannot see them at all.
remote_input_options (RemoteInputOptions, optional) – Controls retrieval when source is a URL.

Returns:

The winning options, the fitness they scored, what the defaults scored, and every candidate evaluated.

Return type:

OptimizationReport

Raises:

FormatError – If the detected format has no tunable knobs.

Examples

>>> from all2md import optimize_options
>>> report = optimize_options("scanned.pdf")
>>> report.best_options
{'table_detection_mode': 'ruling', 'detect_columns': True}
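To make the rounds parameter concrete, here is a toy coordinate descent over a discrete option space. It is a sketch of the general technique only, not the search all2md runs: coordinate_descent and toy_fitness are hypothetical, and the "auto" value for table_detection_mode is an assumed placeholder.

```python
def coordinate_descent(knobs, fitness, start, rounds=1):
    """Tune one knob at a time, keeping a change only if fitness improves.

    knobs   -- {option name: list of candidate values}
    fitness -- callable scoring a full options dict (higher is better)
    start   -- initial options dict
    """
    best = dict(start)
    best_score = fitness(best)
    for _ in range(rounds):
        for name, values in knobs.items():
            for value in values:
                trial = {**best, name: value}
                score = fitness(trial)
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score


def toy_fitness(opts):
    """Toy objective: detect_columns only pays off in "ruling" mode."""
    score = 50.0
    if opts["table_detection_mode"] == "ruling":
        score += 10.0
        if opts["detect_columns"]:
            score += 20.0
    return score


knobs = {
    "detect_columns": [False, True],
    "table_detection_mode": ["auto", "ruling"],
}
start = {"detect_columns": False, "table_detection_mode": "auto"}
one_round = coordinate_descent(knobs, toy_fitness, start, rounds=1)
two_rounds = coordinate_descent(knobs, toy_fitness, start, rounds=2)
```

With one pass, detect_columns is tried before the mode that makes it useful has been selected, so only the second pass picks it up. This is the kind of combination effect that extra rounds can recover.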

all2md.optimizable_formats() → list[str]: Return the formats optimize_options() knows how to tune.

class all2md.OptimizationReport

Bases: object

The outcome of a search: what won, what it beat, and by how much.

source_format: str = ''

best_options: dict[str, Any]: The winning options, as a flat {option: value} diff from the defaults.

best_fitness: float = 0.0

baseline_fitness: float = 0.0: Fitness of the parser’s stock defaults, so the gain is legible.

candidates: list[Candidate]: Every candidate evaluated, best first.

evaluated: int = 0

property gain: float: How much fitness the winning options add over the stock defaults.

property improved: bool: Whether the search beat the defaults at all.

to_dict() → dict[str, Any]: Return a JSON-serializable view.

__init__(source_format: str = '', best_options: dict[str, Any] = <factory>, best_fitness: float = 0.0, baseline_fitness: float = 0.0, candidates: list[Candidate] = <factory>, evaluated: int = 0) → None
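The two derived properties can be read as simple arithmetic on the fitness fields. The sketch below uses a stand-in class, FitnessSummary, and assumes that gain is the plain difference and that improved means a strictly positive gain; the reference does not spell out either definition.

```python
from dataclasses import dataclass


@dataclass
class FitnessSummary:
    """Stand-in carrying the two fitness fields of OptimizationReport."""
    best_fitness: float = 0.0
    baseline_fitness: float = 0.0

    @property
    def gain(self) -> float:
        # Assumed definition: winner minus stock defaults.
        return self.best_fitness - self.baseline_fitness

    @property
    def improved(self) -> bool:
        # Assumed definition: any strictly positive gain counts.
        return self.gain > 0


summary = FitnessSummary(best_fitness=81.5, baseline_fitness=74.0)
```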

class all2md.Candidate

Bases: object

One point in the option space, with its measured yield and fitness.

options: dict[str, Any]: The options that differ from the parser’s defaults, e.g. {"detect_columns": True}.

origin: str = 'default': Where the candidate came from: "default", "preset:quality", "refine:detect_columns".

metrics: DocumentMetrics

fitness: float = 0.0: Pool-relative fitness, 0-100. Only comparable within one run.

dimensions: dict[str, float]: Per-dimension contributions, for explaining why a candidate won.

to_dict() → dict[str, Any]: Return a JSON-serializable view.

__init__(options: dict[str, Any] = <factory>, origin: str = 'default', metrics: DocumentMetrics = <factory>, fitness: float = 0.0, dimensions: dict[str, float] = <factory>) → None

class all2md.DocumentMetrics

Bases: object

The reference-free structural yield of one conversion.

Everything here is read off the parsed AST, except breakage, which comes from the confidence report’s degraded-content incidents.

blocks: int = 0

words: int = 0: Words in the document, counting every block.

boilerplate_words: int = 0: Words belonging to repeated furniture (running headers, page footers), whether they sit in their own block or were glued into a body block.

unique_words: int = 0: words minus the furniture: the body content actually recovered. This, not words, is what the text dimension scores.

duplicate_blocks: int = 0: Blocks whose text is entirely a repeat of another block.

headings: int = 0

list_items: int = 0

links: int = 0

tables: int = 0

table_cells: int = 0

good_cells: float = 0.0: Filled cells discounted by shape regularity, summed over tables with 2+ columns. Quality-weighted recall: this, not table_quality, is what the objective scores. Well-formedness alone rewards missing real tables; count alone rewards inventing them.

table_fill: float = 0.0: Fraction of table cells that are non-empty. A hallucinated table is sparse.

table_regularity: float = 0.0: Fraction of table rows whose column count matches the table’s modal count. A hallucinated table is ragged.

breakage: float = 0.0: 100 - confidence.score: how much real breakage the converter reported.

block_texts: list[str]: The document’s top-level block texts. Kept so the furniture found by any candidate can be re-applied to this one – see score_candidates().

furniture: Furniture: The repeated content this parse revealed.

min_furniture_blocks: int = 2: How many distinct blocks a sequence must span before it counts as furniture. Scales with page count: furniture repeats page after page, ordinary prose does not.

property table_quality: float: Well-formedness of the tables found, 0.0-1.0. Zero if there are none.

to_dict() → dict[str, Any]: Return a JSON-serializable view, including the derived table quality.

__init__(blocks: int = 0, words: int = 0, boilerplate_words: int = 0, unique_words: int = 0, duplicate_blocks: int = 0, headings: int = 0, list_items: int = 0, links: int = 0, tables: int = 0, table_cells: int = 0, good_cells: float = 0.0, table_fill: float = 0.0, table_regularity: float = 0.0, breakage: float = 0.0, block_texts: list[str] = <factory>, furniture: Furniture = <factory>, min_furniture_blocks: int = 2) → None
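The definitions of table_fill and table_regularity translate directly into code. The sketch below computes both for a single table given as a list of rows of cell strings; the function names mirror the fields, but the implementation is an assumption, and the library aggregates over all tables in a document rather than one.

```python
from collections import Counter


def table_fill(rows):
    """Fraction of cells that are non-empty (whitespace-only counts as empty)."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    return sum(1 for cell in cells if cell.strip()) / len(cells)


def table_regularity(rows):
    """Fraction of rows whose column count equals the modal column count."""
    if not rows:
        return 0.0
    widths = Counter(len(row) for row in rows)
    modal_width, _ = widths.most_common(1)[0]
    return sum(1 for row in rows if len(row) == modal_width) / len(rows)


# A well-formed table versus a sparse, ragged one.
clean = [["Name", "Qty"], ["Bolt", "4"], ["Nut", "8"]]
ragged = [["Total", "", ""], ["12"], ["", "x", ""], ["7", ""]]
```

The ragged table scores low on both measures, which matches the remark that a hallucinated table tends to be sparse and ragged.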

class all2md.ProgressEvent

Bases: object

Progress event for document conversion operations.

This class represents a single progress event emitted during document conversion. Events use a canonical set of event types with documented semantics to ensure predictable external integration and consistent progress reporting.

Parameters:

event_type (EventType) –
Type of progress event (canonical types):
- "started": Conversion/parsing has begun
  Use at the start of any conversion operation. Set total to the expected number of items if known.
- "item_done": A discrete unit has been completed
  Generic event for any completed unit: page, section, file, stage, etc. Use metadata["item_type"] to specify what was completed. Examples: page, slide, section, tokenization, preamble, structure
- "detected": Something discovered during parsing
  Use when finding notable structures during parsing. Use metadata["detected_type"] to specify what was detected. Examples: table, image, chart, heading, reference
- "finished": Conversion/parsing completed successfully
  Use at the end of successful conversion. Set current=total to indicate completion.
- "error": An error occurred during conversion
  Use when errors occur. Include details in metadata["error"]. Conversion may continue after errors for partial results.
message (str) – Human-readable description of the event
current (int, default 0) – Current progress position (e.g., current page number, items completed)
total (int, default 0) – Total items to process (e.g., total pages). Set to 0 if unknown.
metadata (dict, default empty) –
Additional event-specific information:
- For "started": Optional context about the operation
- For "item_done": {"item_type": str} - type of item completed
- For "detected": {"detected_type": str, additional context}
- For "error": {"error": str, "stage": str, additional context}

Examples

Started event:

>>> event = ProgressEvent("started", "Converting document.pdf", current=0, total=10)

Item completed (page):

>>> event = ProgressEvent(
...     "item_done",
...     "Page 3 of 10",
...     current=3,
...     total=10,
...     metadata={"item_type": "page"}
... )

Item completed (parsing stage):

>>> event = ProgressEvent(
...     "item_done",
...     "Tokenization complete",
...     current=30,
...     total=100,
...     metadata={"item_type": "tokenization"}
... )

Structure detected:

>>> event = ProgressEvent(
...     "detected",
...     "Found 2 tables on page 5",
...     current=5,
...     total=10,
...     metadata={"detected_type": "table", "table_count": 2, "page": 5}
... )

Error:

>>> event = ProgressEvent(
...     "error",
...     "Failed to parse page 7",
...     current=7,
...     total=10,
...     metadata={"error": "Invalid PDF structure", "stage": "page_parsing", "page": 7}
... )

Finished:

>>> event = ProgressEvent("finished", "Conversion complete", current=10, total=10)

Notes

Legacy event types ("page_done", "table_detected", "tokenization_done", etc.) are deprecated in favor of canonical types with metadata. Parsers should migrate:
- "page_done" -> "item_done" with metadata={"item_type": "page"}
- "table_detected" -> "detected" with metadata={"detected_type": "table"}
- "tokenization_done" -> "item_done" with metadata={"item_type": "tokenization"}
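The migration above can be expressed as a lookup table. This is a minimal sketch covering only the three legacy names listed in the notes; LEGACY_EVENT_MAP and canonicalize are hypothetical helpers, not library functions.

```python
# Mapping taken from the migration notes; other legacy names are not listed
# there, so they are deliberately left unmapped.
LEGACY_EVENT_MAP = {
    "page_done": ("item_done", {"item_type": "page"}),
    "table_detected": ("detected", {"detected_type": "table"}),
    "tokenization_done": ("item_done", {"item_type": "tokenization"}),
}


def canonicalize(event_type, metadata=None):
    """Return (canonical_type, metadata) for a legacy or canonical event type."""
    merged = dict(metadata or {})
    if event_type in LEGACY_EVENT_MAP:
        canonical, extra = LEGACY_EVENT_MAP[event_type]
        merged.update(extra)
        return canonical, merged
    # Already canonical (or unknown): pass through unchanged.
    return event_type, merged
```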

event_type: Literal['started', 'item_done', 'detected', 'finished', 'error']

message: str

current: int = 0

total: int = 0

metadata: dict[str, Any]

__init__(event_type: Literal['started', 'item_done', 'detected', 'finished', 'error'], message: str, current: int = 0, total: int = 0, metadata: dict[str, Any] = <factory>) → None
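A progress callback only needs the five documented fields. The sketch below formats an event as a one-line status and wraps that in a callback with the Callable[[ProgressEvent], None] shape; format_progress and print_progress are hypothetical names, and a SimpleNamespace stands in for a real ProgressEvent.

```python
from types import SimpleNamespace


def format_progress(event):
    """Turn a progress event into a one-line status string."""
    if event.event_type == "error":
        return f"ERROR: {event.metadata.get('error', event.message)}"
    if event.event_type == "detected":
        return f"found {event.metadata.get('detected_type', 'item')}: {event.message}"
    if event.total > 0:
        percent = 100 * event.current // event.total
        return f"[{percent:3d}%] {event.message}"
    # total == 0 means the total is unknown, so no percentage is shown.
    return event.message


def print_progress(event):
    """Callback with the Callable[[ProgressEvent], None] shape."""
    print(format_progress(event))


# Stand-in with the documented ProgressEvent fields.
page_event = SimpleNamespace(
    event_type="item_done", message="Page 3 of 10",
    current=3, total=10, metadata={"item_type": "page"},
)
status_line = format_progress(page_event)
```

A callback like print_progress could then be passed as the progress_callback argument of functions that accept one, such as roundtrip_report().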

class all2md.BaseRendererOptions

Bases: CloneFrozenMixin

Base class for all renderer options.

This class serves as the foundation for format-specific renderer options. Renderers convert AST documents into various output formats (Markdown, DOCX, PDF, etc.).

Parameters:

fail_on_resource_errors (bool, default=False) – Whether to raise RenderingError when resource loading fails (e.g., images). If False (default), warnings are logged but rendering continues. If True, rendering stops immediately on resource errors.
max_asset_size_bytes (int) – Maximum allowed size in bytes for any single asset (images, downloads, etc.)

Notes

Subclasses should define format-specific rendering options as frozen dataclass fields.

fail_on_resource_errors: bool = False

max_asset_size_bytes: int = 52428800

metadata_policy: MetadataRenderPolicy

creator: str | None = 'all2md'

__init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md') → None

class all2md.BaseParserOptions

Bases: CloneFrozenMixin

Base class for all parser options.

This class serves as the foundation for format-specific parser options. Parsers convert source documents into AST representation.

For parsers that handle attachments (images, downloads, etc.), also inherit from AttachmentOptionsMixin to get attachment-related configuration fields.

Parameters:

extract_metadata (bool, default False) – Whether to extract document metadata.

Notes

Subclasses should define format-specific parsing options as frozen dataclass fields.

For parsers handling binary assets (PDF, DOCX, HTML, etc.), also inherit from AttachmentOptionsMixin:

@dataclass(frozen=True)
class PdfOptions(BaseParserOptions, AttachmentOptionsMixin):
    pass

extract_metadata: bool = False

__init__(extract_metadata: bool = False) → None

class all2md.NetworkFetchOptions

Bases: CloneFrozenMixin

Network security options for remote resource fetching.

This dataclass contains settings that control how remote resources (images, CSS, etc.) are fetched, including security constraints to prevent SSRF attacks.

Parameters:

allow_remote_fetch (bool, default False) – Whether to allow fetching remote URLs for images and other resources. When False, prevents SSRF attacks by blocking all network requests.
allowed_hosts (list[str] | None, default None) – List of allowed hostnames or CIDR blocks for remote fetching. If None and allow_remote_fetch=True, all hosts are allowed, which may pose an SSRF (Server-Side Request Forgery) risk. A security warning will be logged. In security-sensitive contexts, explicitly set this to an allowlist of trusted hosts.
require_https (bool, default True) – Whether to require HTTPS for all remote URL fetching.
require_head_success (bool, default True) – Whether to require a successful HEAD request before fetching remote URLs.
network_timeout (float, default 10.0) – Timeout in seconds for remote URL fetching.
max_requests_per_second (float, default 10.0) – Maximum number of network requests per second (rate limiting).
max_concurrent_requests (int, default 5) – Maximum number of concurrent network requests.

Notes

Asset size limits are inherited from BaseParserOptions.max_asset_size_bytes.

allow_remote_fetch: bool = False

allowed_hosts: list[str] | None = None

require_https: bool = True

require_head_success: bool = True

network_timeout: float = 10.0

max_redirects: int = 5

allowed_content_types: tuple[str, ...] | None = ('image/',)

max_requests_per_second: float = 10.0

max_concurrent_requests: int = 5

__init__(allow_remote_fetch: bool = False, allowed_hosts: list[str] | None = None, require_https: bool = True, require_head_success: bool = True, network_timeout: float = 10.0, max_redirects: int = 5, allowed_content_types: tuple[str, ...] | None = ('image/',), max_requests_per_second: float = 10.0, max_concurrent_requests: int = 5) → None
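Since allowed_hosts accepts both hostnames and CIDR blocks, a check against it has to handle both kinds of entry. The sketch below shows one way to do that with the standard ipaddress module. host_allowed is a hypothetical helper, and its matching rules (exact case-insensitive hostnames, IP literals tested against CIDR entries) are assumptions rather than the library's documented behaviour.

```python
import ipaddress


def host_allowed(host, allowed_hosts):
    """Check a host against an allowlist of hostnames and CIDR blocks.

    None means no restriction, mirroring the documented allowed_hosts=None
    behaviour. The matching rules are assumptions made for this sketch.
    """
    if allowed_hosts is None:
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # not an IP literal, so treat it as a hostname
    for entry in allowed_hosts:
        if address is not None:
            try:
                network = ipaddress.ip_network(entry, strict=False)
            except ValueError:
                continue  # entry is a hostname and cannot match an IP literal
            if address in network:
                return True
        elif entry.lower() == host.lower():
            return True
    return False
```

Note that a production check would also need to resolve hostnames and validate the resolved addresses, which this sketch does not attempt.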

class all2md.LocalFileAccessOptions

Bases: CloneFrozenMixin

Local file access security options.

This dataclass contains settings that control access to local files via file:// URLs and similar mechanisms.

Parameters:

allow_local_files (bool, default False) – Whether to allow access to local files via file:// URLs.
local_file_allowlist (list[str] | None, default None) – List of directories allowed for local file access. Only applies when allow_local_files=True.
local_file_denylist (list[str] | None, default None) – List of directories denied for local file access.
allow_cwd_files (bool, default False) – Whether to allow local files from current working directory and subdirectories.

allow_local_files: bool = False

local_file_allowlist: list[str] | None = None

local_file_denylist: list[str] | None = None

allow_cwd_files: bool = False

__init__(allow_local_files: bool = False, local_file_allowlist: list[str] | None = None, local_file_denylist: list[str] | None = None, allow_cwd_files: bool = False) → None
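How an allowlist and a denylist of directories might combine can be sketched as follows. The precedence used here (denylist wins, an allowlist of None places no restriction) is an assumption, since the reference does not state it, and local_file_permitted is a hypothetical helper.

```python
from pathlib import PurePosixPath


def _inside(path, directory):
    """True if path equals directory or lies beneath it."""
    return path == directory or directory in path.parents


def local_file_permitted(path, allowlist=None, denylist=None):
    """Decide whether an absolute path passes allow/deny directory lists.

    Assumptions for this sketch: the denylist wins over the allowlist, and
    an allowlist of None places no restriction. Pure paths are compared as
    written; symlinks and ".." segments are NOT resolved here.
    """
    target = PurePosixPath(path)
    if any(_inside(target, PurePosixPath(d)) for d in denylist or []):
        return False
    if allowlist is None:
        return True
    return any(_inside(target, PurePosixPath(d)) for d in allowlist)
```

Comparing path components rather than string prefixes avoids treating /srv/docs-other as if it were inside /srv/docs.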

class all2md.HtmlRendererOptions

Bases: BaseRendererOptions

Configuration options for rendering AST to HTML format.

This dataclass contains settings specific to HTML generation, including document structure, styling, templating, and feature toggles.

Parameters:

standalone (bool, default True) – Generate complete HTML document with <html>, <head>, <body> tags. If False, generates only the content fragment. Ignored when template_mode is not None.
css_style ({"inline", "embedded", "external", "none"}, default "embedded") –
How to include CSS styles:
- "inline": Add style attributes to elements
- "embedded": Include <style> block in <head>
- "external": Reference external CSS file
- "none": No styling
css_file (str or None, default None) – Path to external CSS file (used when css_style="external").
include_toc (bool, default False) – Generate table of contents from headings.
syntax_highlighting (bool, default True) – Add language classes to code blocks for syntax highlighting.
render_mermaid (bool, default False) – Render fenced code blocks whose language is mermaid as <pre class="mermaid"> (a hook for a client-side mermaid.js) instead of the usual <pre><code>. Used by the view/serve commands.
escape_html (bool, default True) – Escape HTML special characters in text content.
math_renderer ({"mathjax", "katex", "none"}, default "mathjax") –
Math rendering library to use for MathML/LaTeX math:
- "mathjax": Include MathJax CDN script
- "katex": Include KaTeX CDN script
- "none": Render math as plain text
html_passthrough_mode ({"pass-through", "escape", "drop", "sanitize"}, default "escape") –
How to handle HTMLBlock and HTMLInline nodes:
- "pass-through": Pass through unchanged (use only with trusted content)
- "escape": HTML-escape the content
- "drop": Remove HTML content entirely
- "sanitize": Remove dangerous elements/attributes (requires bleach for best results)
language (str, default "en") – Document language code (ISO 639-1) for the <html lang="…"> attribute. Can be overridden by document metadata.
external_links_new_tab (bool, default False) – If True, external links (absolute http/https/ftp/mailto URLs) are rendered with target="_blank" rel="noopener noreferrer" so they open in a new tab. Relative and anchor links are unaffected. Used by the view/serve commands.
template_mode ({"inject", "replace", "jinja"} or None, default None) –
Template mode for rendering HTML:
- None: Use standalone mode (default behavior)
- "inject": Inject content into existing HTML file at selector
- "replace": Replace placeholders in template file
- "jinja": Use Jinja2 template engine with full context
When set, standalone is ignored.
template_file (str or None, default None) – Path to template file (required when template_mode is not None).
template_selector (str, default "#content") – CSS selector for injection target (used with template_mode="inject").
toc_selector (str or None, default None) – CSS selector for separate TOC injection point (used with template_mode="inject"). If not set, TOC is included with content at template_selector. Allows placing TOC in a different location like a sidebar or header.
injection_mode ({"append", "prepend", "replace"}, default "replace") –
How to inject content at selector (used with template_mode="inject"):
- "append": Add content after existing content
- "prepend": Add content before existing content
- "replace": Replace existing content
content_placeholder (str, default "{CONTENT}") – Placeholder string to replace with content (used with template_mode="replace").
css_class_map (dict[str, str | list[str]] or None, default None) – Map AST node type names to custom CSS classes. Example: {"Heading": "article-heading", "CodeBlock": ["code", "highlight"]}
allow_remote_scripts (bool, default False) – Allow loading remote scripts (e.g., MathJax/KaTeX from CDN). Default is False for security; CDN usage requires explicit opt-in. When False and math_renderer != "none", a warning is issued.
csp_enabled (bool, default True) – Add Content-Security-Policy meta tag to standalone HTML documents. Helps prevent XSS attacks by restricting resource loading.
csp_policy (str or None, default (secure policy)) – Custom Content-Security-Policy header value. If None, uses the default: "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"
comment_mode ({"native", "visible", "ignore"}, default "native") –
How to render Comment and CommentInline AST nodes:
- "native": Render as HTML comments (<!-- Comment by Author: text -->)
- "visible": Render as visible <div>/<span> elements with class="comment" and metadata in data attributes
- "ignore": Skip comment nodes entirely
This controls presentation of comments from DOCX reviewer comments, source HTML comments, and other format-specific annotations.

Examples

Inject into existing HTML:

>>> options = HtmlRendererOptions(
...     template_mode="inject",
...     template_file="layout.html",
...     template_selector="#main-content"
... )

Replace placeholders:

>>> options = HtmlRendererOptions(
...     template_mode="replace",
...     template_file="template.html",
...     content_placeholder="{CONTENT}"
... )

Use Jinja2 template:

>>> options = HtmlRendererOptions(
...     template_mode="jinja",
...     template_file="article.html"
... )

Custom CSS classes:

>>> options = HtmlRendererOptions(
...     css_class_map={"Heading": "prose-heading", "CodeBlock": "code-block"}
... )

standalone: bool = True

css_style: Literal['inline', 'embedded', 'external', 'none'] = 'embedded'

css_file: str | None = None

include_toc: bool = False

syntax_highlighting: bool = True

render_mermaid: bool = False

escape_html: bool = True

math_renderer: Literal['mathjax', 'katex', 'none'] = 'mathjax'

html_passthrough_mode: Literal['pass-through', 'escape', 'drop', 'sanitize'] = 'escape'

language: str = 'en'

external_links_new_tab: bool = False

template_mode: Literal['inject', 'replace', 'jinja'] | None = None

template_file: str | None = None

template_selector: str = '#content'

toc_selector: str | None = None

injection_mode: Literal['append', 'prepend', 'replace'] = 'replace'

content_placeholder: str = '{CONTENT}'

css_class_map: dict[str, str | list[str]] | None = None

allow_remote_scripts: bool = False

csp_enabled: bool = True

csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"

comment_mode: Literal['native', 'visible', 'ignore'] = 'native'

__init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md', standalone: bool = True, css_style: CssStyle = 'embedded', css_file: str | None = None, include_toc: bool = False, syntax_highlighting: bool = True, render_mermaid: bool = False, escape_html: bool = True, math_renderer: MathRenderer = 'mathjax', html_passthrough_mode: HtmlPassthroughMode = 'escape', language: str = 'en', external_links_new_tab: bool = False, template_mode: TemplateMode | None = None, template_file: str | None = None, template_selector: str = '#content', toc_selector: str | None = None, injection_mode: InjectionMode = 'replace', content_placeholder: str = '{CONTENT}', css_class_map: dict[str, str | list[str]] | None=None, allow_remote_scripts: bool = False, csp_enabled: bool = True, csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';", comment_mode: HtmlCommentMode = 'native') → None
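Because css_class_map values may be either a single string or a list of strings, a renderer has to normalize them before emitting a class attribute. The sketch below shows one plausible normalization; class_attribute is a hypothetical helper and is not taken from the all2md source.

```python
import html


def class_attribute(node_type, css_class_map):
    """Build a class="..." attribute for an AST node type, or "" if unmapped."""
    if not css_class_map or node_type not in css_class_map:
        return ""
    classes = css_class_map[node_type]
    if isinstance(classes, str):
        classes = [classes]  # normalize the single-string form to a list
    joined = html.escape(" ".join(classes), quote=True)
    return f' class="{joined}"'


class_map = {"Heading": "article-heading", "CodeBlock": ["code", "highlight"]}
```

The leading space in the returned string lets the result be appended directly after a tag name, for example "<pre" + class_attribute("CodeBlock", class_map) + ">".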

class all2md.HtmlOptions

Bases: BaseParserOptions, AttachmentOptionsMixin

Configuration options for HTML-to-Markdown conversion.

This dataclass contains settings specific to HTML document processing, including heading styles, title extraction, image handling, content sanitization, and advanced formatting options. Inherits attachment handling from AttachmentOptionsMixin for images and embedded media.

Parameters:

extract_title (bool, default False) – Whether to extract and use the HTML <title> element.
convert_nbsp (bool, default False) – Whether to convert non-breaking spaces (&nbsp;) to regular spaces in the output.
strip_dangerous_elements (bool, default False) – Whether to remove potentially dangerous HTML elements (script, style, etc.) and event handler attributes (onclick, onload, etc.).
strip_framework_attributes (bool, default False) – Whether to remove JavaScript framework attributes (Alpine.js x-, Vue.js v-, Angular ng-, HTMX hx-, etc.) that can execute code in framework contexts. Only needed if output HTML will be rendered in browsers with these frameworks.
detect_table_alignment (bool, default True) – Whether to automatically detect table column alignment from CSS/attributes.
allowed_attributes (tuple[str, ...] | dict[str, tuple[str, ...]] | None, default None) –
Whitelist of allowed HTML attributes. Supports two modes:
- Global allowlist: tuple of attribute names applied to all elements
- Per-element allowlist: dict mapping element names to tuples of allowed attributes
Note: When using the CLI, pass complex dict structures as JSON strings for proper parsing.
base_url (str or None, default None) – Base URL for resolving relative hrefs in <a> tags. This is separate from attachment_base_url (used for images/assets). Allows precise control over navigational link URLs vs. resource URLs.

Examples

Convert and extract page title:

>>> options = HtmlOptions(extract_title=True)

Convert with content sanitization:

>>> options = HtmlOptions(strip_dangerous_elements=True, convert_nbsp=True)

Use global attribute allowlist:

>>> options = HtmlOptions(allowed_attributes=('class', 'id', 'href', 'src'))

Use per-element attribute allowlist:

>>> options = HtmlOptions(allowed_attributes={
...     'img': ('src', 'alt', 'title'),
...     'a': ('href', 'title'),
...     'div': ('class', 'id')
... })

Extract only the readable article content:

>>> options = HtmlOptions(extract_readable=True)

extract_title: bool = False

convert_nbsp: bool = False

strip_dangerous_elements: bool = False

strip_framework_attributes: bool = False

detect_table_alignment: bool = True

network: NetworkFetchOptions

local_files: LocalFileAccessOptions

strip_comments: bool = True

collapse_whitespace: bool = True

extract_readable: bool = False

br_handling: Literal['newline', 'space'] = 'newline'

allowed_elements: tuple[str, ...] | None = None

allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None = None

figures_parsing: Literal['blockquote', 'paragraph', 'image_with_caption', 'caption_only', 'html', 'skip'] = 'blockquote'

details_parsing: Literal['blockquote', 'paragraph', 'html', 'skip'] = 'blockquote'

extract_microdata: bool = True

base_url: str | None = None

html_parser: Literal['html.parser', 'html5lib', 'lxml'] = 'html.parser'

__init__(attachment_mode: AttachmentMode = 'alt_text', alt_text_mode: AltTextMode = 'default', attachment_output_dir: str | None = None, attachment_base_url: str | None = None, max_asset_size_bytes: int = 52428800, attachment_filename_template: str = '{stem}_{type}{seq}.{ext}', attachment_overwrite: AttachmentOverwriteMode = 'unique', attachment_deduplicate_by_hash: bool = False, attachments_footnotes_section: str | None = 'Attachments', extract_metadata: bool = False, extract_title: bool = False, convert_nbsp: bool = False, strip_dangerous_elements: bool = False, strip_framework_attributes: bool = False, detect_table_alignment: bool = True, network: NetworkFetchOptions = <factory>, local_files: LocalFileAccessOptions = <factory>, strip_comments: bool = True, collapse_whitespace: bool = True, extract_readable: bool = False, br_handling: BrHandling = 'newline', allowed_elements: tuple[str, ...] | None=None, allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None=None, figures_parsing: FiguresParsing = 'blockquote', details_parsing: DetailsParsing = 'blockquote', extract_microdata: bool = True, base_url: str | None = None, html_parser: HtmlParser = 'html.parser') → None
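The two allowed_attributes modes can be illustrated with a small filter over one element's attributes. This is a sketch under stated assumptions: filter_attributes is a hypothetical helper, and the choice that an element missing from a per-element dict keeps no attributes is a guess, not documented behaviour.

```python
def filter_attributes(tag, attrs, allowed_attributes):
    """Keep only the attributes an allowlist permits for one element.

    tuple -> global allowlist applied to every element
    dict  -> per-element allowlist; in this sketch an element missing from
             the dict keeps no attributes (an assumption)
    None  -> no filtering
    """
    if allowed_attributes is None:
        return dict(attrs)
    if isinstance(allowed_attributes, dict):
        permitted = allowed_attributes.get(tag, ())
    else:
        permitted = allowed_attributes
    return {name: value for name, value in attrs.items() if name in permitted}


img_attrs = {"src": "a.png", "alt": "A", "onclick": "steal()"}
per_element = {"img": ("src", "alt", "title"), "a": ("href", "title")}
```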

exception all2md.DependencyError

Bases: All2MdError

Exception raised when required dependencies are not available.

This exception is raised when attempting to use a converter that requires external packages that are not installed or don’t meet version requirements.

Parameters:

converter_name (str) – Name of the converter requiring dependencies
missing_packages (list[tuple[str, str]]) – List of (package_name, version_spec) tuples for missing packages
version_mismatches (list[tuple[str, str, str]], optional) – List of (package_name, required_version, installed_version) tuples for packages with version mismatches
install_command (str, optional) – Suggested pip install command to resolve the issue
message (str, optional) – Custom error message. If not provided, generates a helpful message

Variables:

converter_name (str) – The converter that has missing dependencies
missing_packages (list[tuple[str, str]]) – Packages that need to be installed
version_mismatches (list[tuple[str, str, str]]) – Packages with version mismatches
install_command (str) – Command to install missing dependencies

__init__(converter_name: str, missing_packages: list[tuple[str, str]], version_mismatches: list[tuple[str, str, str]] | None = None, install_command: str = '', message: str | None = None, original_import_error: ImportError | None = None): Initialize the dependency error with package details.

exception all2md.All2MdError

Bases: Exception

Base exception class for all all2md-specific errors.

This serves as the root exception class for all custom exceptions raised by the all2md library. Catching this will catch all library-specific errors.

Parameters:

message (str) – Human-readable description of the error
original_error (Exception, optional) – The original exception that caused this error, if applicable

Variables:

message (str) – The error message
original_error (Exception or None) – The wrapped original exception, if any

__init__(message: str, original_error: Exception | None = None): Initialize the error with a message and optional original exception.
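Since every library-specific exception derives from All2MdError, a single `except All2MdError` clause also catches subclasses such as ParsingError, and `original_error` exposes the wrapped cause. The sketch below illustrates that pattern under stated assumptions: the two classes are stand-ins that mirror the documented signatures so the snippet runs without all2md installed, and `convert` is a hypothetical function, not part of the library.

```python
# Stand-ins so the sketch runs without all2md installed. With the library
# available, replace them with: from all2md import All2MdError, ParsingError
class All2MdError(Exception):
    def __init__(self, message, original_error=None):
        super().__init__(message)
        self.message = message
        self.original_error = original_error


class ParsingError(All2MdError):
    def __init__(self, message, parsing_stage=None, original_error=None):
        super().__init__(message, original_error)
        self.parsing_stage = parsing_stage


def convert(path):
    """Hypothetical converter that wraps a low-level failure."""
    try:
        raise ValueError("unexpected end of stream")
    except ValueError as exc:
        # Keep the cause both as an attribute and as the chained exception.
        raise ParsingError(
            f"could not parse {path}",
            parsing_stage="structure",  # illustrative stage name
            original_error=exc,
        ) from exc


caught = None
try:
    convert("report.pdf")
except All2MdError as err:  # the base class also catches ParsingError
    caught = err

print(type(caught).__name__, "-", caught.message)
print("caused by:", repr(caught.original_error))
```

Catching the base class suits top-level error reporting; code that needs to react differently to, say, a missing dependency can list the more specific exception classes first.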

exception all2md.FormatError

Bases: All2MdError

Exception raised when attempting to process an unsupported file format.

This exception indicates that the requested file format or conversion operation is not supported by the current version of all2md.

Parameters:

message (str, optional) – Custom error message
format_type (str, optional) – The unsupported format type (file extension or MIME type)
supported_formats (list[str], optional) – List of supported formats for reference
original_error (Exception, optional) – The original exception that caused this error

Variables:

format_type (str or None) – The format that was not supported
supported_formats (list[str] or None) – Available supported formats

__init__(message: str | None = None, format_type: str | None = None, supported_formats: list[str] | None = None, original_error: Exception | None = None): Initialize the format error.
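The `format_type` and `supported_formats` attributes can drive a "did you mean" hint for the user. This is a sketch under stated assumptions, not library behaviour: `suggest_format` is a hypothetical helper built on the standard-library `difflib`, the `FormatError` class is a stand-in that mirrors the documented signature (its default message is invented), and the format list is illustrative.

```python
import difflib


# Stand-in so the sketch runs without all2md installed. With the library
# available, replace it with: from all2md import FormatError
class FormatError(Exception):
    def __init__(self, message=None, format_type=None,
                 supported_formats=None, original_error=None):
        # The default message here is an assumption of this sketch.
        super().__init__(message or f"Unsupported format: {format_type}")
        self.format_type = format_type
        self.supported_formats = supported_formats
        self.original_error = original_error


def suggest_format(err):
    """Return the supported format closest to the rejected one, or None."""
    # Both attributes are documented as possibly None.
    if not err.format_type or not err.supported_formats:
        return None
    matches = difflib.get_close_matches(
        err.format_type, err.supported_formats, n=1
    )
    return matches[0] if matches else None


err = FormatError(
    format_type="dox",
    supported_formats=["pdf", "docx", "pptx", "html"],
)
suggestion = suggest_format(err)
if suggestion:
    print(f"Unsupported format {err.format_type!r}; did you mean {suggestion!r}?")
```

The helper returns None rather than guessing when either attribute is missing, which matches the documented `str or None` and `list[str] or None` types.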

exception all2md.ParsingError

Bases: All2MdError

Exception raised when document parsing fails.

This exception is raised when the parsing process encounters an error that prevents successful completion, such as:

Malformed document structure
Unsupported document features
Password-protected files

Parameters:

message (str) – Description of the parsing failure
parsing_stage (str, optional) – The stage of parsing where the error occurred
original_error (Exception, optional) – The underlying exception that caused the parsing failure

Variables:

parsing_stage (str or None) – Where in the parsing process the error occurred

__init__(message: str, parsing_stage: str | None = None, original_error: Exception | None = None): Initialize the parsing error.

For organized API documentation, see the API Reference, which groups modules by functionality.