all2md

all2md - A Python document conversion library for bidirectional transformation.

all2md provides a comprehensive solution for converting between various file formats and Markdown. It supports PDF, Word (DOCX), PowerPoint (PPTX), HTML, email (EML), Excel (XLSX), Jupyter Notebooks (IPYNB), EPUB e-books, images, and 200+ text file formats with intelligent content extraction and formatting preservation.

The library uses a modular architecture where the main to_markdown() function automatically detects file types and routes to appropriate specialized parsers. Each converter module handles specific format requirements while maintaining consistent Markdown output with support for tables, images, and complex formatting.

Key Features

  • Advanced PDF parsing with table detection using PyMuPDF

  • Word document processing with formatting preservation

  • PowerPoint slide-by-slide extraction

  • HTML processing with configurable conversion options

  • Email chain parsing with attachment handling

  • Base64 image embedding support

  • Support for 200+ plaintext file formats

  • AST-based transformation pipeline for document manipulation

  • Plugin system for custom transforms via entry points

Supported Formats

  • Documents: PDF, DOCX, PPTX, HTML, EML, EPUB

  • Notebooks: IPYNB (Jupyter Notebooks)

  • Spreadsheets: XLSX, CSV, TSV

  • Images: PNG, JPEG, GIF (embedded as base64)

  • Text: 200+ formats including code files, configs, markup

Requirements

  • Python 3.10+

  • Optional dependencies loaded per format (PyMuPDF, python-docx, etc.)

Examples

Basic usage for file conversion:

>>> from all2md import to_markdown
>>> markdown_content = to_markdown('document.pdf')
>>> print(markdown_content)

Using AST transforms to manipulate documents:

>>> from all2md import to_markdown
>>> from all2md.transforms import RemoveImagesTransform, HeadingOffsetTransform
>>>
>>> # Apply transforms during conversion
>>> markdown = to_markdown(
...     'document.pdf',
...     transforms=[
...         RemoveImagesTransform(),
...         HeadingOffsetTransform(offset=1)
...     ]
... )

Working with the AST directly:

>>> from all2md import to_ast
>>> from all2md.transforms import render
>>>
>>> # Convert to AST
>>> doc = to_ast('document.pdf')
>>>
>>> # Apply transforms and render
>>> markdown = render(doc, transforms=['remove-images', 'heading-offset'])

See also

all2md.transforms

AST transformation system

all2md.ast

AST node definitions and utilities

all2md.to_markdown(source: str | Path | IO[bytes] | bytes | Document, *, parser_options: BaseParserOptions | None = None, renderer_options: MarkdownRendererOptions | None = None, options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', flavor: str | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) str

Convert document to Markdown format with enhanced format detection.

This is the main entry point for the all2md library. It can detect file formats from filenames, content analysis, or explicit format specification, then routes to the appropriate specialized converter for processing.

Parameters:
  • source (str, Path, IO[bytes], bytes, or Document) – Source document data, which can be a file path, a file-like object, raw bytes, or an AST Document object (for cases where you already have a parsed AST).

  • parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).

  • renderer_options (MarkdownRendererOptions, optional) – Pre-configured renderer options for Markdown rendering settings (e.g., flavor, emphasis_symbol).

  • options (BaseParserOptions, optional) –

    Deprecated: Use parser_options instead.

    Deprecated alias for parser_options. Cannot be used together with parser_options.

  • source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.

  • flavor (str, optional) – Markdown flavor/dialect to use for output. Options: “gfm”, “commonmark”, “multimarkdown”, “pandoc”, “kramdown”, “markdown_plus”. Shorthand for renderer_options=MarkdownRendererOptions(flavor=…).

  • transforms (list, optional) – List of AST transforms to apply before rendering. Can be transform names (strings) or NodeTransformer instances. Transforms are applied in order. See all2md.transforms for available transforms.

  • hooks (dict, optional) – Transform hooks to execute during processing. Maps hook names to callable functions that execute at specific points in the transform pipeline.

  • progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.

  • remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour (network allowlists, size limits, etc.). Defaults to None, which disables remote fetching.

  • kwargs (Any) – Individual conversion options. Kwargs are intelligently split between parser and renderer based on field names. Parser-related kwargs override fields in parser_options, renderer-related kwargs override fields in renderer_options.

Returns:

Document content converted to Markdown format.

Return type:

str

Raises:
  • DependencyError – If required dependencies for a specific format are not installed.

  • ParsingError – If file processing fails due to corruption or format issues.

Examples

Basic conversion:

>>> markdown = to_markdown("document.pdf")

With parser options:

>>> pdf_opts = PdfOptions(pages=[0, 1, 2], attachment_mode="save")
>>> markdown = to_markdown("document.pdf", parser_options=pdf_opts)

With renderer options:

>>> md_opts = MarkdownRendererOptions(emphasis_symbol="_", flavor="commonmark")
>>> markdown = to_markdown("document.pdf", renderer_options=md_opts)

Using both parser and renderer options:

>>> markdown = to_markdown("doc.pdf",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="gfm"))

Using kwargs (automatically split):

>>> markdown = to_markdown("doc.pdf", pages=[0, 1], emphasis_symbol="_")

Using flavor shorthand:

>>> markdown = to_markdown("document.pdf", flavor="commonmark")

With transforms:

>>> markdown = to_markdown("doc.pdf", transforms=["remove-images"])

From AST Document:

>>> ast_doc = to_ast("document.pdf")
>>> # Apply custom processing to ast_doc...
>>> markdown = to_markdown(ast_doc)
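
The kwargs splitting described for to_markdown can be pictured as routing each keyword by whether its name is a field of the parser or the renderer options class. The following is only a sketch of that idea under stated assumptions, not all2md's implementation: ToyPdfOptions, ToyMarkdownOptions, and split_kwargs are hypothetical stand-ins, and rejecting unknown names is a choice made for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace
from typing import Any


# Hypothetical stand-ins for a parser options class and a renderer options class.
@dataclass(frozen=True)
class ToyPdfOptions:
    pages: tuple[int, ...] | None = None


@dataclass(frozen=True)
class ToyMarkdownOptions:
    emphasis_symbol: str = "*"
    flavor: str = "gfm"


def split_kwargs(parser_opts, renderer_opts, **kwargs: Any):
    """Route each kwarg to the options object that declares a field of that name."""
    parser_names = {f.name for f in fields(parser_opts)}
    renderer_names = {f.name for f in fields(renderer_opts)}
    unknown = set(kwargs) - parser_names - renderer_names
    if unknown:
        raise TypeError(f"unknown options: {sorted(unknown)}")
    parser_kw = {k: v for k, v in kwargs.items() if k in parser_names}
    renderer_kw = {k: v for k, v in kwargs.items() if k in renderer_names}
    # replace() builds new frozen instances with the overrides applied.
    return replace(parser_opts, **parser_kw), replace(renderer_opts, **renderer_kw)


parser, renderer = split_kwargs(
    ToyPdfOptions(), ToyMarkdownOptions(), pages=(0, 1), emphasis_symbol="_"
)
```

In this sketch the keyword values override the matching fields of the pre-configured options objects, which mirrors the override behaviour described for kwargs.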
all2md.to_ast(source: str | Path | IO[bytes] | bytes, *, parser_options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) Document

Convert document to AST (Abstract Syntax Tree) format.

This function provides advanced users with direct access to the document AST, enabling custom processing, transformation, and analysis of document structure. The AST can be manipulated using utilities from all2md.ast.transforms and serialized to JSON using all2md.ast.serialization.

Parameters:
  • source (str, Path, IO[bytes], or bytes) – Source document data, which can be a file path, a file-like object, or raw bytes.

  • parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).

  • source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.

  • progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.

  • remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).

  • kwargs (Any) – Individual parser options that override settings in parser_options.

Returns:

AST Document node representing the document structure

Return type:

Document

Examples

Get AST from a document:

>>> from all2md import to_ast
>>> ast_doc = to_ast("document.pdf")

Manipulate AST and convert to markdown:

>>> from all2md.ast import Image, transforms
>>> from all2md.renderers.markdown import MarkdownRenderer
>>> ast_doc = to_ast("document.pdf")
>>> filtered_doc = transforms.filter_nodes(ast_doc, lambda n: not isinstance(n, Image))
>>> renderer = MarkdownRenderer()
>>> markdown = renderer.render_to_string(filtered_doc)

Extract specific nodes:

>>> from all2md.ast import transforms, Heading
>>> ast_doc = to_ast("document.docx")
>>> headings = transforms.extract_nodes(ast_doc, Heading)

Serialize to JSON:

>>> from all2md.ast import serialization
>>> ast_doc = to_ast("document.html")
>>> json_str = serialization.ast_to_json(ast_doc, indent=2)
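
To make the filter and extract operations above more concrete, here is a toy tree walk that runs without all2md installed. ToyNode, toy_extract, and toy_filter are hypothetical names invented for this sketch; they are not the library's node classes or its transforms utilities.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical minimal node type; the real AST has typed node classes.
@dataclass
class ToyNode:
    kind: str
    children: list[ToyNode] = field(default_factory=list)


def toy_extract(node, kind):
    """Collect every node of the given kind, depth first."""
    found = [node] if node.kind == kind else []
    for child in node.children:
        found.extend(toy_extract(child, kind))
    return found


def toy_filter(node, keep):
    """Return a copy of the tree without descendants rejected by keep()."""
    kept = [toy_filter(child, keep) for child in node.children if keep(child)]
    return ToyNode(node.kind, kept)


doc = ToyNode("document", [
    ToyNode("heading"),
    ToyNode("paragraph", [ToyNode("text"), ToyNode("image")]),
    ToyNode("heading"),
])
headings = toy_extract(doc, "heading")
no_images = toy_filter(doc, lambda n: n.kind != "image")
```

The filter builds a new tree and leaves the input untouched, so the original document stays available for other processing.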
all2md.from_ast(ast_doc: Document, target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes

Render AST document to a target format.

Parameters:
  • ast_doc (Document) – AST Document node to render

  • target_format (DocumentFormat) – Target format name (e.g., “markdown”, “docx”, “pdf”)

  • output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content directly. Can be:

    • None: Returns str (for text formats) or bytes (for binary formats)

    • str or Path: Writes content to file at that path

    • IO[bytes]: Writes content to binary file-like object

    • IO[str]: Writes content to text file-like object

  • renderer_options (BaseRendererOptions, optional) – Renderer options for the target format

  • transforms (list, optional) – AST transforms to apply before rendering

  • hooks (dict, optional) – Transform hooks to execute during processing

  • progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.

  • preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path (populated by to_ast for file-based inputs) as the rendering template and clear its body before rendering. This preserves page setup, theme, headers/footers, and custom style definitions from the original document on a docx round-trip. Ignored if no source path is stashed or the caller already specified a template_path.

  • kwargs (Any) – Additional renderer options that override renderer_options

Returns:

  • None if output was specified (content written to output)

  • str if output=None and format is text-based (markdown, html, rst, etc.)

  • bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_ast(doc, "markdown", output=buffer)  # Returns None, buffer populated
>>> markdown_text = buffer.getvalue()

Examples

Render AST to string (text formats):

>>> ast_doc = to_ast("document.pdf")
>>> markdown_text = from_ast(ast_doc, "markdown")
>>> isinstance(markdown_text, str)
True

Render AST to bytes (binary formats):

>>> pdf_bytes = from_ast(ast_doc, "pdf")
>>> isinstance(pdf_bytes, bytes)
True

Render AST to file:

>>> from_ast(ast_doc, "markdown", output="output.md")

With renderer options:

>>> md_opts = MarkdownRendererOptions(flavor="commonmark")
>>> markdown_text = from_ast(ast_doc, "markdown", renderer_options=md_opts)
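
The return-value rule for the output parameter (content when output is None, otherwise None after writing) could be sketched as below. The helper deliver is hypothetical and exists only for illustration; it is not part of all2md and says nothing about how the library writes files internally.

```python
from io import BytesIO, StringIO
from pathlib import Path


def deliver(content, output=None):
    """Hypothetical helper: return content if output is None, else write it and return None."""
    if output is None:
        return content
    if isinstance(output, (str, Path)):
        path = Path(output)
        if isinstance(content, bytes):
            path.write_bytes(content)
        else:
            path.write_text(content, encoding="utf-8")
        return None
    # Anything else is treated as a file-like object opened in the matching mode.
    output.write(content)
    return None


text_buffer = StringIO()
returned = deliver("# Title\n", text_buffer)   # text content into a text buffer

binary_buffer = BytesIO()
deliver(b"%PDF-", binary_buffer)               # binary content into a binary buffer

direct = deliver("# Title\n")                  # no output given: content comes back
```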
all2md.from_markdown(source: str | Path | IO[bytes] | IO[str], target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: MarkdownParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes

Convert Markdown content to another format.

Parameters:
  • source (str, Path, IO[bytes], or IO[str]) – Markdown source content as string, file path, or file-like object

  • target_format (DocumentFormat) – Target format name (e.g., “docx”, “pdf”, “html”)

  • output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:

    • None: Returns str (for text formats) or bytes (for binary formats)

    • str or Path: Writes content to file at that path

    • IO[bytes]: Writes content to binary file-like object

    • IO[str]: Writes content to text file-like object

  • parser_options (MarkdownParserOptions, optional) – Options for parsing Markdown

  • renderer_options (BaseRendererOptions, optional) – Options for rendering to target format

  • transforms (list, optional) – AST transforms to apply

  • hooks (dict, optional) – Transform hooks to execute

  • progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.

  • preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path as a rendering template and clear its body. This is rarely useful for Markdown input: if the Markdown was originally derived from a docx file whose path is still available, pass template_path explicitly instead. See from_ast for details.

  • kwargs (Any) – Additional options split between parser and renderer

Returns:

  • None if output was specified (content written to output)

  • str if output=None and format is text-based (html, rst, etc.)

  • bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_markdown("# Title", "html", output=buffer)  # Returns None
>>> html_text = buffer.getvalue()

Examples

Convert markdown string to HTML:

>>> html_text = from_markdown("# Title\n\nContent", "html")
>>> isinstance(html_text, str)
True

Convert markdown to binary format:

>>> pdf_bytes = from_markdown("# Title", "pdf")
>>> isinstance(pdf_bytes, bytes)
True

Convert markdown file to DOCX file:

>>> from_markdown("input.md", "docx", output="output.docx")

With options:

>>> html_content = from_markdown("input.md", "html",
...     parser_options=MarkdownParserOptions(flavor="gfm"),
...     renderer_options=HtmlRendererOptions(...))
all2md.convert(source: str | Path | IO[bytes] | IO[str] | bytes, output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: BaseParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', transforms: list | None = None, hooks: dict | None = None, renderer: str | type | object | None = None, flavor: str | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes

Convert between document formats.

Parameters:
  • source (str, Path, IO[bytes], IO[str], or bytes) – Source document (file path, file-like object, or content)

  • output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:

    • None: Returns str (for text formats) or bytes (for binary formats)

    • str or Path: Writes content to file at that path

    • IO[bytes]: Writes content to binary file-like object

    • IO[str]: Writes content to text file-like object

  • parser_options (BaseParserOptions, optional) – Options for parsing source format

  • renderer_options (BaseRendererOptions, optional) – Options for rendering target format

  • source_format (DocumentFormat, default "auto") – Source format (auto-detected if “auto”)

  • target_format (DocumentFormat, default "auto") – Target format (inferred from output or defaults to “markdown”)

  • transforms (list, optional) – AST transforms to apply

  • hooks (dict, optional) – Transform hooks to execute

  • renderer (str, type, or object, optional) – Custom renderer (overrides target_format)

  • flavor (str, optional) – Markdown flavor shorthand for renderer_options

  • progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.

  • remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).

  • preserve_formatting (bool, default False) – When True and the target is "docx" and the source is a docx file, the rendered output uses the source as its template and the source’s body is cleared before rendering. This makes a docx round-trip (e.g. convert("in.docx", "out.docx")) preserve page setup, theme, headers/footers, and custom paragraph styles instead of regenerating a generic-looking document.

  • kwargs (Any) – Additional options split between parser and renderer

Returns:

  • None if output was specified (content written to output)

  • str if output=None and format is text-based (markdown, html, rst, etc.)

  • bytes if output=None and format is binary (docx, pdf, epub, etc.)

Return type:

None, str, or bytes

Notes

If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:

>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> convert("doc.pdf", output=buffer, target_format="markdown")  # Returns None
>>> markdown_text = buffer.getvalue()

Examples

Convert PDF to markdown:

>>> markdown_text = convert("doc.pdf", target_format="markdown")
>>> isinstance(markdown_text, str)
True

Convert to binary format:

>>> pdf_bytes = convert("input.md", target_format="pdf")
>>> isinstance(pdf_bytes, bytes)
True

Convert with output file:

>>> convert("doc.pdf", "output.md",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="commonmark"))

Bidirectional with transforms:

>>> convert("input.docx", "output.md",
...     transforms=["remove-images", "heading-offset"])
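
The target_format description says an "auto" target is inferred from the output or falls back to "markdown". One way such inference might work is a lookup on the output file extension. The table and the function infer_target_format below are assumptions made for illustration; the library's real mapping and fallback rules are not shown on this page.

```python
from pathlib import Path

# Hypothetical extension table, chosen for the sketch only.
EXTENSION_TO_FORMAT = {
    ".md": "markdown",
    ".html": "html",
    ".docx": "docx",
    ".pdf": "pdf",
    ".rst": "rst",
}


def infer_target_format(output, target_format="auto"):
    """Pick a target format: explicit value first, then extension, then markdown."""
    if target_format != "auto":
        return target_format
    if isinstance(output, (str, Path)):
        suffix = Path(output).suffix.lower()
        # Assumption: unknown extensions fall back to markdown.
        return EXTENSION_TO_FORMAT.get(suffix, "markdown")
    # No path to inspect (None or a file-like object).
    return "markdown"
```

An explicit target_format always wins in this sketch, so passing a file-like object as output together with target_format="html" still yields "html".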
class all2md.ProgressEvent

Bases: object

Progress event for document conversion operations.

This class represents a single progress event emitted during document conversion. Events use a canonical set of event types with documented semantics to ensure predictable external integration and consistent progress reporting.

Parameters:
  • event_type (EventType) –

    Type of progress event (canonical types):

    • “started”: Conversion/parsing has begun

      Use at the start of any conversion operation. Set total to expected number of items if known.

    • “item_done”: A discrete unit has been completed

      Generic event for any completed unit: page, section, file, stage, etc. Use metadata[“item_type”] to specify what was completed. Examples: page, slide, section, tokenization, preamble, structure

    • “detected”: Something discovered during parsing

      Use when finding notable structures during parsing. Use metadata[“detected_type”] to specify what was detected. Examples: table, image, chart, heading, reference

    • “finished”: Conversion/parsing completed successfully

      Use at the end of successful conversion. Set current=total to indicate completion.

    • “error”: An error occurred during conversion

      Use when errors occur. Include details in metadata[“error”]. Conversion may continue after errors for partial results.

  • message (str) – Human-readable description of the event

  • current (int, default 0) – Current progress position (e.g., current page number, items completed)

  • total (int, default 0) – Total items to process (e.g., total pages). Set to 0 if unknown.

  • metadata (dict, default empty) –

    Additional event-specific information:

    • For “started”: Optional context about the operation

    • For “item_done”: {“item_type”: str} - type of item completed

    • For “detected”: {“detected_type”: str, additional context}

    • For “error”: {“error”: str, “stage”: str, additional context}

Examples

Started event:

>>> event = ProgressEvent("started", "Converting document.pdf", current=0, total=10)

Item completed (page):

>>> event = ProgressEvent(
...     "item_done",
...     "Page 3 of 10",
...     current=3,
...     total=10,
...     metadata={"item_type": "page"}
... )

Item completed (parsing stage):

>>> event = ProgressEvent(
...     "item_done",
...     "Tokenization complete",
...     current=30,
...     total=100,
...     metadata={"item_type": "tokenization"}
... )

Structure detected:

>>> event = ProgressEvent(
...     "detected",
...     "Found 2 tables on page 5",
...     current=5,
...     total=10,
...     metadata={"detected_type": "table", "table_count": 2, "page": 5}
... )

Error:

>>> event = ProgressEvent(
...     "error",
...     "Failed to parse page 7",
...     current=7,
...     total=10,
...     metadata={"error": "Invalid PDF structure", "stage": "page_parsing", "page": 7}
... )

Finished:

>>> event = ProgressEvent("finished", "Conversion complete", current=10, total=10)

Notes

Legacy event types (“page_done”, “table_detected”, “tokenization_done”, etc.) are deprecated in favor of canonical types with metadata. Parsers should migrate:

  • “page_done” -> “item_done” with metadata={"item_type": "page"}

  • “table_detected” -> “detected” with metadata={"detected_type": "table"}

  • “tokenization_done” -> “item_done” with metadata={"item_type": "tokenization"}
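
The migration in the note above amounts to a small lookup table. The sketch below encodes the three documented mappings; LEGACY_EVENT_MAP and canonicalize are hypothetical names, not part of all2md.

```python
# The three legacy-to-canonical mappings listed in the note above.
LEGACY_EVENT_MAP = {
    "page_done": ("item_done", {"item_type": "page"}),
    "table_detected": ("detected", {"detected_type": "table"}),
    "tokenization_done": ("item_done", {"item_type": "tokenization"}),
}


def canonicalize(event_type, metadata=None):
    """Translate a legacy event type into (canonical_type, metadata)."""
    merged = dict(metadata or {})
    if event_type in LEGACY_EVENT_MAP:
        canonical, extra = LEGACY_EVENT_MAP[event_type]
        merged.update(extra)
        return canonical, merged
    # Canonical (or unrecognized) types pass through unchanged.
    return event_type, merged
```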

event_type: Literal['started', 'item_done', 'detected', 'finished', 'error']
message: str
current: int = 0
total: int = 0
metadata: dict[str, Any]
__init__(event_type: ~typing.Literal['started', 'item_done', 'detected', 'finished', 'error'], message: str, current: int = 0, total: int = 0, metadata: dict[str, ~typing.Any] = <factory>) None
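
A progress callback only needs the documented fields of the event it receives. The sketch below uses a stand-in dataclass, ToyProgressEvent, with the same field names so that it runs without all2md installed; format_progress and on_progress are likewise hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


# Stand-in mirroring the documented ProgressEvent fields (not the real class).
@dataclass
class ToyProgressEvent:
    event_type: str
    message: str
    current: int = 0
    total: int = 0
    metadata: dict[str, Any] = field(default_factory=dict)


def format_progress(event):
    """Build a one-line status string; total == 0 means the total is unknown."""
    if event.event_type == "error":
        detail = event.metadata.get("error", "unknown")
        return f"ERROR: {event.message} ({detail})"
    if event.total > 0:
        percent = 100 * event.current // event.total
        return f"[{percent:3d}%] {event.message}"
    return f"[ ?? ] {event.message}"


lines = []


def on_progress(event):
    """Callback shape: takes one event, returns nothing."""
    lines.append(format_progress(event))


on_progress(ToyProgressEvent("item_done", "Page 3 of 10", current=3, total=10,
                             metadata={"item_type": "page"}))
```

A function with the shape of on_progress is what the progress_callback parameter of the conversion functions expects, receiving real ProgressEvent objects instead of the stand-in.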
class all2md.BaseRendererOptions

Bases: CloneFrozenMixin

Base class for all renderer options.

This class serves as the foundation for format-specific renderer options. Renderers convert AST documents into various output formats (Markdown, DOCX, PDF, etc.).

Parameters:
  • fail_on_resource_errors (bool, default False) – Whether to raise RenderingError when resource loading fails (e.g., images). If False (default), warnings are logged but rendering continues. If True, rendering stops immediately on resource errors.

  • max_asset_size_bytes (int, default 52428800) – Maximum allowed size in bytes for any single asset (images, downloads, etc.). The default is 50 MiB.

Notes

Subclasses should define format-specific rendering options as frozen dataclass fields.

fail_on_resource_errors: bool = False
max_asset_size_bytes: int = 52428800
metadata_policy: MetadataRenderPolicy
creator: str | None = 'all2md'
__init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md') None
class all2md.BaseParserOptions

Bases: CloneFrozenMixin

Base class for all parser options.

This class serves as the foundation for format-specific parser options. Parsers convert source documents into AST representation.

For parsers that handle attachments (images, downloads, etc.), also inherit from AttachmentOptionsMixin to get attachment-related configuration fields.

Parameters:

extract_metadata (bool, default False) – Whether to extract document metadata

Notes

Subclasses should define format-specific parsing options as frozen dataclass fields.

For parsers handling binary assets (PDF, DOCX, HTML, etc.), also inherit from AttachmentOptionsMixin:

@dataclass(frozen=True)
class PdfOptions(BaseParserOptions, AttachmentOptionsMixin):
    pass
extract_metadata: bool = False
__init__(extract_metadata: bool = False) None
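
Because options are frozen dataclasses, a changed setting means a new instance rather than an in-place edit. The sketch below shows this with the standard library's dataclasses.replace on hypothetical stand-in classes (ToyParserOptions, ToyPdfOptions). Whether CloneFrozenMixin offers its own helper for this is not stated on this page, so none is assumed.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Hypothetical stand-ins for a base options class and a format-specific subclass.
@dataclass(frozen=True)
class ToyParserOptions:
    extract_metadata: bool = False


@dataclass(frozen=True)
class ToyPdfOptions(ToyParserOptions):
    pages: tuple = ()


base = ToyPdfOptions(pages=(0, 1))

# replace() copies the instance and overrides only the named fields.
with_metadata = replace(base, extract_metadata=True)

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.extract_metadata = True
    mutated = True
except FrozenInstanceError:
    mutated = False
```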
class all2md.NetworkFetchOptions

Bases: CloneFrozenMixin

Network security options for remote resource fetching.

This dataclass contains settings that control how remote resources (images, CSS, etc.) are fetched, including security constraints to prevent SSRF attacks.

Parameters:
  • allow_remote_fetch (bool, default False) – Whether to allow fetching remote URLs for images and other resources. When False, prevents SSRF attacks by blocking all network requests.

  • allowed_hosts (list[str] | None, default None) – List of allowed hostnames or CIDR blocks for remote fetching. If None and allow_remote_fetch=True, all hosts are allowed, which may pose an SSRF (Server-Side Request Forgery) risk. A security warning will be logged. In security-sensitive contexts, explicitly set this to an allowlist of trusted hosts.

  • require_https (bool, default True) – Whether to require HTTPS for all remote URL fetching.

  • network_timeout (float, default 10.0) – Timeout in seconds for remote URL fetching.

  • max_requests_per_second (float, default 10.0) – Maximum number of network requests per second (rate limiting).

  • max_concurrent_requests (int, default 5) – Maximum number of concurrent network requests.

Notes

Asset size limits are inherited from BaseParserOptions.max_asset_size_bytes.

allow_remote_fetch: bool = False
allowed_hosts: list[str] | None = None
require_https: bool = True
require_head_success: bool = True
network_timeout: float = 10.0
max_redirects: int = 5
allowed_content_types: tuple[str, ...] | None = ('image/',)
max_requests_per_second: float = 10.0
max_concurrent_requests: int = 5
__init__(allow_remote_fetch: bool = False, allowed_hosts: list[str] | None = None, require_https: bool = True, require_head_success: bool = True, network_timeout: float = 10.0, max_redirects: int = 5, allowed_content_types: tuple[str, ...] | None = ('image/',), max_requests_per_second: float = 10.0, max_concurrent_requests: int = 5) None
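
To illustrate how an allowlist of hostnames or CIDR blocks combined with an HTTPS requirement might gate a URL, here is a standard-library sketch. The function is_url_allowed is hypothetical and is not how all2md performs its checks. In particular it does no DNS resolution, so a hostname that resolves to a private address would pass a hostname match.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url, allowed_hosts=None, require_https=True):
    """Check scheme and host against an allowlist of hostnames or CIDR blocks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if require_https and parts.scheme != "https":
        return False
    host = parts.hostname  # already lowercased by urlsplit
    if host is None:
        return False
    if allowed_hosts is None:
        return True  # no allowlist: every host passes (the SSRF risk noted above)
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # host is a name, not an IP literal
    for entry in allowed_hosts:
        if address is not None:
            try:
                if address in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                continue  # entry is a hostname, not a CIDR block
        elif host == entry.lower():
            return True
    return False
```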
class all2md.LocalFileAccessOptions

Bases: CloneFrozenMixin

Local file access security options.

This dataclass contains settings that control access to local files via file:// URLs and similar mechanisms.

Parameters:
  • allow_local_files (bool, default False) – Whether to allow access to local files via file:// URLs.

  • local_file_allowlist (list[str] | None, default None) – List of directories allowed for local file access. Only applies when allow_local_files=True.

  • local_file_denylist (list[str] | None, default None) – List of directories denied for local file access.

  • allow_cwd_files (bool, default False) – Whether to allow local files from current working directory and subdirectories.

allow_local_files: bool = False
local_file_allowlist: list[str] | None = None
local_file_denylist: list[str] | None = None
allow_cwd_files: bool = False
__init__(allow_local_files: bool = False, local_file_allowlist: list[str] | None = None, local_file_denylist: list[str] | None = None, allow_cwd_files: bool = False) None
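
A directory allowlist and denylist could be combined roughly as follows. The function is_local_path_allowed is hypothetical, and the precedence used here (denylist wins over allowlist) is an assumption of the sketch, not documented behaviour. Paths are resolved first so that ".." segments cannot escape an allowed directory.

```python
from pathlib import Path


def is_local_path_allowed(path, allow_local_files=False, allowlist=None, denylist=None):
    """Decide whether a local path may be read under the given lists."""
    if not allow_local_files:
        return False
    resolved = Path(path).resolve()

    def inside(directory):
        # True when resolved is the directory itself or lies beneath it.
        return resolved.is_relative_to(Path(directory).resolve())

    # Assumption: a denylist match always blocks access.
    if denylist and any(inside(d) for d in denylist):
        return False
    if allowlist is not None:
        return any(inside(d) for d in allowlist)
    return True
```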
class all2md.HtmlRendererOptions

Bases: BaseRendererOptions

Configuration options for rendering AST to HTML format.

This dataclass contains settings specific to HTML generation, including document structure, styling, templating, and feature toggles.

Parameters:
  • standalone (bool, default True) – Generate complete HTML document with <html>, <head>, <body> tags. If False, generates only the content fragment. Ignored when template_mode is not None.

  • css_style ({"inline", "embedded", "external", "none"}, default "embedded") – How to include CSS styles:

    • “inline”: Add style attributes to elements

    • “embedded”: Include <style> block in <head>

    • “external”: Reference external CSS file

    • “none”: No styling

  • css_file (str or None, default None) – Path to external CSS file (used when css_style="external").

  • include_toc (bool, default False) – Generate table of contents from headings.

  • syntax_highlighting (bool, default True) – Add language classes to code blocks for syntax highlighting.

  • escape_html (bool, default True) – Escape HTML special characters in text content.

  • math_renderer ({"mathjax", "katex", "none"}, default "mathjax") – Math rendering library to use for MathML/LaTeX math:

    • “mathjax”: Include MathJax CDN script

    • “katex”: Include KaTeX CDN script

    • “none”: Render math as plain text

  • html_passthrough_mode ({"pass-through", "escape", "drop", "sanitize"}, default "pass-through") – How to handle HTMLBlock and HTMLInline nodes:

    • “pass-through”: Pass through unchanged (use only with trusted content)

    • “escape”: HTML-escape the content

    • “drop”: Remove HTML content entirely

    • “sanitize”: Remove dangerous elements/attributes (requires bleach for best results)

  • language (str, default "en") – Document language code (ISO 639-1) for the <html lang="…"> attribute. Can be overridden by document metadata.

  • template_mode ({"inject", "replace", "jinja"} or None, default None) – Template mode for rendering HTML:

    • None: Use standalone mode (default behavior)

    • “inject”: Inject content into existing HTML file at selector

    • “replace”: Replace placeholders in template file

    • “jinja”: Use Jinja2 template engine with full context

    When set, standalone is ignored.

  • template_file (str or None, default None) – Path to template file (required when template_mode is not None).

  • template_selector (str, default "#content") – CSS selector for the injection target (used with template_mode="inject").

  • toc_selector (str or None, default None) – CSS selector for a separate TOC injection point (used with template_mode="inject"). If not set, the TOC is included with the content at template_selector. Allows placing the TOC in a different location, such as a sidebar or header.

  • injection_mode ({"append", "prepend", "replace"}, default "replace") – How to inject content at the selector (used with template_mode="inject"):
      - "append": Add content after existing content
      - "prepend": Add content before existing content
      - "replace": Replace existing content

  • content_placeholder (str, default "{CONTENT}") – Placeholder string to replace with content (used with template_mode="replace").

  • css_class_map (dict[str, str | list[str]] or None, default None) – Map AST node type names to custom CSS classes. Example: {"Heading": "article-heading", "CodeBlock": ["code", "highlight"]}

  • allow_remote_scripts (bool, default False) – Allow loading remote scripts (e.g., MathJax/KaTeX from a CDN). Defaults to False for security, so CDN usage requires explicit opt-in. When False and math_renderer is not "none", a warning is issued.

  • csp_enabled (bool, default True) – Add a Content-Security-Policy meta tag to standalone HTML documents. Helps prevent XSS attacks by restricting resource loading.

  • csp_policy (str or None, default (secure policy)) – Custom Content-Security-Policy header value. If None, uses the default: "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"

  • comment_mode ({"native", "visible", "ignore"}, default "native") – How to render Comment and CommentInline AST nodes:
      - "native": Render as HTML comments (<!-- Comment by Author: text -->)
      - "visible": Render as visible <div>/<span> elements with class="comment" and metadata in data attributes
      - "ignore": Skip comment nodes entirely
    This controls presentation of comments from DOCX reviewer comments, source HTML comments, and other format-specific annotations.
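
The html_passthrough_mode values described above can be pictured with a small standard-library sketch. The helper handle_raw_html is hypothetical and is not part of all2md; it models only the "pass-through", "escape", and "drop" behaviors, and leaves out "sanitize" because that mode depends on a sanitizer such as bleach.

```python
import html


def handle_raw_html(raw: str, mode: str = "escape") -> str:
    """Hypothetical stand-in for three html_passthrough_mode values."""
    if mode == "pass-through":
        # Emitted unchanged, so only appropriate for trusted content.
        return raw
    if mode == "escape":
        # Special characters become entities; the markup shows as text.
        return html.escape(raw)
    if mode == "drop":
        # The raw HTML is removed from the output entirely.
        return ""
    raise ValueError(f"unsupported mode in this sketch: {mode!r}")


snippet = '<script>alert("x")</script>'
for mode in ("pass-through", "escape", "drop"):
    print(f"{mode}: {handle_raw_html(snippet, mode)!r}")
```

The sketch shows why "pass-through" is flagged for trusted content only: it is the one mode in which the script tag reaches the output intact.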

Examples

Inject into existing HTML:

>>> options = HtmlRendererOptions(
...     template_mode="inject",
...     template_file="layout.html",
...     template_selector="#main-content"
... )

Replace placeholders:

>>> options = HtmlRendererOptions(
...     template_mode="replace",
...     template_file="template.html",
...     content_placeholder="{CONTENT}"
... )

Use Jinja2 template:

>>> options = HtmlRendererOptions(
...     template_mode="jinja",
...     template_file="article.html"
... )

Custom CSS classes:

>>> options = HtmlRendererOptions(
...     css_class_map={"Heading": "prose-heading", "CodeBlock": "code-block"}
... )
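
To make the "replace" template mode and the three injection_mode values more concrete, here is a minimal sketch in plain Python. Both functions are hypothetical illustrations that work on strings; they are not all2md's implementation, which operates on a template file and a CSS selector.

```python
def fill_placeholder(template: str, content: str,
                     placeholder: str = "{CONTENT}") -> str:
    """Sketch of template_mode="replace": swap the placeholder for content."""
    return template.replace(placeholder, content)


def combine(existing: str, new: str, injection_mode: str = "replace") -> str:
    """Sketch of injection_mode: merge new content with a target's content."""
    if injection_mode == "append":
        return existing + new   # new content goes after what is there
    if injection_mode == "prepend":
        return new + existing   # new content goes before what is there
    if injection_mode == "replace":
        return new              # existing content is discarded
    raise ValueError(f"unknown injection_mode: {injection_mode!r}")


page = fill_placeholder("<main>{CONTENT}</main>", "<h1>Title</h1>")
print(page)
print(combine("<p>old</p>", "<p>new</p>", "append"))
```
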
standalone: bool = True
css_style: Literal['inline', 'embedded', 'external', 'none'] = 'embedded'
css_file: str | None = None
include_toc: bool = False
syntax_highlighting: bool = True
escape_html: bool = True
math_renderer: Literal['mathjax', 'katex', 'none'] = 'mathjax'
html_passthrough_mode: Literal['pass-through', 'escape', 'drop', 'sanitize'] = 'escape'
language: str = 'en'
template_mode: Literal['inject', 'replace', 'jinja'] | None = None
template_file: str | None = None
template_selector: str = '#content'
toc_selector: str | None = None
injection_mode: Literal['append', 'prepend', 'replace'] = 'replace'
content_placeholder: str = '{CONTENT}'
css_class_map: dict[str, str | list[str]] | None = None
allow_remote_scripts: bool = False
csp_enabled: bool = True
csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"
comment_mode: Literal['native', 'visible', 'ignore'] = 'native'
__init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md', standalone: bool = True, css_style: CssStyle = 'embedded', css_file: str | None = None, include_toc: bool = False, syntax_highlighting: bool = True, escape_html: bool = True, math_renderer: MathRenderer = 'mathjax', html_passthrough_mode: HtmlPassthroughMode = 'escape', language: str = 'en', template_mode: TemplateMode | None = None, template_file: str | None = None, template_selector: str = '#content', toc_selector: str | None = None, injection_mode: InjectionMode = 'replace', content_placeholder: str = '{CONTENT}', css_class_map: dict[str, str | list[str]] | None=None, allow_remote_scripts: bool = False, csp_enabled: bool = True, csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';", comment_mode: HtmlCommentMode = 'native') None
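
As a rough illustration of what csp_enabled and csp_policy control, the sketch below builds a Content-Security-Policy meta tag from a policy string. The function csp_meta_tag is hypothetical and the exact markup all2md emits may differ; only the default policy string is taken from the signature above.

```python
import html
from typing import Optional

# Default policy string as shown in the HtmlRendererOptions signature.
DEFAULT_POLICY = (
    "default-src 'self'; script-src 'self'; "
    "style-src 'self' 'unsafe-inline';"
)


def csp_meta_tag(policy: Optional[str] = None) -> str:
    """Hypothetical helper: render a CSP <meta> tag for a standalone page."""
    value = DEFAULT_POLICY if policy is None else policy
    # Escape the value so its quotes cannot terminate the attribute.
    escaped = html.escape(value, quote=True)
    return f'<meta http-equiv="Content-Security-Policy" content="{escaped}">'


print(csp_meta_tag())
print(csp_meta_tag("default-src 'none'"))
```
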
class all2md.HtmlOptions

Bases: BaseParserOptions, AttachmentOptionsMixin

Configuration options for HTML-to-Markdown conversion.

This dataclass contains settings specific to HTML document processing, including heading styles, title extraction, image handling, content sanitization, and advanced formatting options. Inherits attachment handling from AttachmentOptionsMixin for images and embedded media.

Parameters:
  • extract_title (bool, default False) – Whether to extract and use the HTML <title> element.

  • convert_nbsp (bool, default False) – Whether to convert non-breaking spaces (&nbsp;) to regular spaces in the output.

  • strip_dangerous_elements (bool, default False) – Whether to remove potentially dangerous HTML elements (script, style, etc.) and event handler attributes (onclick, onload, etc.).

  • strip_framework_attributes (bool, default False) – Whether to remove JavaScript framework attributes (Alpine.js x-, Vue.js v-, Angular ng-, HTMX hx-, etc.) that can execute code in framework contexts. Only needed if output HTML will be rendered in browsers with these frameworks.

  • detect_table_alignment (bool, default True) – Whether to automatically detect table column alignment from CSS/attributes.

  • preserve_nested_structure (bool, default True) – Whether to maintain proper nesting for blockquotes and other elements.

  • allowed_attributes (tuple[str, ...] | dict[str, tuple[str, ...]] | None, default None) – Whitelist of allowed HTML attributes. Supports two modes:
      - Global allowlist: tuple of attribute names applied to all elements
      - Per-element allowlist: dict mapping element names to tuples of allowed attributes
    Note: When using the CLI, pass complex dict structures as JSON strings for proper parsing.

  • base_url (str or None, default None) – Base URL for resolving relative hrefs in <a> tags. This is separate from attachment_base_url (used for images/assets). Allows precise control over navigational link URLs vs. resource URLs.
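
The effect of base_url on relative links can be approximated with urllib.parse.urljoin from the standard library. This is a sketch of the general technique rather than all2md's own code; resolve_href is a hypothetical helper and the URLs are made-up examples.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_href(href: str, base_url: Optional[str]) -> str:
    """Hypothetical helper: resolve an <a href> against an optional base."""
    if base_url is None:
        return href  # no base configured, leave the link as written
    return urljoin(base_url, href)


base = "https://example.com/docs/guide/"
print(resolve_href("../api/index.html", base))
print(resolve_href("/about", base))
print(resolve_href("https://other.example.org/page", base))
print(resolve_href("intro.html", None))
```

Absolute URLs pass through untouched, so only relative and root-relative hrefs are affected by the base.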

Examples

Convert and extract page title:

>>> from all2md import HtmlOptions
>>> options = HtmlOptions(extract_title=True)

Convert with content sanitization:

>>> options = HtmlOptions(strip_dangerous_elements=True, convert_nbsp=True)

Use global attribute allowlist:

>>> options = HtmlOptions(allowed_attributes=('class', 'id', 'href', 'src'))

Use per-element attribute allowlist:

>>> options = HtmlOptions(allowed_attributes={
...     'img': ('src', 'alt', 'title'),
...     'a': ('href', 'title'),
...     'div': ('class', 'id')
... })

Extract only the readable article content:

>>> options = HtmlOptions(extract_readable=True)
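
The two allowed_attributes modes shown above differ only in how the permitted names are looked up. The following sketch makes that difference explicit; filter_attributes is a hypothetical helper, and treating elements missing from a per-element dict as having no permitted attributes is an assumption of this sketch, not documented all2md behavior.

```python
from typing import Dict, Optional, Tuple, Union

Allowlist = Optional[Union[Tuple[str, ...], Dict[str, Tuple[str, ...]]]]


def filter_attributes(tag: str, attrs: Dict[str, str],
                      allowed: Allowlist) -> Dict[str, str]:
    """Hypothetical helper: keep only the attributes the allowlist permits."""
    if allowed is None:
        return dict(attrs)                # no allowlist, nothing is filtered
    if isinstance(allowed, dict):
        permitted = allowed.get(tag, ())  # per-element mode (assumed default: none)
    else:
        permitted = allowed               # global mode, same names for every tag
    return {name: value for name, value in attrs.items() if name in permitted}


img = {"src": "cat.png", "alt": "A cat", "onclick": "steal()"}
print(filter_attributes("img", img, ("src", "alt")))
print(filter_attributes("img", img, {"img": ("src",), "a": ("href",)}))
print(filter_attributes("span", {"class": "x"}, {"img": ("src",)}))
```
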
extract_title: bool = False
convert_nbsp: bool = False
strip_dangerous_elements: bool = False
strip_framework_attributes: bool = False
detect_table_alignment: bool = True
network: NetworkFetchOptions
local_files: LocalFileAccessOptions
strip_comments: bool = True
collapse_whitespace: bool = True
extract_readable: bool = False
br_handling: Literal['newline', 'space'] = 'newline'
allowed_elements: tuple[str, ...] | None = None
allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None = None
figures_parsing: Literal['blockquote', 'paragraph', 'image_with_caption', 'caption_only', 'html', 'skip'] = 'blockquote'
details_parsing: Literal['blockquote', 'paragraph', 'html', 'skip'] = 'blockquote'
extract_microdata: bool = True
base_url: str | None = None
html_parser: Literal['html.parser', 'html5lib', 'lxml'] = 'html.parser'
__init__(attachment_mode: AttachmentMode = 'alt_text', alt_text_mode: AltTextMode = 'default', attachment_output_dir: str | None = None, attachment_base_url: str | None = None, max_asset_size_bytes: int = 52428800, attachment_filename_template: str = '{stem}_{type}{seq}.{ext}', attachment_overwrite: AttachmentOverwriteMode = 'unique', attachment_deduplicate_by_hash: bool = False, attachments_footnotes_section: str | None = 'Attachments', extract_metadata: bool = False, extract_title: bool = False, convert_nbsp: bool = False, strip_dangerous_elements: bool = False, strip_framework_attributes: bool = False, detect_table_alignment: bool = True, network: NetworkFetchOptions = <factory>, local_files: LocalFileAccessOptions = <factory>, strip_comments: bool = True, collapse_whitespace: bool = True, extract_readable: bool = False, br_handling: BrHandling = 'newline', allowed_elements: tuple[str, ...] | None=None, allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None=None, figures_parsing: FiguresParsing = 'blockquote', details_parsing: DetailsParsing = 'blockquote', extract_microdata: bool = True, base_url: str | None = None, html_parser: HtmlParser = 'html.parser') None
exception all2md.DependencyError

Bases: All2MdError

Exception raised when required dependencies are not available.

This exception is raised when attempting to use a converter that requires external packages that are not installed or don’t meet version requirements.

Parameters:
  • converter_name (str) – Name of the converter requiring dependencies

  • missing_packages (list[tuple[str, str]]) – List of (package_name, version_spec) tuples for missing packages

  • version_mismatches (list[tuple[str, str, str]], optional) – List of (package_name, required_version, installed_version) tuples for packages with version mismatches

  • install_command (str, optional) – Suggested pip install command to resolve the issue

  • message (str, optional) – Custom error message. If not provided, generates a helpful message

Variables:
  • converter_name (str) – The converter that has missing dependencies

  • missing_packages (list[tuple[str, str]]) – Packages that need to be installed

  • version_mismatches (list[tuple[str, str, str]]) – Packages with version mismatches

  • install_command (str) – Command to install missing dependencies

__init__(converter_name: str, missing_packages: list[tuple[str, str]], version_mismatches: list[tuple[str, str, str]] | None = None, install_command: str = '', message: str | None = None, original_import_error: ImportError | None = None)

Initialize the dependency error with package details.
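
The missing_packages argument is a list of (package_name, version_spec) tuples. As an illustration of that shape, the sketch below turns such a list into a pip command of the kind install_command might hold. The helper suggest_install_command, the package names, and the version numbers are all hypothetical; all2md may build its suggestion differently.

```python
from typing import List, Tuple


def suggest_install_command(missing_packages: List[Tuple[str, str]]) -> str:
    """Hypothetical helper: format (name, version_spec) pairs for pip."""
    # Quote each requirement so shell characters such as ">" stay literal.
    requirements = [f'"{name}{spec}"' for name, spec in missing_packages]
    return "pip install " + " ".join(requirements)


# Illustrative values only, not real all2md requirements.
missing = [("pymupdf", ">=1.24.0"), ("python-docx", "")]
print(suggest_install_command(missing))
```
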

exception all2md.All2MdError

Bases: Exception

Base exception class for all all2md-specific errors.

This serves as the root exception class for all custom exceptions raised by the all2md library. Catching this will catch all library-specific errors.

Parameters:
  • message (str) – Human-readable description of the error

  • original_error (Exception, optional) – The original exception that caused this error, if applicable

Variables:
  • message (str) – The error message

  • original_error (Exception or None) – The wrapped original exception, if any

__init__(message: str, original_error: Exception | None = None)

Initialize the error with a message and optional original exception.

exception all2md.FormatError

Bases: All2MdError

Exception raised when attempting to process an unsupported file format.

This exception indicates that the requested file format or conversion operation is not supported by the current version of all2md.

Parameters:
  • message (str, optional) – Custom error message

  • format_type (str, optional) – The unsupported format type (file extension or MIME type)

  • supported_formats (list[str], optional) – List of supported formats for reference

  • original_error (Exception, optional) – The original exception that caused this error

Variables:
  • format_type (str or None) – The format that was not supported

  • supported_formats (list[str] or None) – Available supported formats

__init__(message: str | None = None, format_type: str | None = None, supported_formats: list[str] | None = None, original_error: Exception | None = None)

Initialize the format error.

exception all2md.ParsingError

Bases: All2MdError

Exception raised when document parsing fails.

This exception is raised when the parsing process encounters an error that prevents successful completion, such as:
  - Malformed document structure
  - Unsupported document features
  - Password-protected files

Parameters:
  • message (str) – Description of the parsing failure

  • parsing_stage (str, optional) – The stage of parsing where the error occurred

  • original_error (Exception, optional) – The underlying exception that caused the parsing failure

Variables:
  • parsing_stage (str or None) – Where in the parsing process the error occurred

__init__(message: str, parsing_stage: str | None = None, original_error: Exception | None = None)

Initialize the parsing error.
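
Because FormatError and ParsingError both derive from All2MdError, one except clause on the base class covers them all, and the subclass attributes remain available for more specific reporting. The sketch below uses simplified stand-in classes that mirror the documented constructor signatures so it runs without all2md installed. In real code the classes would be imported from all2md instead, and describe_failure is a hypothetical helper.

```python
# Stand-ins mirroring the documented signatures; not the library's classes.
class All2MdError(Exception):
    def __init__(self, message, original_error=None):
        super().__init__(message)
        self.message = message
        self.original_error = original_error


class FormatError(All2MdError):
    def __init__(self, message=None, format_type=None,
                 supported_formats=None, original_error=None):
        super().__init__(message or f"Unsupported format: {format_type}",
                         original_error)
        self.format_type = format_type
        self.supported_formats = supported_formats


class ParsingError(All2MdError):
    def __init__(self, message, parsing_stage=None, original_error=None):
        super().__init__(message, original_error)
        self.parsing_stage = parsing_stage


def describe_failure(exc: All2MdError) -> str:
    """Hypothetical helper: summarize an error using subclass attributes."""
    if isinstance(exc, ParsingError):
        return f"parsing failed at stage {exc.parsing_stage!r}: {exc.message}"
    if isinstance(exc, FormatError):
        return f"unsupported format {exc.format_type!r}"
    return f"conversion failed: {exc.message}"


try:
    raise ParsingError("file is password-protected", parsing_stage="open")
except All2MdError as exc:  # the base class catches every subclass
    summary = describe_failure(exc)

print(summary)
```
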

For organized API documentation, see the API Reference, which groups modules by functionality.