all2md
all2md - A Python document conversion library for bidirectional transformation.
all2md provides a comprehensive solution for converting between various file formats and Markdown. It supports PDF, Word (DOCX), PowerPoint (PPTX), HTML, email (EML), Excel (XLSX), Jupyter Notebooks (IPYNB), EPUB e-books, images, and 200+ text file formats with intelligent content extraction and formatting preservation.
The library uses a modular architecture where the main to_markdown() function automatically detects file types and routes to appropriate specialized parsers. Each converter module handles specific format requirements while maintaining consistent Markdown output with support for tables, images, and complex formatting.
Key Features
Advanced PDF parsing with table detection using PyMuPDF
Word document processing with formatting preservation
PowerPoint slide-by-slide extraction
HTML processing with configurable conversion options
Email chain parsing with attachment handling
Base64 image embedding support
Support for 200+ plaintext file formats
AST-based transformation pipeline for document manipulation
Plugin system for custom transforms via entry points
Supported Formats
Documents: PDF, DOCX, PPTX, HTML, EML, EPUB
Notebooks: IPYNB (Jupyter Notebooks)
Spreadsheets: XLSX, CSV, TSV
Images: PNG, JPEG, GIF (embedded as base64)
Text: 200+ formats including code files, configs, markup
Requirements
Python 3.10+
Optional dependencies loaded per format (PyMuPDF, python-docx, etc.)
Examples
Basic usage for file conversion:
>>> from all2md import to_markdown
>>> markdown_content = to_markdown('document.pdf')
>>> print(markdown_content)
Using AST transforms to manipulate documents:
>>> from all2md import to_markdown
>>> from all2md.transforms import RemoveImagesTransform, HeadingOffsetTransform
>>>
>>> # Apply transforms during conversion
>>> markdown = to_markdown(
...     'document.pdf',
...     transforms=[
...         RemoveImagesTransform(),
...         HeadingOffsetTransform(offset=1)
...     ]
... )
Working with the AST directly:
>>> from all2md import to_ast
>>> from all2md.transforms import render
>>>
>>> # Convert to AST
>>> doc = to_ast('document.pdf')
>>>
>>> # Apply transforms and render
>>> markdown = render(doc, transforms=['remove-images', 'heading-offset'])
See also
all2md.transforms: AST transformation system
all2md.ast: AST node definitions and utilities
- all2md.to_markdown(source: str | Path | IO[bytes] | bytes | Document, *, parser_options: BaseParserOptions | None = None, renderer_options: MarkdownRendererOptions | None = None, options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', flavor: str | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) str
Convert document to Markdown format with enhanced format detection.
This is the main entry point for the all2md library. It can detect file formats from filenames, content analysis, or explicit format specification, then routes to the appropriate specialized converter for processing.
- Parameters:
source (str, Path, IO[bytes|str], bytes, or Document) – Source document data, which can be a file path, a file-like object, raw bytes, or an AST Document object (for cases where you already have a parsed AST).
parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).
renderer_options (MarkdownRendererOptions, optional) – Pre-configured renderer options for Markdown rendering settings (e.g., emphasis_symbol, flavor).
options (BaseParserOptions, optional) – Deprecated alias for parser_options; use parser_options instead. Cannot be used together with parser_options.
source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.
flavor (str, optional) – Markdown flavor/dialect to use for output. Options: “gfm”, “commonmark”, “multimarkdown”, “pandoc”, “kramdown”, “markdown_plus”. Shorthand for renderer_options=MarkdownRendererOptions(flavor=…).
transforms (list, optional) – List of AST transforms to apply before rendering. Can be transform names (strings) or NodeTransformer instances. Transforms are applied in order. See all2md.transforms for available transforms.
hooks (dict, optional) – Transform hooks to execute during processing. Maps hook names to callable functions that execute at specific points in the transform pipeline.
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour (network allowlists, size limits, etc.). Defaults to None, which disables remote fetching.
kwargs (Any) – Individual conversion options. Kwargs are intelligently split between parser and renderer based on field names. Parser-related kwargs override fields in parser_options, renderer-related kwargs override fields in renderer_options.
- Returns:
Document content converted to Markdown format.
- Return type:
str
- Raises:
DependencyError – If required dependencies for a specific format are not installed.
ParsingError – If file processing fails due to corruption or format issues.
Examples
- Basic conversion:
>>> markdown = to_markdown("document.pdf")
- With parser options:
>>> pdf_opts = PdfOptions(pages=[0, 1, 2], attachment_mode="save")
>>> markdown = to_markdown("document.pdf", parser_options=pdf_opts)
- With renderer options:
>>> md_opts = MarkdownRendererOptions(emphasis_symbol="_", flavor="commonmark")
>>> markdown = to_markdown("document.pdf", renderer_options=md_opts)
- Using both parser and renderer options:
>>> markdown = to_markdown("doc.pdf",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="gfm"))
- Using kwargs (automatically split):
>>> markdown = to_markdown("doc.pdf", pages=[0, 1], emphasis_symbol="_")
- Using flavor shorthand:
>>> markdown = to_markdown("document.pdf", flavor="commonmark")
- With transforms:
>>> markdown = to_markdown("doc.pdf", transforms=["remove-images"])
- From AST Document:
>>> ast_doc = to_ast("document.pdf")
>>> # Apply custom processing to ast_doc...
>>> markdown = to_markdown(ast_doc)
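The kwargs description above says keyword arguments are split between parser and renderer based on field names. A minimal sketch of that idea follows, assuming the option classes are dataclasses. The names `DemoParserOptions`, `DemoRendererOptions`, and `split_kwargs` are hypothetical stand-ins, and rejecting unknown names is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Tuple


# Hypothetical stand-ins for the library's option classes; the real
# classes carry many more fields.
@dataclass(frozen=True)
class DemoParserOptions:
    pages: Optional[tuple] = None


@dataclass(frozen=True)
class DemoRendererOptions:
    emphasis_symbol: str = "*"
    flavor: str = "gfm"


def split_kwargs(kwargs: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Route each keyword argument by matching it against dataclass field names."""
    parser_names = {f.name for f in fields(DemoParserOptions)}
    renderer_names = {f.name for f in fields(DemoRendererOptions)}
    parser_kwargs: Dict[str, Any] = {}
    renderer_kwargs: Dict[str, Any] = {}
    for name, value in kwargs.items():
        if name in parser_names:
            parser_kwargs[name] = value
        elif name in renderer_names:
            renderer_kwargs[name] = value
        else:
            # Assumption: unknown names are rejected; the library may differ.
            raise TypeError(f"unknown option: {name!r}")
    return parser_kwargs, renderer_kwargs


parser_kwargs, renderer_kwargs = split_kwargs({"pages": (0, 1), "emphasis_symbol": "_"})
```

Here `pages` lands in the parser group and `emphasis_symbol` in the renderer group, mirroring the "Using kwargs" example above.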
- all2md.to_ast(source: str | Path | IO[bytes] | bytes, *, parser_options: BaseParserOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, **kwargs: Any) Document
Convert document to AST (Abstract Syntax Tree) format.
This function provides advanced users with direct access to the document AST, enabling custom processing, transformation, and analysis of document structure. The AST can be manipulated using utilities from all2md.ast.transforms and serialized to JSON using all2md.ast.serialization.
- Parameters:
source (str, Path, IO[bytes], or bytes) – Source document data, which can be a file path, a file-like object, or raw bytes.
parser_options (BaseParserOptions, optional) – Pre-configured parser options for format-specific parsing settings (e.g., PdfOptions, DocxOptions, HtmlOptions).
source_format (DocumentFormat, default "auto") – Explicitly specify the source document format. If “auto”, the format is detected from the filename or content.
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).
kwargs (Any) – Individual parser options that override settings in parser_options.
- Returns:
AST Document node representing the document structure
- Return type:
Document
- Raises:
FormatError – If the format cannot be detected or is unsupported
DependencyError – If required dependencies for the format are not installed
ParsingError – If conversion fails
Examples
- Get AST from a document:
>>> from all2md import to_ast
>>> ast_doc = to_ast("document.pdf")
- Manipulate AST and convert to markdown:
>>> from all2md.ast import Image, transforms
>>> from all2md.renderers.markdown import MarkdownRenderer
>>> ast_doc = to_ast("document.pdf")
>>> filtered_doc = transforms.filter_nodes(ast_doc, lambda n: not isinstance(n, Image))
>>> renderer = MarkdownRenderer()
>>> markdown = renderer.render_to_string(filtered_doc)
- Extract specific nodes:
>>> from all2md.ast import transforms, Heading
>>> ast_doc = to_ast("document.docx")
>>> headings = transforms.extract_nodes(ast_doc, Heading)
- Serialize to JSON:
>>> from all2md.ast import serialization
>>> ast_doc = to_ast("document.html")
>>> json_str = serialization.ast_to_json(ast_doc, indent=2)
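To make the filtering and extraction examples above concrete without the library installed, here is a toy sketch of both operations on a tiny tree. The node classes and the `toy_extract_nodes` and `toy_filter_nodes` helpers are illustrative assumptions and are not the implementations in all2md.ast.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List


# Toy node types for illustration only; the real definitions live in all2md.ast.
@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)


@dataclass
class Heading(Node):
    text: str = ""


@dataclass
class Image(Node):
    url: str = ""


def toy_extract_nodes(root: Node, node_type: type) -> List[Node]:
    """Collect every node of the given type, depth-first, in document order."""
    found = [root] if isinstance(root, node_type) else []
    for child in root.children:
        found.extend(toy_extract_nodes(child, node_type))
    return found


def toy_filter_nodes(root: Node, keep: Callable[[Node], bool]) -> Node:
    """Return a copy of the tree without the descendants that fail the predicate."""
    kept = [toy_filter_nodes(child, keep) for child in root.children if keep(child)]
    return replace(root, children=kept)


doc = Node(children=[
    Heading(text="Intro"),
    Image(url="a.png"),
    Node(children=[Heading(text="Details")]),
])
headings = toy_extract_nodes(doc, Heading)
without_images = toy_filter_nodes(doc, lambda n: not isinstance(n, Image))
```

The filter builds a new tree and leaves the original untouched, so the same parsed document can feed several differently filtered outputs.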
- all2md.from_ast(ast_doc: Document, target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes
Render AST document to a target format.
- Parameters:
ast_doc (Document) – AST Document node to render
target_format (DocumentFormat) – Target format name (e.g., “markdown”, “docx”, “pdf”)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content directly. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
renderer_options (BaseRendererOptions, optional) – Renderer options for the target format
transforms (list, optional) – AST transforms to apply before rendering
hooks (dict, optional) – Transform hooks to execute during processing
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path (populated by to_ast for file-based inputs) as the rendering template and clear its body before rendering. This preserves page setup, theme, headers/footers, and custom style definitions from the original document on a docx round-trip. Ignored if no source path is stashed or the caller already specified a template_path.
kwargs (Any) – Additional renderer options that override renderer_options
- Returns:
None if output was specified (content written to output)
str if output=None and format is text-based (markdown, html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)
- Return type:
None, str, or bytes
Notes
If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:
>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_ast(doc, "markdown", output=buffer)  # Returns None, buffer populated
>>> markdown_text = buffer.getvalue()
Examples
- Render AST to string (text formats):
>>> ast_doc = to_ast("document.pdf")
>>> markdown_text = from_ast(ast_doc, "markdown")
>>> isinstance(markdown_text, str)
True
- Render AST to bytes (binary formats):
>>> pdf_bytes = from_ast(ast_doc, "pdf")
>>> isinstance(pdf_bytes, bytes)
True
- Render AST to file:
>>> from_ast(ast_doc, "markdown", output="output.md")
- With renderer options:
>>> md_opts = MarkdownRendererOptions(flavor="commonmark")
>>> markdown_text = from_ast(ast_doc, "markdown", renderer_options=md_opts)
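The return contract described above (content returned when output is None, otherwise written and None returned) can be sketched as a small dispatch helper. This is an illustration of the documented behaviour only. The name `deliver` is hypothetical, and UTF-8 for text files is an assumption.

```python
import io
from pathlib import Path
from typing import IO, Optional, Union

Content = Union[str, bytes]
Output = Union[str, Path, IO[bytes], IO[str], None]


def deliver(content: Content, output: Output = None) -> Optional[Content]:
    """Return content when output is None; otherwise write it and return None."""
    if output is None:
        return content
    if isinstance(output, (str, Path)):
        path = Path(output)
        if isinstance(content, bytes):
            path.write_bytes(content)
        else:
            # Assumption: text formats are written as UTF-8.
            path.write_text(content, encoding="utf-8")
        return None
    # Anything else is treated as a file-like object opened in a matching mode.
    output.write(content)
    return None


buffer = io.StringIO()
returned = deliver("# Title\n", buffer)
direct = deliver("# Title\n")
```

After this runs, `returned` is None and the text sits in `buffer`, while `direct` holds the string itself.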
- all2md.from_markdown(source: str | Path | IO[bytes] | IO[str], target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'], output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: MarkdownParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, transforms: list | None = None, hooks: dict | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes
Convert Markdown content to another format.
- Parameters:
source (str, Path, IO[bytes], or IO[str]) – Markdown source content as string, file path, or file-like object
target_format (DocumentFormat) – Target format name (e.g., “docx”, “pdf”, “html”)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
parser_options (MarkdownParserOptions, optional) – Options for parsing Markdown
renderer_options (BaseRendererOptions, optional) – Options for rendering to target format
transforms (list, optional) – AST transforms to apply
hooks (dict, optional) – Transform hooks to execute
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
preserve_formatting (bool, default False) – When True and target_format is "docx", use the AST’s stashed source_path as a rendering template and clear its body. Only useful when the markdown source was originally derived from a docx file whose path is still available; in that case pass template_path explicitly instead. See from_ast for details.
kwargs (Any) – Additional options split between parser and renderer
- Returns:
None if output was specified (content written to output)
str if output=None and format is text-based (html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)
- Return type:
None, str, or bytes
Notes
If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:
>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> from_markdown("# Title", "html", output=buffer)  # Returns None
>>> html_text = buffer.getvalue()
Examples
- Convert markdown string to HTML:
>>> html_text = from_markdown("# Title\n\nContent", "html")
>>> isinstance(html_text, str)
True
- Convert markdown to binary format:
>>> pdf_bytes = from_markdown("# Title", "pdf")
>>> isinstance(pdf_bytes, bytes)
True
- Convert markdown file to DOCX file:
>>> from_markdown("input.md", "docx", output="output.docx")
- With options:
>>> html_content = from_markdown("input.md", "html",
...     parser_options=MarkdownParserOptions(flavor="gfm"),
...     renderer_options=HtmlRendererOptions(...))
- all2md.convert(source: str | Path | IO[bytes] | IO[str] | bytes, output: str | Path | IO[bytes] | IO[str] | None = None, *, parser_options: BaseParserOptions | None = None, renderer_options: BaseRendererOptions | None = None, source_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', target_format: Literal['auto', 'archive', 'asciidoc', 'ast', 'bbcode', 'chm', 'csv', 'docx', 'dokuwiki', 'eml', 'enex', 'epub', 'fb2', 'html', 'ini', 'ipynb', 'jinja', 'json', 'latex', 'markdown', 'mbox', 'mediawiki', 'mhtml', 'odp', 'ods', 'odt', 'openapi', 'org', 'outlook', 'pdf', 'plaintext', 'pptx', 'rst', 'rtf', 'sourcecode', 'textile', 'toml', 'webarchive', 'xlsx', 'yaml', 'zip'] = 'auto', transforms: list | None = None, hooks: dict | None = None, renderer: str | type | object | None = None, flavor: str | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, remote_input_options: RemoteInputOptions | None = None, preserve_formatting: bool = False, **kwargs: Any) None | str | bytes
Convert between document formats.
- Parameters:
source (str, Path, IO[bytes], IO[str], or bytes) – Source document (file path, file-like object, or content)
output (str, Path, IO[bytes], IO[str], or None, optional) – Output destination. If None, returns rendered content. Can be:
- None: Returns str (for text formats) or bytes (for binary formats)
- str or Path: Writes content to file at that path
- IO[bytes]: Writes content to binary file-like object
- IO[str]: Writes content to text file-like object
parser_options (BaseParserOptions, optional) – Options for parsing source format
renderer_options (BaseRendererOptions, optional) – Options for rendering target format
source_format (DocumentFormat, default "auto") – Source format (auto-detected if “auto”)
target_format (DocumentFormat, default "auto") – Target format (inferred from output or defaults to “markdown”)
transforms (list, optional) – AST transforms to apply
hooks (dict, optional) – Transform hooks to execute
renderer (str, type, or object, optional) – Custom renderer (overrides target_format)
flavor (str, optional) – Markdown flavor shorthand for renderer_options
progress_callback (ProgressCallback, optional) – Optional callback function for progress updates. Receives ProgressEvent objects with event_type, message, current/total counts, and metadata. See all2md.progress for details.
remote_input_options (RemoteInputOptions, optional) – Controls remote retrieval behaviour for the source input. Defaults to None (remote fetching disabled).
preserve_formatting (bool, default False) – When True, the target is "docx", and the source is a docx file, the rendered output uses the source as its template and the source’s body is cleared before rendering. This makes a docx round-trip (e.g. convert("in.docx", "out.docx")) preserve page setup, theme, headers/footers, and custom paragraph styles instead of regenerating a generic-looking document.
kwargs (Any) – Additional options split between parser and renderer
- Returns:
None if output was specified (content written to output)
str if output=None and format is text-based (markdown, html, rst, etc.)
bytes if output=None and format is binary (docx, pdf, epub, etc.)
- Return type:
None, str, or bytes
Notes
If you need a file-like object instead of direct content, pass a StringIO or BytesIO instance to the output parameter:
>>> from io import StringIO, BytesIO
>>> buffer = StringIO()
>>> convert("doc.pdf", output=buffer, target_format="markdown")  # Returns None
>>> markdown_text = buffer.getvalue()
Examples
- Convert PDF to markdown:
>>> markdown_text = convert("doc.pdf", target_format="markdown")
>>> isinstance(markdown_text, str)
True
- Convert to binary format:
>>> pdf_bytes = convert("input.md", target_format="pdf")
>>> isinstance(pdf_bytes, bytes)
True
- Convert with output file:
>>> convert("doc.pdf", "output.md",
...     parser_options=PdfOptions(pages=[0, 1]),
...     renderer_options=MarkdownRendererOptions(flavor="commonmark"))
- Bidirectional with transforms:
>>> convert("input.docx", "output.md",
...     transforms=["remove-images", "heading-offset"])
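The target_format parameter above is described as inferred from output, with “markdown” as the fallback. One plausible way to do that is a lookup on the output file extension, sketched below. The `EXTENSION_TO_FORMAT` table and the `infer_target_format` helper are hypothetical, and the library's actual inference rules may differ.

```python
from pathlib import Path

# Hypothetical extension table for illustration; the library's own mapping
# is not shown in this reference.
EXTENSION_TO_FORMAT = {
    ".md": "markdown",
    ".html": "html",
    ".docx": "docx",
    ".pdf": "pdf",
    ".rst": "rst",
}


def infer_target_format(output, target_format: str = "auto") -> str:
    """Pick a target format: explicit value first, then extension, then markdown."""
    if target_format != "auto":
        return target_format
    if isinstance(output, (str, Path)):
        suffix = Path(output).suffix.lower()
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
    # File-like objects and None carry no extension, so fall back to the default.
    return "markdown"


from_path = infer_target_format("report.DOCX")
from_nothing = infer_target_format(None)
explicit = infer_target_format("report.docx", "html")
```

An explicit `target_format` always wins in this sketch, which keeps the call predictable when the output name is misleading.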
- class all2md.ProgressEvent
Bases: object
Progress event for document conversion operations.
This class represents a single progress event emitted during document conversion. Events use a canonical set of event types with documented semantics to ensure predictable external integration and consistent progress reporting.
- Parameters:
event_type (EventType) –
Type of progress event (canonical types):
- “started”: Conversion/parsing has begun
Use at the start of any conversion operation. Set total to expected number of items if known.
- “item_done”: A discrete unit has been completed
Generic event for any completed unit: page, section, file, stage, etc. Use metadata["item_type"] to specify what was completed. Examples: page, slide, section, tokenization, preamble, structure
- “detected”: Something discovered during parsing
Use when finding notable structures during parsing. Use metadata["detected_type"] to specify what was detected. Examples: table, image, chart, heading, reference
- “finished”: Conversion/parsing completed successfully
Use at the end of successful conversion. Set current=total to indicate completion.
- “error”: An error occurred during conversion
Use when errors occur. Include details in metadata["error"]. Conversion may continue after errors for partial results.
message (str) – Human-readable description of the event
current (int, default 0) – Current progress position (e.g., current page number, items completed)
total (int, default 0) – Total items to process (e.g., total pages). Set to 0 if unknown.
metadata (dict, default empty) –
Additional event-specific information:
For “started”: Optional context about the operation
For “item_done”: {"item_type": str} - type of item completed
For “detected”: {"detected_type": str, additional context}
For “error”: {"error": str, "stage": str, additional context}
Examples
- Started event:
>>> event = ProgressEvent("started", "Converting document.pdf", current=0, total=10)
- Item completed (page):
>>> event = ProgressEvent(
...     "item_done",
...     "Page 3 of 10",
...     current=3,
...     total=10,
...     metadata={"item_type": "page"}
... )
- Item completed (parsing stage):
>>> event = ProgressEvent(
...     "item_done",
...     "Tokenization complete",
...     current=30,
...     total=100,
...     metadata={"item_type": "tokenization"}
... )
- Structure detected:
>>> event = ProgressEvent(
...     "detected",
...     "Found 2 tables on page 5",
...     current=5,
...     total=10,
...     metadata={"detected_type": "table", "table_count": 2, "page": 5}
... )
- Error:
>>> event = ProgressEvent(
...     "error",
...     "Failed to parse page 7",
...     current=7,
...     total=10,
...     metadata={"error": "Invalid PDF structure", "stage": "page_parsing", "page": 7}
... )
- Finished:
>>> event = ProgressEvent("finished", "Conversion complete", current=10, total=10)
Notes
Legacy event types (“page_done”, “table_detected”, “tokenization_done”, etc.) are deprecated in favor of canonical types with metadata. Parsers should migrate:
- “page_done” -> “item_done” with metadata={"item_type": "page"}
- “table_detected” -> “detected” with metadata={"detected_type": "table"}
- “tokenization_done” -> “item_done” with metadata={"item_type": "tokenization"}
- event_type: Literal['started', 'item_done', 'detected', 'finished', 'error']
- message: str
- current: int = 0
- total: int = 0
- metadata: dict[str, Any]
- __init__(event_type: Literal['started', 'item_done', 'detected', 'finished', 'error'], message: str, current: int = 0, total: int = 0, metadata: dict[str, Any] = <factory>) None
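A progress callback receives one event at a time, so a handler mostly needs to cope with the case where total is 0 (unknown). The sketch below uses a stand-in dataclass, `DemoProgressEvent`, that mirrors the documented fields so it runs without the library. The `format_progress` and `on_progress` names and the log-line layout are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Stand-in mirroring the documented fields so the sketch runs without all2md.
@dataclass
class DemoProgressEvent:
    event_type: str
    message: str
    current: int = 0
    total: int = 0
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_progress(event: DemoProgressEvent) -> str:
    """Turn one event into a log line, guarding against an unknown total of 0."""
    if event.event_type == "error":
        detail = event.metadata.get("error", "no details")
        return f"ERROR: {event.message} ({detail})"
    if event.total > 0:
        percent = 100 * event.current // event.total
        return f"[{percent:3d}%] {event.message}"
    return f"[ ?? ] {event.message}"


log: List[str] = []


def on_progress(event: DemoProgressEvent) -> None:
    """Callback with the same shape as progress_callback: one event in, nothing out."""
    log.append(format_progress(event))


on_progress(DemoProgressEvent("started", "Converting document.pdf", total=10))
on_progress(DemoProgressEvent("item_done", "Page 3 of 10", current=3, total=10,
                              metadata={"item_type": "page"}))
on_progress(DemoProgressEvent("detected", "Found a table"))
on_progress(DemoProgressEvent("error", "Failed to parse page 7",
                              metadata={"error": "Invalid PDF structure"}))
```

With the library itself, a function of this shape would be passed through the progress_callback parameter shown in the signatures above.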
- class all2md.BaseRendererOptions
Bases: CloneFrozenMixin
Base class for all renderer options.
This class serves as the foundation for format-specific renderer options. Renderers convert AST documents into various output formats (Markdown, DOCX, PDF, etc.).
- Parameters:
fail_on_resource_errors (bool, default=False) – Whether to raise RenderingError when resource loading fails (e.g., images). If False (default), warnings are logged but rendering continues. If True, rendering stops immediately on resource errors.
max_asset_size_bytes (int) – Maximum allowed size in bytes for any single asset (images, downloads, etc.)
Notes
Subclasses should define format-specific rendering options as frozen dataclass fields.
- fail_on_resource_errors: bool = False
- max_asset_size_bytes: int = 52428800
- metadata_policy: MetadataRenderPolicy
- creator: str | None = 'all2md'
- __init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md') None
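The two documented fields interact in a simple way: a resource problem is either logged and skipped, or raised, depending on fail_on_resource_errors. The sketch below illustrates that policy with stand-in classes. Treating an oversized asset as a resource error is an assumption made for the example, and `DemoRenderingError`, `DemoRendererOptions`, and `check_asset` are hypothetical names.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("demo.renderer")


class DemoRenderingError(Exception):
    """Stand-in for the library's RenderingError."""


@dataclass(frozen=True)
class DemoRendererOptions:
    fail_on_resource_errors: bool = False
    max_asset_size_bytes: int = 50 * 1024 * 1024  # 52428800, the documented default


def check_asset(data: bytes, options: DemoRendererOptions) -> bool:
    """Return True if the asset may be used; otherwise warn or raise per the options."""
    if len(data) <= options.max_asset_size_bytes:
        return True
    problem = f"asset of {len(data)} bytes exceeds limit of {options.max_asset_size_bytes}"
    if options.fail_on_resource_errors:
        raise DemoRenderingError(problem)
    # Lenient mode: log the problem and let rendering continue without the asset.
    logger.warning(problem)
    return False


lenient = DemoRendererOptions(max_asset_size_bytes=5)
strict = DemoRendererOptions(fail_on_resource_errors=True, max_asset_size_bytes=5)
skipped = check_asset(b"x" * 10, lenient)
```

The default of 52428800 bytes corresponds to 50 MiB.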
- class all2md.BaseParserOptions
Bases: CloneFrozenMixin
Base class for all parser options.
This class serves as the foundation for format-specific parser options. Parsers convert source documents into AST representation.
For parsers that handle attachments (images, downloads, etc.), also inherit from AttachmentOptionsMixin to get attachment-related configuration fields.
- Parameters:
extract_metadata (bool) – Whether to extract document metadata
Notes
Subclasses should define format-specific parsing options as frozen dataclass fields.
For parsers handling binary assets (PDF, DOCX, HTML, etc.), also inherit from AttachmentOptionsMixin:
@dataclass(frozen=True)
class PdfOptions(BaseParserOptions, AttachmentOptionsMixin):
    pass
- extract_metadata: bool = False
- __init__(extract_metadata: bool = False) None
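Because the option classes are frozen dataclasses, changing a setting means creating a modified copy instead of mutating in place. The sketch below shows the mixin pattern from the Notes with stand-in classes, using the standard-library `dataclasses.replace` as a substitute for whatever CloneFrozenMixin provides. The `Demo*` names are hypothetical, and the "skip" default for attachment_mode is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DemoBaseParserOptions:
    extract_metadata: bool = False


# Hypothetical mixin; the real AttachmentOptionsMixin fields may differ.
@dataclass(frozen=True)
class DemoAttachmentMixin:
    attachment_mode: str = "skip"  # assumed default, for illustration only


@dataclass(frozen=True)
class DemoPdfOptions(DemoBaseParserOptions, DemoAttachmentMixin):
    pages: Optional[tuple] = None


base = DemoPdfOptions()
# replace() builds a new frozen instance with the given fields changed.
custom = replace(base, pages=(0, 1), attachment_mode="save")

try:
    base.extract_metadata = True  # frozen instances reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Immutability makes it safe to share one options object across many conversions, since no call can alter it for the others.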
- class all2md.NetworkFetchOptions
Bases: CloneFrozenMixin
Network security options for remote resource fetching.
This dataclass contains settings that control how remote resources (images, CSS, etc.) are fetched, including security constraints to prevent SSRF attacks.
- Parameters:
allow_remote_fetch (bool, default False) – Whether to allow fetching remote URLs for images and other resources. When False, prevents SSRF attacks by blocking all network requests.
allowed_hosts (list[str] | None, default None) – List of allowed hostnames or CIDR blocks for remote fetching. If None and allow_remote_fetch=True, all hosts are allowed, which may pose an SSRF (Server-Side Request Forgery) risk. A security warning will be logged. In security-sensitive contexts, explicitly set this to an allowlist of trusted hosts.
require_https (bool, default True) – Whether to require HTTPS for all remote URL fetching.
network_timeout (float, default 10.0) – Timeout in seconds for remote URL fetching.
max_requests_per_second (float, default 10.0) – Maximum number of network requests per second (rate limiting).
max_concurrent_requests (int, default 5) – Maximum number of concurrent network requests.
Notes
Asset size limits are inherited from BaseParserOptions.max_asset_size_bytes.
- allow_remote_fetch: bool = False
- allowed_hosts: list[str] | None = None
- require_https: bool = True
- require_head_success: bool = True
- network_timeout: float = 10.0
- max_redirects: int = 5
- allowed_content_types: tuple[str, ...] | None = ('image/',)
- max_requests_per_second: float = 10.0
- max_concurrent_requests: int = 5
- __init__(allow_remote_fetch: bool = False, allowed_hosts: list[str] | None = None, require_https: bool = True, require_head_success: bool = True, network_timeout: float = 10.0, max_redirects: int = 5, allowed_content_types: tuple[str, ...] | None = ('image/',), max_requests_per_second: float = 10.0, max_concurrent_requests: int = 5) None
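The allowed_hosts description mentions both hostnames and CIDR blocks. A minimal sketch of such an allowlist check follows, using only the standard library. The `is_url_allowed` function is hypothetical and not the library's validator. It performs no DNS resolution, so it does not by itself stop a permitted hostname from resolving to an internal address.

```python
import ipaddress
from typing import List, Optional
from urllib.parse import urlsplit


def is_url_allowed(url: str, allowed_hosts: Optional[List[str]],
                   require_https: bool = True) -> bool:
    """Check scheme and host of a URL against an allowlist of hostnames or CIDR blocks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if require_https and parts.scheme != "https":
        return False
    host = parts.hostname
    if host is None:
        return False
    if allowed_hosts is None:
        return True  # documented as "all hosts allowed", which carries SSRF risk
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # host is a name, not an IP literal
    for entry in allowed_hosts:
        if address is not None:
            try:
                if address in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                pass  # entry is a hostname, not a CIDR block
        elif entry.lower() == host.lower():
            return True
    return False


allowlist = ["example.com", "10.0.0.0/8"]
ok_name = is_url_allowed("https://example.com/a.png", allowlist)
ok_cidr = is_url_allowed("https://10.1.2.3/a.png", allowlist)
blocked = is_url_allowed("https://evil.test/a.png", allowlist)
```

Rejecting non-HTTP schemes first keeps `file://` and similar URLs out of the network path entirely.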
- class all2md.LocalFileAccessOptions
Bases: CloneFrozenMixin
Local file access security options.
This dataclass contains settings that control access to local files via file:// URLs and similar mechanisms.
- Parameters:
allow_local_files (bool, default False) – Whether to allow access to local files via file:// URLs.
local_file_allowlist (list[str] | None, default None) – List of directories allowed for local file access. Only applies when allow_local_files=True.
local_file_denylist (list[str] | None, default None) – List of directories denied for local file access.
allow_cwd_files (bool, default False) – Whether to allow local files from current working directory and subdirectories.
- allow_local_files: bool = False
- local_file_allowlist: list[str] | None = None
- local_file_denylist: list[str] | None = None
- allow_cwd_files: bool = False
- __init__(allow_local_files: bool = False, local_file_allowlist: list[str] | None = None, local_file_denylist: list[str] | None = None, allow_cwd_files: bool = False) None
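The four settings above combine into a single yes/no decision per path. The sketch below shows one plausible evaluation order. It assumes the denylist takes precedence over every allow rule and that an allowlist of None permits any directory once allow_local_files is True; neither precedence is stated in this reference. The `is_local_file_allowed` function is hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]


def _is_within(path: Path, directory: Path) -> bool:
    """True if path equals directory or lies somewhere beneath it."""
    try:
        path.relative_to(directory)
        return True
    except ValueError:
        return False


def is_local_file_allowed(path: PathLike, *, allow_local_files: bool = False,
                          allowlist: Optional[List[str]] = None,
                          denylist: Optional[List[str]] = None,
                          allow_cwd_files: bool = False,
                          cwd: Optional[PathLike] = None) -> bool:
    # Resolve first so that ".." segments cannot escape an allowed directory.
    target = Path(path).resolve()
    # Assumption: the denylist wins over every allow rule.
    if any(_is_within(target, Path(d).resolve()) for d in denylist or []):
        return False
    if allow_cwd_files and _is_within(target, Path(cwd or Path.cwd()).resolve()):
        return True
    if not allow_local_files:
        return False
    if allowlist is None:
        return True  # assumption: no allowlist means no directory restriction
    return any(_is_within(target, Path(d).resolve()) for d in allowlist)
```

For example, with `allowlist=["/srv/docs"]` and `denylist=["/srv/docs/secret"]`, a file under `/srv/docs/img` passes while one under `/srv/docs/secret` does not.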
- class all2md.HtmlRendererOptions
Bases: BaseRendererOptions
Configuration options for rendering AST to HTML format.
This dataclass contains settings specific to HTML generation, including document structure, styling, templating, and feature toggles.
- Parameters:
standalone (bool, default True) – Generate complete HTML document with <html>, <head>, <body> tags. If False, generates only the content fragment. Ignored when template_mode is not None.
css_style ({"inline", "embedded", "external", "none"}, default "embedded") – How to include CSS styles:
- “inline”: Add style attributes to elements
- “embedded”: Include <style> block in <head>
- “external”: Reference external CSS file
- “none”: No styling
css_file (str or None, default None) – Path to external CSS file (used when css_style="external").
include_toc (bool, default False) – Generate table of contents from headings.
syntax_highlighting (bool, default True) – Add language classes to code blocks for syntax highlighting.
escape_html (bool, default True) – Escape HTML special characters in text content.
math_renderer ({"mathjax", "katex", "none"}, default "mathjax") – Math rendering library to use for MathML/LaTeX math:
- “mathjax”: Include MathJax CDN script
- “katex”: Include KaTeX CDN script
- “none”: Render math as plain text
html_passthrough_mode ({"pass-through", "escape", "drop", "sanitize"}, default "pass-through") – How to handle HTMLBlock and HTMLInline nodes:
- “pass-through”: Pass through unchanged (use only with trusted content)
- “escape”: HTML-escape the content
- “drop”: Remove HTML content entirely
- “sanitize”: Remove dangerous elements/attributes (requires bleach for best results)
language (str, default "en") – Document language code (ISO 639-1) for the <html lang="…"> attribute. Can be overridden by document metadata.
template_mode ({"inject", "replace", "jinja"} or None, default None) – Template mode for rendering HTML:
- None: Use standalone mode (default behavior)
- “inject”: Inject content into existing HTML file at selector
- “replace”: Replace placeholders in template file
- “jinja”: Use Jinja2 template engine with full context
When set, standalone is ignored.
template_file (str or None, default None) – Path to template file (required when template_mode is not None).
template_selector (str, default "#content") – CSS selector for injection target (used with template_mode="inject").
toc_selector (str or None, default None) – CSS selector for separate TOC injection point (used with template_mode="inject"). If not set, TOC is included with content at template_selector. Allows placing TOC in a different location like a sidebar or header.
injection_mode ({"append", "prepend", "replace"}, default "replace") – How to inject content at selector (used with template_mode="inject"):
- “append”: Add content after existing content
- “prepend”: Add content before existing content
- “replace”: Replace existing content
content_placeholder (str, default "{CONTENT}") – Placeholder string to replace with content (used with template_mode="replace").
css_class_map (dict[str, str | list[str]] or None, default None) – Map AST node type names to custom CSS classes. Example: {“Heading”: “article-heading”, “CodeBlock”: [“code”, “highlight”]}
allow_remote_scripts (bool, default False) – Allow loading remote scripts (e.g., MathJax/KaTeX from a CDN). Defaults to False for security, so CDN usage requires explicit opt-in. When False and math_renderer is not "none", a warning is emitted.
csp_enabled (bool, default True) – Add a Content-Security-Policy meta tag to standalone HTML documents. Helps prevent XSS attacks by restricting resource loading.
csp_policy (str or None, default (secure policy)) – Custom Content-Security-Policy header value. If None, uses default: "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"
comment_mode ({"native", "visible", "ignore"}, default "native") – How to render Comment and CommentInline AST nodes: - "native": Render as HTML comments (<!-- Comment by Author: text -->) - "visible": Render as visible <div>/<span> elements with class="comment" and metadata in data attributes - "ignore": Skip comment nodes entirely. This controls presentation of comments from DOCX reviewer comments, source HTML comments, and other format-specific annotations.
Examples
- Inject into existing HTML:
>>> options = HtmlRendererOptions(
...     template_mode="inject",
...     template_file="layout.html",
...     template_selector="#main-content"
... )
- Replace placeholders:
>>> options = HtmlRendererOptions(
...     template_mode="replace",
...     template_file="template.html",
...     content_placeholder="{CONTENT}"
... )
- Use Jinja2 template:
>>> options = HtmlRendererOptions(
...     template_mode="jinja",
...     template_file="article.html"
... )
- Custom CSS classes:
>>> options = HtmlRendererOptions(
...     css_class_map={"Heading": "prose-heading", "CodeBlock": "code-block"}
... )
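The "replace" template mode can be pictured as plain string substitution of content_placeholder. The sketch below uses only the standard library; `fill_template` is a hypothetical helper and not part of all2md, and the library's real behavior (for example, which other placeholders it recognizes) may differ.

```python
def fill_template(template: str, content: str, placeholder: str = "{CONTENT}") -> str:
    """Hypothetical helper: substitute rendered HTML for a placeholder string."""
    if placeholder not in template:
        # Assumption: a template without the placeholder is treated as an error.
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    # str.replace substitutes every occurrence of the placeholder.
    return template.replace(placeholder, content)


template = "<html><body><main>{CONTENT}</main></body></html>"
page = fill_template(template, "<h1>Title</h1>")
```

Because the match is a literal string, a placeholder that cannot occur naturally in the template is the safer choice.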
- standalone: bool = True
- css_style: Literal['inline', 'embedded', 'external', 'none'] = 'embedded'
- css_file: str | None = None
- include_toc: bool = False
- syntax_highlighting: bool = True
- escape_html: bool = True
- math_renderer: Literal['mathjax', 'katex', 'none'] = 'mathjax'
- html_passthrough_mode: Literal['pass-through', 'escape', 'drop', 'sanitize'] = 'escape'
- language: str = 'en'
- template_mode: Literal['inject', 'replace', 'jinja'] | None = None
- template_file: str | None = None
- template_selector: str = '#content'
- toc_selector: str | None = None
- injection_mode: Literal['append', 'prepend', 'replace'] = 'replace'
- content_placeholder: str = '{CONTENT}'
- css_class_map: dict[str, str | list[str]] | None = None
- allow_remote_scripts: bool = False
- csp_enabled: bool = True
- csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';"
- comment_mode: Literal['native', 'visible', 'ignore'] = 'native'
- __init__(fail_on_resource_errors: bool = False, max_asset_size_bytes: int = 52428800, metadata_policy: MetadataRenderPolicy = <factory>, creator: str | None = 'all2md', standalone: bool = True, css_style: CssStyle = 'embedded', css_file: str | None = None, include_toc: bool = False, syntax_highlighting: bool = True, escape_html: bool = True, math_renderer: MathRenderer = 'mathjax', html_passthrough_mode: HtmlPassthroughMode = 'escape', language: str = 'en', template_mode: TemplateMode | None = None, template_file: str | None = None, template_selector: str = '#content', toc_selector: str | None = None, injection_mode: InjectionMode = 'replace', content_placeholder: str = '{CONTENT}', css_class_map: dict[str, str | list[str]] | None=None, allow_remote_scripts: bool = False, csp_enabled: bool = True, csp_policy: str | None = "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline';", comment_mode: HtmlCommentMode = 'native') None
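To make the four html_passthrough_mode values concrete, here is a rough standard-library illustration of what each one does to a raw HTML fragment. `apply_passthrough` and both regular expressions are hypothetical and written for this example only; the "sanitize" branch in particular is a crude stand-in, not how all2md or bleach sanitizes HTML.

```python
import html
import re

# Crude patterns for the "sanitize" stand-in; a real sanitizer is far more thorough.
_SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_EVENT_ATTR = re.compile(r"""\s+on\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def apply_passthrough(raw_html: str, mode: str = "escape") -> str:
    """Hypothetical illustration of the four html_passthrough_mode values."""
    if mode == "pass-through":
        return raw_html  # unchanged; only appropriate for trusted input
    if mode == "escape":
        return html.escape(raw_html)  # markup becomes literal text
    if mode == "drop":
        return ""  # raw HTML is removed entirely
    if mode == "sanitize":
        cleaned = _SCRIPT_BLOCK.sub("", raw_html)  # strip <script> blocks
        return _EVENT_ATTR.sub("", cleaned)  # strip on* event handlers
    raise ValueError(f"unknown mode: {mode!r}")


snippet = '<b onclick="steal()">hi</b><script>alert(1)</script>'
results = {
    mode: apply_passthrough(snippet, mode)
    for mode in ("pass-through", "escape", "drop", "sanitize")
}
```

With untrusted input, "escape" and "drop" are the modes that never let markup through.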
- class all2md.HtmlOptions
Bases: BaseParserOptions, AttachmentOptionsMixin
Configuration options for HTML-to-Markdown conversion.
This dataclass contains settings specific to HTML document processing, including heading styles, title extraction, image handling, content sanitization, and advanced formatting options. Inherits attachment handling from AttachmentOptionsMixin for images and embedded media.
- Parameters:
extract_title (bool, default False) – Whether to extract and use the HTML <title> element.
convert_nbsp (bool, default False) – Whether to convert non-breaking spaces (&nbsp;) to regular spaces in the output.
strip_dangerous_elements (bool, default False) – Whether to remove potentially dangerous HTML elements (script, style, etc.) and event handler attributes (onclick, onload, etc.).
strip_framework_attributes (bool, default False) – Whether to remove JavaScript framework attributes (Alpine.js x-, Vue.js v-, Angular ng-, HTMX hx-, etc.) that can execute code in framework contexts. Only needed if output HTML will be rendered in browsers with these frameworks.
detect_table_alignment (bool, default True) – Whether to automatically detect table column alignment from CSS/attributes.
preserve_nested_structure (bool, default True) – Whether to maintain proper nesting for blockquotes and other elements.
allowed_attributes (tuple[str, ...] | dict[str, tuple[str, ...]] | None, default None) – Whitelist of allowed HTML attributes. Supports two modes: - Global allowlist: tuple of attribute names applied to all elements - Per-element allowlist: dict mapping element names to tuples of allowed attributes Note: When using CLI, pass complex dict structures as JSON strings for proper parsing.
base_url (str or None, default None) – Base URL for resolving relative hrefs in <a> tags. This is separate from attachment_base_url (used for images/assets). Allows precise control over navigational link URLs vs. resource URLs.
Examples
- Convert and extract page title:
>>> options = HtmlOptions(extract_title=True)
- Convert with content sanitization:
>>> options = HtmlOptions(strip_dangerous_elements=True, convert_nbsp=True)
- Use global attribute allowlist:
>>> options = HtmlOptions(allowed_attributes=('class', 'id', 'href', 'src'))
- Use per-element attribute allowlist:
>>> options = HtmlOptions(allowed_attributes={
...     'img': ('src', 'alt', 'title'),
...     'a': ('href', 'title'),
...     'div': ('class', 'id')
... })
- Extract only the readable article content:
>>> options = HtmlOptions(extract_readable=True)
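The two shapes accepted by allowed_attributes (a global tuple or a per-element dict) can be read as the following lookup. This is a minimal sketch: `is_attribute_allowed` is a hypothetical function, and its handling of None and of elements missing from the dict are assumptions, not documented all2md behavior.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def is_attribute_allowed(
    element: str,
    attribute: str,
    allowed: Sequence[str] | Mapping[str, Sequence[str]] | None,
) -> bool:
    """Hypothetical check mirroring the two allowed_attributes shapes."""
    if allowed is None:
        return True  # assumption: no allowlist means no filtering
    if isinstance(allowed, Mapping):
        # Per-element mode; assumption: unlisted elements keep no attributes.
        return attribute in allowed.get(element, ())
    # Global mode: one tuple applies to every element.
    return attribute in allowed


global_allowlist = ("class", "id", "href", "src")
per_element_allowlist = {"img": ("src", "alt", "title"), "a": ("href", "title")}
```

For example, `is_attribute_allowed("a", "onclick", global_allowlist)` is False under the global tuple, while `is_attribute_allowed("img", "alt", per_element_allowlist)` is True under the per-element dict.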
- extract_title: bool = False
- convert_nbsp: bool = False
- strip_dangerous_elements: bool = False
- strip_framework_attributes: bool = False
- detect_table_alignment: bool = True
- network: NetworkFetchOptions
- local_files: LocalFileAccessOptions
- strip_comments: bool = True
- collapse_whitespace: bool = True
- extract_readable: bool = False
- br_handling: Literal['newline', 'space'] = 'newline'
- allowed_elements: tuple[str, ...] | None = None
- allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None = None
- figures_parsing: Literal['blockquote', 'paragraph', 'image_with_caption', 'caption_only', 'html', 'skip'] = 'blockquote'
- details_parsing: Literal['blockquote', 'paragraph', 'html', 'skip'] = 'blockquote'
- extract_microdata: bool = True
- base_url: str | None = None
- html_parser: Literal['html.parser', 'html5lib', 'lxml'] = 'html.parser'
- __init__(attachment_mode: AttachmentMode = 'alt_text', alt_text_mode: AltTextMode = 'default', attachment_output_dir: str | None = None, attachment_base_url: str | None = None, max_asset_size_bytes: int = 52428800, attachment_filename_template: str = '{stem}_{type}{seq}.{ext}', attachment_overwrite: AttachmentOverwriteMode = 'unique', attachment_deduplicate_by_hash: bool = False, attachments_footnotes_section: str | None = 'Attachments', extract_metadata: bool = False, extract_title: bool = False, convert_nbsp: bool = False, strip_dangerous_elements: bool = False, strip_framework_attributes: bool = False, detect_table_alignment: bool = True, network: NetworkFetchOptions = <factory>, local_files: LocalFileAccessOptions = <factory>, strip_comments: bool = True, collapse_whitespace: bool = True, extract_readable: bool = False, br_handling: BrHandling = 'newline', allowed_elements: tuple[str, ...] | None=None, allowed_attributes: tuple[str, ...] | dict[str, tuple[str, ...]] | None=None, figures_parsing: FiguresParsing = 'blockquote', details_parsing: DetailsParsing = 'blockquote', extract_microdata: bool = True, base_url: str | None = None, html_parser: HtmlParser = 'html.parser') None
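The split between base_url (navigational links) and attachment_base_url (images and assets) is easiest to see with standard relative-URL resolution. The sketch below uses `urllib.parse.urljoin` from the standard library; whether all2md resolves URLs with this exact function is an assumption, and the example.com addresses are placeholders.

```python
from urllib.parse import urljoin

# Placeholder values standing in for HtmlOptions.base_url and attachment_base_url.
base_url = "https://example.com/docs/guide/"
attachment_base_url = "https://cdn.example.com/assets/"

# A relative href in an <a> tag resolves against base_url.
link = urljoin(base_url, "../api/index.html")

# A relative image src resolves against attachment_base_url.
image = urljoin(attachment_base_url, "img/diagram.png")

# An absolute URL is returned unchanged regardless of the base.
absolute = urljoin(base_url, "https://other.example.org/page")
```

Keeping the two bases separate lets links point at the published site while assets are served from a different host.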
- exception all2md.DependencyError
Bases: All2MdError
Exception raised when required dependencies are not available.
This exception is raised when attempting to use a converter that requires external packages that are not installed or don’t meet version requirements.
- Parameters:
converter_name (str) – Name of the converter requiring dependencies
missing_packages (list[tuple[str, str]]) – List of (package_name, version_spec) tuples for missing packages
version_mismatches (list[tuple[str, str, str]], optional) – List of (package_name, required_version, installed_version) tuples for packages with version mismatches
install_command (str, optional) – Suggested pip install command to resolve the issue
message (str, optional) – Custom error message. If not provided, generates a helpful message
- Variables:
converter_name (str) – The converter that has missing dependencies
missing_packages (list[tuple[str, str]]) – Packages that need to be installed
version_mismatches (list[tuple[str, str, str]]) – Packages with version mismatches
install_command (str) – Command to install missing dependencies
- __init__(converter_name: str, missing_packages: list[tuple[str, str]], version_mismatches: list[tuple[str, str, str]] | None = None, install_command: str = '', message: str | None = None, original_import_error: ImportError | None = None)
Initialize the dependency error with package details.
- exception all2md.All2MdError
Bases: Exception
Base exception class for all all2md-specific errors.
This serves as the root exception class for all custom exceptions raised by the all2md library. Catching this will catch all library-specific errors.
- Parameters:
message (str) – Human-readable description of the error
original_error (Exception, optional) – The original exception that caused this error, if applicable
- Variables:
message (str) – The error message
original_error (Exception or None) – The wrapped original exception, if any
- __init__(message: str, original_error: Exception | None = None)
Initialize the error with a message and optional original exception.
- exception all2md.FormatError
Bases: All2MdError
Exception raised when attempting to process an unsupported file format.
This exception indicates that the requested file format or conversion operation is not supported by the current version of all2md.
- Parameters:
message (str, optional) – Custom error message
format_type (str, optional) – The unsupported format type (file extension or MIME type)
supported_formats (list[str], optional) – List of supported formats for reference
original_error (Exception, optional) – The original exception that caused this error
- Variables:
format_type (str or None) – The format that was not supported
supported_formats (list[str] or None) – Available supported formats
- __init__(message: str | None = None, format_type: str | None = None, supported_formats: list[str] | None = None, original_error: Exception | None = None)
Initialize the format error.
- exception all2md.ParsingError
Bases: All2MdError
Exception raised when document parsing fails.
This exception is raised when the parsing process encounters an error that prevents successful completion, such as malformed document structure, unsupported document features, or password-protected files.
- Parameters:
message (str) – Description of the parsing failure
parsing_stage (str, optional) – The stage of parsing where the error occurred
original_error (Exception, optional) – The underlying exception that caused the parsing failure
- Variables:
parsing_stage (str or None) – Where in the parsing process the error occurred
- __init__(message: str, parsing_stage: str | None = None, original_error: Exception | None = None)
Initialize the parsing error.
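Because every exception above derives from All2MdError, handlers are best ordered from most to least specific. The sketch below shows that ordering with local stand-in classes so it runs without the library installed; the stand-ins copy only part of the documented signatures (DependencyError is reduced to three arguments), and the package name and install command are made-up illustration values.

```python
from __future__ import annotations


class All2MdError(Exception):
    """Stand-in for all2md.All2MdError."""

    def __init__(self, message: str, original_error: Exception | None = None):
        super().__init__(message)
        self.message = message
        self.original_error = original_error


class ParsingError(All2MdError):
    """Stand-in for all2md.ParsingError."""

    def __init__(self, message: str, parsing_stage: str | None = None,
                 original_error: Exception | None = None):
        super().__init__(message, original_error)
        self.parsing_stage = parsing_stage


class DependencyError(All2MdError):
    """Stand-in for all2md.DependencyError (reduced to three arguments)."""

    def __init__(self, converter_name: str, missing_packages: list[tuple[str, str]],
                 install_command: str = ""):
        super().__init__(f"{converter_name}: missing dependencies")
        self.converter_name = converter_name
        self.missing_packages = missing_packages
        self.install_command = install_command


def report(error: All2MdError) -> str:
    """Raise the given error and handle it from most to least specific."""
    try:
        raise error
    except DependencyError as exc:
        return f"install with: {exc.install_command}"
    except ParsingError as exc:
        return f"parsing failed during: {exc.parsing_stage}"
    except All2MdError as exc:  # base class: catches any other library error
        return f"conversion failed: {exc.message}"
```

In real code the `try` body would wrap a call such as `to_markdown(...)`, with the exception classes imported from all2md instead of defined locally.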
For organized API documentation, see the API Reference which groups modules by functionality.