all2md.transforms

Transform system for AST manipulation.

This package provides a plugin-based transformation system for manipulating AST structures before rendering. It includes:

  • Transform registry for plugin discovery

  • Hook system for pipeline interception

  • Metadata classes for transform description

  • Built-in transforms for common operations

The transform system uses Python entry points for plugin discovery, allowing third-party packages to register custom transforms.

Examples

Use a transform by name:

>>> from all2md import to_ast
>>> from all2md.transforms import render
>>> doc = to_ast("document.pdf")
>>> markdown = render(doc, transforms=['remove-images'])

Use a transform instance with parameters:

>>> from all2md.transforms import render, HeadingOffsetTransform
>>> markdown = render(
...     doc,
...     transforms=[HeadingOffsetTransform(offset=1)]
... )

Register a custom transform:

>>> from all2md.transforms import transform_registry, TransformMetadata
>>> from all2md.ast.transforms import NodeTransformer
>>>
>>> class MyTransform(NodeTransformer):
...     pass
>>>
>>> metadata = TransformMetadata(
...     name="my-transform",
...     description="My custom transform",
...     transformer_class=MyTransform
... )
>>> transform_registry.register(metadata)

Use hooks for element-specific processing:

>>> def log_images(node, context):
...     print(f"Image: {node.url}")
...     return node
>>>
>>> from all2md.transforms import HookManager
>>> hooks = {'image': [log_images]}
>>> markdown = render(doc, hooks=hooks)
class all2md.transforms.TransformMetadata

Bases: object

Metadata for a transform.

This class describes a transform for registration, discovery, and CLI integration. It follows the same pattern as ConverterMetadata for consistency.

Parameters:
  • name (str) – Unique identifier for the transform (e.g., “remove-images”)

  • description (str) – Human-readable description of what the transform does

  • transformer_class (type[NodeTransformer]) – The transform class (must inherit from NodeTransformer)

  • parameters (dict[str, ParameterSpec], default = empty dict) – Parameters accepted by the transform constructor

  • priority (int, default = 100) – Execution priority (lower runs first). Used as a tiebreaker among transforms whose dependencies are already satisfied

  • dependencies (list[str], default = empty list) – Names of transforms that must run before this one

  • version (str, default = "1.0.0") – Transform version (semantic versioning)

  • author (str, optional) – Transform author or maintainer

  • tags (list[str], default = empty list) – Tags for categorization (e.g., [“images”, “cleanup”])

Examples

Basic transform metadata:
>>> metadata = TransformMetadata(
...     name="remove-images",
...     description="Remove all image nodes from the AST",
...     transformer_class=RemoveImagesTransform
... )
Transform with parameters:
>>> metadata = TransformMetadata(
...     name="heading-offset",
...     description="Shift heading levels by an offset",
...     transformer_class=HeadingOffsetTransform,
...     parameters={
...         'offset': ParameterSpec(
...             type=int,
...             default=1,
...             help="Number of levels to shift (positive or negative)"
...         )
...     }
... )
Transform with dependencies:
>>> metadata = TransformMetadata(
...     name="sanitize-links",
...     description="Sanitize and validate all links",
...     transformer_class=SanitizeLinksTransform,
...     dependencies=["extract-metadata"],
...     priority=200
... )
name: str
description: str
transformer_class: Type[NodeTransformer]
parameters: dict[str, ParameterSpec]
priority: int = 100
dependencies: list[str]
version: str = '1.0.0'
author: str | None = None
tags: list[str]
create_instance(strict: bool = False, **kwargs: Any) NodeTransformer

Create an instance of the transform with given parameters.

Parameters:
  • strict (bool, default = False) – If True, log warnings for unknown parameters to aid debugging

  • **kwargs – Parameters to pass to the transform constructor

Returns:

Transform instance

Return type:

NodeTransformer

Raises:

ValueError – If required parameters are missing or validation fails

Examples

>>> metadata = TransformMetadata(
...     name="test",
...     description="Test transform",
...     transformer_class=MyTransform,
...     parameters={'threshold': ParameterSpec(type=int, default=10)}
... )
>>> instance = metadata.create_instance(threshold=20)
get_parameter_names() list[str]

Get list of parameter names.

Returns:

Parameter names

Return type:

list[str]

has_parameter(name: str) bool

Check if transform has a parameter.

Parameters:

name (str) – Parameter name

Returns:

True if parameter exists

Return type:

bool

__init__(name: str, description: str, transformer_class: ~typing.Type[~all2md.ast.transforms.NodeTransformer], parameters: dict[str, ~all2md.transforms.metadata.ParameterSpec] = <factory>, priority: int = 100, dependencies: list[str] = <factory>, version: str = '1.0.0', author: str | None = None, tags: list[str] = <factory>) None
class all2md.transforms.ParameterSpec

Bases: object

Specification for a transform parameter.

This class describes a single parameter accepted by a transform, including type information, default values, and metadata for CLI generation.

Parameters:
  • type (type) – Python type of the parameter (e.g., int, str, bool)

  • default (Any, optional) – Default value if parameter is not provided

  • help (str, optional) – Help text describing the parameter (used in CLI --help)

  • cli_flag (str, optional) – Custom CLI flag name (e.g., '--my-param'). If None, auto-generated from parameter name

  • required (bool, default = False) – Whether this parameter is required

  • choices (list, optional) – List of valid choices for this parameter

  • validator (callable, optional) – Custom validation function: takes value, returns bool or raises ValueError

  • element_type (type, optional) – For list parameters, the expected type of list elements (e.g., str, int)

  • expose (bool, optional) – Whether to expose this parameter on the CLI when no explicit cli_flag is provided. None defers to global defaults (currently False).

Examples

Simple parameter:
>>> param = ParameterSpec(type=int, default=10, help="Threshold value")
Parameter with choices:
>>> param = ParameterSpec(
...     type=str,
...     default="auto",
...     choices=["auto", "manual", "disabled"],
...     help="Processing mode"
... )
Required parameter with validation:
>>> def validate_positive(value):
...     if value <= 0:
...         raise ValueError("Must be positive")
...     return True
>>> param = ParameterSpec(
...     type=int,
...     required=True,
...     validator=validate_positive,
...     help="Positive integer"
... )
List parameter with element type validation:
>>> param = ParameterSpec(
...     type=list,
...     element_type=str,
...     default=["image", "table"],
...     help="Node types to remove"
... )
type: Type
default: Any = None
help: str = ''
cli_flag: str | None = None
required: bool = False
choices: list[Any] | None = None
validator: Callable[[Any], bool] | None = None
element_type: Type | None = None
expose: bool | None = None
DEFAULT_EXPOSE: ClassVar[bool] = False
validate(value: Any) bool

Validate a parameter value.

Parameters:

value (Any) – Value to validate. For list types, tuples are accepted and coerced to lists automatically.

Returns:

True if valid

Return type:

bool

Raises:

ValueError – If value is invalid

Notes

When validating list parameters, this method accepts both list and tuple types. Tuples are automatically coerced to lists to accommodate CLI parsers that often yield tuples. The coercion is transparent to the caller.
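The checks described above can be sketched as a stand-alone function. This is a minimal illustration, not the library's implementation: the name `validate_value`, its argument names, and the order of the checks are assumptions made for the example.

```python
from typing import Any, Callable, Optional


def validate_value(
    value: Any,
    expected_type: type,
    element_type: Optional[type] = None,
    choices: Optional[list] = None,
    validator: Optional[Callable[[Any], bool]] = None,
) -> bool:
    """Hypothetical stand-in for ParameterSpec.validate()."""
    # CLI parsers often yield tuples, so treat a tuple as a list.
    if expected_type is list and isinstance(value, tuple):
        value = list(value)
    if not isinstance(value, expected_type):
        raise ValueError(
            f"expected {expected_type.__name__}, got {type(value).__name__}"
        )
    # For list parameters, check each element against element_type.
    if expected_type is list and element_type is not None:
        for item in value:
            if not isinstance(item, element_type):
                raise ValueError(
                    f"list element {item!r} is not {element_type.__name__}"
                )
    if choices is not None and value not in choices:
        raise ValueError(f"{value!r} is not one of {choices!r}")
    # A custom validator may return False or raise ValueError itself.
    if validator is not None and not validator(value):
        raise ValueError(f"validator rejected {value!r}")
    return True


print(validate_value(("image", "table"), list, element_type=str))  # True
```

The tuple is converted on a local copy only, so the caller's object is left unchanged.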

get_cli_flag(param_name: str) str

Get CLI flag name for this parameter.

Parameters:

param_name (str) – Parameter name from the transform

Returns:

CLI flag (e.g., '--threshold')

Return type:

str

should_expose(default: bool | None = None) bool

Determine whether this parameter should surface in the CLI.

get_dest_name(param_name: str, transform_name: str) str

Get argparse dest name for this parameter.

This provides a consistent naming convention for transform parameters in the argparse namespace, avoiding conflicts between transforms.

Parameters:
  • param_name (str) – Parameter name from the transform

  • transform_name (str) – Name of the transform

Returns:

Destination name for argparse (e.g., 'heading_offset_offset')

Return type:

str

Notes

The dest name is constructed to avoid collisions:

  • Format: f'{transform_name}_{param_name}'

  • Hyphens are converted to underscores to produce valid Python identifiers

  • Example: 'heading-offset' transform, 'offset' param -> 'heading_offset_offset'
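The naming rule in these notes fits in one line. The helper below is a hypothetical stand-alone function written for illustration, not the method itself.

```python
def dest_name(param_name: str, transform_name: str) -> str:
    """Hypothetical stand-in for ParameterSpec.get_dest_name()."""
    # Join transform and parameter names, then replace hyphens so the
    # result is a valid Python identifier.
    return f"{transform_name}_{param_name}".replace("-", "_")


print(dest_name("offset", "heading-offset"))   # heading_offset_offset
print(dest_name("threshold", "my-transform"))  # my_transform_threshold
```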

get_argparse_kwargs(param_name: str, transform_name: str) dict

Generate argparse kwargs for this parameter.

This centralizes the logic for converting ParameterSpec to argparse add_argument() kwargs, ensuring consistency between CLI argument definition and parameter extraction.

Parameters:
  • param_name (str) – Parameter name from the transform

  • transform_name (str) – Name of the transform (for help text)

Returns:

Keyword arguments for argparse.ArgumentParser.add_argument()

Return type:

dict

Notes

The returned dict includes:

  • 'action': Tracking action class (TrackingStoreAction, etc.)

  • 'type': Python type for conversion (if applicable)

  • 'default': Default value (if applicable)

  • 'help': Help text

  • 'choices': Valid choices (if specified)

  • 'nargs': Argument count (for list types)

  • 'dest': Destination name in namespace

Examples

>>> param = ParameterSpec(type=int, default=10, help="Threshold")
>>> kwargs = param.get_argparse_kwargs('threshold', 'my-transform')
>>> # Returns: {'action': TrackingStoreAction, 'type': int,
>>> #           'default': 10, 'help': 'Threshold', 'dest': 'my_transform_threshold'}
extract_value(namespace: Any, dest: str) tuple[Any, bool]

Extract parameter value from parsed argparse namespace.

This handles extracting the value and determining if it was explicitly provided by the user (vs. being a default value).

Parameters:
  • namespace (argparse.Namespace) – Parsed command line arguments

  • dest (str) – Destination name in the namespace (from get_dest_name())

Returns:

Tuple of (value, was_provided) where:

  • value: The parameter value (or None if not provided)

  • was_provided: True if user explicitly provided this value

Return type:

tuple[Any, bool]

Notes

This method checks the _provided_args set in the namespace to determine if a value was explicitly provided by the user. Only explicitly provided values should be passed to transform constructors.

Examples

>>> namespace = argparse.Namespace(
...     my_transform_threshold=20,
...     _provided_args={'my_transform_threshold'}
... )
>>> param = ParameterSpec(type=int, default=10)
>>> value, provided = param.extract_value(namespace, 'my_transform_threshold')
>>> # Returns: (20, True)
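One way such tracking could work with plain argparse is sketched below. The `TrackingStore` action and the `extract` helper are hypothetical names for this example; only the `_provided_args` attribute and the (value, was_provided) tuple come from the description above.

```python
import argparse


class TrackingStore(argparse.Action):
    """Hypothetical store action that records which dests the user set."""

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)
        # Record the dest so explicit values can be told apart from defaults.
        provided = getattr(namespace, "_provided_args", None)
        if provided is None:
            provided = set()
            setattr(namespace, "_provided_args", provided)
        provided.add(self.dest)


def extract(namespace, dest):
    """Return (value, was_provided) for a dest in the parsed namespace."""
    provided = getattr(namespace, "_provided_args", None) or set()
    if dest in provided:
        return getattr(namespace, dest, None), True
    return None, False


parser = argparse.ArgumentParser()
parser.add_argument("--threshold", action=TrackingStore, type=int,
                    default=10, dest="my_transform_threshold")
parser.add_argument("--mode", action=TrackingStore,
                    default="auto", dest="my_transform_mode")

args = parser.parse_args(["--threshold", "20"])
print(extract(args, "my_transform_threshold"))  # (20, True)
print(extract(args, "my_transform_mode"))       # (None, False)
```

argparse only calls an action when the flag appears on the command line, so a dest filled in from its default never enters the set.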
__init__(type: Type, default: Any = None, help: str = '', cli_flag: str | None = None, required: bool = False, choices: list[Any] | None = None, validator: Callable[[Any], bool] | None = None, element_type: Type | None = None, expose: bool | None = None) None
class all2md.transforms.TransformRegistry

Bases: object

Registry for managing AST transforms.

This singleton class provides a central registry for all transforms, handling:

  • Transform registration and discovery

  • Entry point plugin loading

  • Dependency resolution

  • Lazy instantiation

The registry automatically discovers transforms via the all2md.transforms entry point group on first access.

Notes

The preferred way to access the registry is to import the global transform_registry instance rather than instantiate this class directly. Instantiation also works because of the singleton pattern and returns the same object, but importing transform_registry is more explicit.

Examples

Use the global registry instance (preferred):
>>> from all2md.transforms import transform_registry
>>> transform_registry.register(metadata)
Get a transform instance:
>>> from all2md.transforms import transform_registry
>>> transformer = transform_registry.get_transform("remove-images")
List all available transforms:
>>> from all2md.transforms import transform_registry
>>> transforms = transform_registry.list_transforms()

static __new__(cls) TransformRegistry

Create or return singleton instance.
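A common way to get this behaviour from `__new__` is sketched below. The `Registry` class and its `_items` attribute are hypothetical; the actual TransformRegistry may differ in detail.

```python
class Registry:
    """Hypothetical singleton registry, not the library's TransformRegistry."""

    _instance = None

    def __new__(cls):
        # Create the instance once; later calls return the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._items = {}
        return cls._instance


a = Registry()
b = Registry()
a._items["remove-images"] = "metadata"
print(a is b)    # True
print(b._items)  # {'remove-images': 'metadata'}
```

Because every call returns the same object, state registered through one reference is visible through all others.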

register(metadata: TransformMetadata) None

Register a transform with its metadata.

Parameters:

metadata (TransformMetadata) – Transform metadata to register

Notes

If a transform with the same name is already registered, it will be overwritten and a warning will be logged.

Examples

>>> metadata = TransformMetadata(
...     name="my-transform",
...     description="My custom transform",
...     transformer_class=MyTransform
... )
>>> transform_registry = TransformRegistry()
>>> transform_registry.register(metadata)
unregister(name: str) bool

Unregister a transform.

Parameters:

name (str) – Transform name to unregister

Returns:

True if transform was unregistered, False if not found

Return type:

bool

get_metadata(name: str) TransformMetadata

Get metadata for a transform.

Parameters:

name (str) – Transform name

Returns:

Transform metadata

Return type:

TransformMetadata

Raises:

KeyError – If transform is not registered

get_transform(name: str, **kwargs: Any) NodeTransformer

Get a transform instance by name.

Parameters:
  • name (str) – Transform name

  • **kwargs – Parameters to pass to transform constructor

Returns:

Transform instance

Return type:

NodeTransformer

Raises:
  • KeyError – If transform is not registered

  • ValueError – If parameters are invalid

Examples

>>> transform_registry = TransformRegistry()
>>> transformer = transform_registry.get_transform("heading-offset", offset=2)
has_transform(name: str) bool

Check if a transform is registered.

Parameters:

name (str) – Transform name

Returns:

True if transform is registered

Return type:

bool

list_transforms(tags: list[str] | None = None) list[str]

List all registered transform names.

Parameters:

tags (list[str], optional) – Filter by tags. If provided, only transforms with at least one matching tag are returned

Returns:

List of transform names, sorted alphabetically

Return type:

list[str]

Examples

List all transforms:
>>> names = transform_registry.list_transforms()
List transforms with specific tags:
>>> image_transforms = transform_registry.list_transforms(tags=["images"])
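The "at least one matching tag" rule can be illustrated with a small stand-alone function. The function name and the plain dict of tags are assumptions for this sketch.

```python
def list_names(tags_by_name, tags=None):
    """Hypothetical filter: tags_by_name maps transform name -> list of tags."""
    if tags is None:
        return sorted(tags_by_name)
    wanted = set(tags)
    # Keep a name when its tags share at least one entry with the filter.
    return sorted(
        name for name, own in tags_by_name.items() if wanted & set(own)
    )


registered = {
    "remove-images": ["images", "cleanup"],
    "heading-offset": ["headings"],
    "sanitize-links": ["links", "cleanup"],
}
print(list_names(registered))
# ['heading-offset', 'remove-images', 'sanitize-links']
print(list_names(registered, tags=["cleanup", "tables"]))
# ['remove-images', 'sanitize-links']
```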
discover_plugins() int

Discover and register transforms from entry points.

This method scans for plugins using the all2md.transforms entry point group and registers all discovered transforms.

Returns:

Number of transforms discovered and registered

Return type:

int

Examples

>>> transform_registry = TransformRegistry()
>>> count = transform_registry.discover_plugins()
>>> print(f"Discovered {count} transforms")
resolve_dependencies(transform_names: list[str]) list[str]

Resolve transform dependencies and return execution order.

This method performs topological sorting using Kahn’s algorithm to determine the correct execution order based on dependencies and priorities. Priority is used as a tiebreaker among transforms with no pending dependencies.

Parameters:

transform_names (list[str]) – List of transform names to order

Returns:

Transform names in execution order (dependencies first)

Return type:

list[str]

Raises:

ValueError – If circular dependencies are detected or a dependency is not found

Examples

>>> transform_registry = TransformRegistry()
>>> ordered = transform_registry.resolve_dependencies([
...     "sanitize-links",  # depends on "extract-metadata"
...     "extract-metadata"
... ])
>>> print(ordered)
['extract-metadata', 'sanitize-links']
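Kahn's algorithm with a priority tiebreaker can be sketched with a heap of ready nodes. The `resolve` function and its plain-dict inputs are hypothetical. As a simplification, a dependency counts as "not found" here when it is absent from the requested names, which may differ from the registry's rule.

```python
import heapq


def resolve(names, deps, priority):
    """Order names so dependencies come first; lower priority breaks ties.

    deps maps name -> list of names that must run before it.
    priority maps name -> int (100 when missing).
    """
    selected = set(names)
    for name in names:
        for dep in deps.get(name, []):
            if dep not in selected:
                raise ValueError(f"{name!r} depends on unknown {dep!r}")

    # indegree counts unmet dependencies; dependents is the reverse edge map.
    indegree = {name: len(deps.get(name, [])) for name in names}
    dependents = {name: [] for name in names}
    for name in names:
        for dep in deps.get(name, []):
            dependents[dep].append(name)

    # Heap of ready transforms keyed by (priority, name).
    ready = [(priority.get(n, 100), n) for n in names if indegree[n] == 0]
    heapq.heapify(ready)

    ordered = []
    while ready:
        _, name = heapq.heappop(ready)
        ordered.append(name)
        for child in dependents[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (priority.get(child, 100), child))

    # Anything left unprocessed sits on a cycle.
    if len(ordered) != len(names):
        raise ValueError("circular dependency detected")
    return ordered


print(resolve(
    ["sanitize-links", "extract-metadata"],
    {"sanitize-links": ["extract-metadata"]},
    {"sanitize-links": 200},
))
# ['extract-metadata', 'sanitize-links']
```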
clear() None

Clear all registered transforms.

This is primarily useful for testing.

class all2md.transforms.HookManager

Bases: object

Manager for registering and executing hooks.

This class provides a central registry for hooks at various pipeline stages and for specific node types.

Parameters:

strict (bool, default = False) – If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.

Examples

Create a hook manager:
>>> manager = HookManager()
Create a strict hook manager:
>>> manager = HookManager(strict=True)
Register a pipeline hook:
>>> def pre_render_hook(doc, context):
...     print("About to render")
...     return doc
>>> manager.register_hook('pre_render', pre_render_hook)
Register a node hook:
>>> def image_hook(node, context):
...     print(f"Processing image: {node.url}")
...     return node
>>> manager.register_hook('image', image_hook)
Execute hooks:
>>> context = HookContext(document=my_doc)
>>> result = manager.execute_hooks('pre_render', my_doc, context)

Notes

In strict mode (strict=True), any exception raised by a hook will be re-raised and abort the pipeline. This is useful for debugging or when hook failures should be treated as critical errors.

In non-strict mode (strict=False, the default), exceptions are logged with full traceback but execution continues with subsequent hooks. This provides a fail-safe default that prevents a single problematic hook from breaking the entire pipeline.

Thread Safety

WARNING: HookManager instances are NOT thread-safe. Hook registration and execution use shared mutable state without synchronization.

For safe concurrent usage:

  • Create a separate HookManager instance per thread/pipeline (recommended)

  • Each Pipeline instance creates its own HookManager (default behavior)

  • If sharing across threads, wrap access with external locks (e.g., threading.Lock)

__init__(strict: bool = False) None

Initialize the hook manager.

Parameters:

strict (bool, default = False) – Enable strict mode for hook exception handling

register_hook(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], hook: Callable[[Any, HookContext], Any], priority: int = 100) None

Register a hook for a target.

Parameters:
  • target (HookTarget) – Hook point or node type to hook into

  • hook (callable) – Hook function with signature: (obj, context) -> obj

  • priority (int, default = 100) – Execution priority (lower runs first)

Notes

Hooks for the same target are executed in priority order (lower first). If priorities are equal, hooks run in registration order.

Sorting is deferred until execution time for better performance when registering many hooks.

Examples

>>> manager = HookManager()
>>> manager.register_hook('image', my_image_hook, priority=50)
unregister_hook(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], hook: Callable[[Any, HookContext], Any]) bool

Unregister a hook.

Parameters:
  • target (HookTarget) – Hook point or node type

  • hook (callable) – Hook function to remove

Returns:

True if hook was found and removed

Return type:

bool

execute_hooks(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], obj: Any, context: HookContext) Any

Execute all hooks for a target.

Hooks are executed in priority order. Each hook receives the result from the previous hook. If a hook returns None, the object is removed (for node hooks).

Parameters:
  • target (HookTarget) – Hook point or node type

  • obj (Any) – Object to process (Document or Node)

  • context (HookContext) – Hook context

Returns:

Processed object (or None if removed by a hook)

Return type:

Any

Raises:

Exception – Any exception from hooks if strict mode is enabled

Examples

>>> context = HookContext(document=doc)
>>> result = manager.execute_hooks('image', image_node, context)

Notes

In strict mode, exceptions from hooks are re-raised and abort execution. In non-strict mode (default), exceptions are logged and execution continues.

Hooks are sorted by priority at execution time for better registration performance when many hooks are registered.
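The ordering and error-handling rules above can be illustrated with a much smaller manager. `MiniHookManager` is a hypothetical stand-in; in particular, stopping as soon as a hook returns None is an assumption of this sketch.

```python
import logging

logger = logging.getLogger("hooks-sketch")


class MiniHookManager:
    """Hypothetical, simplified stand-in for HookManager."""

    def __init__(self, strict=False):
        self.strict = strict
        self._hooks = {}  # target -> list of (priority, hook)

    def register_hook(self, target, hook, priority=100):
        # Append only; sorting is deferred until execution.
        self._hooks.setdefault(target, []).append((priority, hook))

    def execute_hooks(self, target, obj, context):
        # sorted() is stable, so equal priorities keep registration order.
        pairs = sorted(self._hooks.get(target, []), key=lambda pair: pair[0])
        for _, hook in pairs:
            try:
                result = hook(obj, context)
            except Exception:
                if self.strict:
                    raise
                logger.exception("hook %r failed; continuing", hook)
                continue
            if result is None:
                return None  # a hook removed the object
            obj = result
        return obj


calls = []

def add_bang(node, context):
    calls.append("add_bang")
    return node + "!"

def shout(node, context):
    calls.append("shout")
    return node.upper()

def broken(node, context):
    raise RuntimeError("boom")

manager = MiniHookManager()
manager.register_hook("text", add_bang, priority=100)
manager.register_hook("text", broken, priority=75)
manager.register_hook("text", shout, priority=50)
print(manager.execute_hooks("text", "hi", {}))  # HI!
print(calls)                                    # ['shout', 'add_bang']
```

With `strict=True` the same call would raise the RuntimeError from `broken` instead of logging it.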

has_hooks(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description']) bool

Check if any hooks are registered for a target.

Parameters:

target (HookTarget) – Hook point or node type

Returns:

True if hooks are registered

Return type:

bool

static get_node_type(node: Node) Literal['document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'] | None

Get the node type string for a node instance.

This static method supports subclasses by using isinstance checks rather than exact type matching. If a node is a subclass of a known type, it will be identified by its parent type.

Parameters:

node (Node) – AST node

Returns:

Node type string (e.g., ‘heading’, ‘image’), or None if unknown

Return type:

NodeType or None

Notes

The method iterates through known node types and returns the first match using isinstance checks. This allows custom subclasses to be recognized by their base type. For example, a custom MyImage(Image) subclass will be identified as type ‘image’.

Performance: Uses module-level _NODE_TYPE_MAP constant to avoid reconstructing the mapping on every call (hot path optimization).

This is a static method because it doesn’t depend on instance state, only on the module-level _NODE_TYPE_MAP constant. This allows it to be called without instantiating HookManager.

Examples

>>> from all2md.ast.nodes import Image
>>> img = Image(url="test.png", alt_text="Test")
>>> node_type = HookManager.get_node_type(img)
>>> print(node_type)
image
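The isinstance-based lookup described in the notes can be sketched with a few placeholder classes. The classes and the two-entry mapping below are hypothetical; the real `_NODE_TYPE_MAP` covers every node type and may be structured differently.

```python
class Node:
    pass

class Image(Node):
    pass

class Heading(Node):
    pass

class MyImage(Image):
    """Custom subclass, expected to be reported as its base type."""

# Built once at module level so lookups do not rebuild it on every call.
_NODE_TYPE_MAP = ((Image, "image"), (Heading, "heading"))


def get_node_type(node):
    # The first isinstance match wins, so subclasses map to their base type.
    for cls, name in _NODE_TYPE_MAP:
        if isinstance(node, cls):
            return name
    return None


print(get_node_type(MyImage()))  # image
print(get_node_type(Node()))     # None
```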
list_hooks() dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[tuple[int, Callable[[Any, HookContext], Any]]]]

List all registered hooks with their priorities.

This method provides a public API for enumerating hooks without exposing the internal _hooks dictionary structure.

Returns:

Dictionary mapping hook targets to lists of (priority, hook) tuples. The returned dictionary is a shallow copy to prevent external modifications to internal state.

Return type:

dict[HookTarget, list[tuple[int, HookCallable]]]

Examples

>>> manager = HookManager()
>>> manager.register_hook('image', my_hook, priority=50)
>>> hooks = manager.list_hooks()
>>> print(hooks)
{'image': [(50, <function my_hook>)]}
clear() None

Clear all registered hooks.

This is primarily useful for testing.

class all2md.transforms.HookContext

Bases: object

Context passed to hook functions.

This class provides hooks with access to document state, metadata, and a shared data dictionary for passing information between hooks and transforms.

Parameters:
  • document (Document) – The current document being processed

  • metadata (dict, default = empty dict) – Document metadata from the source format

  • shared (dict, default = empty dict) – Shared mutable dictionary for passing data between hooks/transforms

  • transform_name (str, optional) – Name of the current transform (for transform hooks)

  • node_path (list[Node], default = empty list) – Path from document root to current node (for node hooks). WARNING: This list is mutated during tree traversal. Not thread-safe.

Examples

Access context in a hook:

>>> def my_hook(node: Image, context: HookContext) -> Image:
...     # Store image count in shared state
...     context.shared['image_count'] = context.shared.get('image_count', 0) + 1
...
...     # Access document metadata
...     if 'author' in context.metadata:
...         print(f"Document by: {context.metadata['author']}")
...
...     return node
document: Document
metadata: dict[str, Any]
shared: dict[str, Any]
transform_name: str | None = None
node_path: list[Node]
get_shared(key: str, default: Any = None) Any

Get a value from shared state.

Parameters:
  • key (str) – Key to retrieve

  • default (Any, optional) – Default value if key not found

Returns:

Value from shared state or default

Return type:

Any

set_shared(key: str, value: Any) None

Set a value in shared state.

Parameters:
  • key (str) – Key to set

  • value (Any) – Value to store
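The shared-state accessors can be pictured with a small dataclass. `MiniContext` is a hypothetical stand-in whose field names follow the attributes listed above; `document` is left untyped so the sketch runs without the library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MiniContext:
    """Hypothetical stand-in for HookContext."""

    document: Any
    # default_factory gives every context its own dict and list.
    metadata: dict = field(default_factory=dict)
    shared: dict = field(default_factory=dict)
    transform_name: Optional[str] = None
    node_path: list = field(default_factory=list)

    def get_shared(self, key, default=None):
        return self.shared.get(key, default)

    def set_shared(self, key, value):
        self.shared[key] = value


def count_images(node, context):
    # Accumulate a counter that later hooks or transforms can read.
    context.set_shared("image_count", context.get_shared("image_count", 0) + 1)
    return node


ctx = MiniContext(document=None)
for node in ("a.png", "b.png"):
    count_images(node, ctx)
print(ctx.get_shared("image_count"))  # 2
```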

__init__(document: Document, metadata: dict[str, ~typing.Any]=<factory>, shared: dict[str, ~typing.Any]=<factory>, transform_name: str | None = None, node_path: list[Node] = <factory>) None
class all2md.transforms.Pipeline

Bases: object

Pipeline for transforming and rendering AST documents.

This class orchestrates the complete transformation and rendering pipeline, including transform resolution, hook execution, and rendering to output format.

Parameters:
  • transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances. Names are resolved via TransformRegistry

  • hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables

  • renderer (str, type, or renderer instance, optional) – Renderer to use for output. Defaults to MarkdownRenderer with default options. Can be:

    • Format name string (e.g., “markdown”), looked up via registry

    • Renderer class (e.g., MarkdownRenderer)

    • Renderer instance (e.g., MarkdownRenderer())

  • options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class, ignored if instance)

  • progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering

  • strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.

Examples

Create pipeline with default markdown renderer:

>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     hooks={'pre_render': [validate]}
... )
>>> output = pipeline.execute(document)

With custom renderer:

>>> from all2md.renderers.markdown import MarkdownRenderer
>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     renderer=MarkdownRenderer(options=MarkdownRendererOptions(flavor='commonmark'))
... )
>>> output = pipeline.execute(document)

With strict hook mode:

>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort pipeline
... )
>>> output = pipeline.execute(document)

__init__(transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, renderer: str | type | Any | bool | None = None, options: BaseRendererOptions | MarkdownRendererOptions | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False)

Initialize pipeline with transforms, hooks, renderer, and options.

Parameters:
  • transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances

  • hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables

  • renderer (str, type, renderer instance, False, or None) –

    • str: Format name to look up via registry (e.g., “markdown”)

    • type: Renderer class to instantiate

    • instance: Pre-configured renderer to use

    • False: Skip renderer setup (for AST-only processing)

    • None: Use default MarkdownRenderer (default)

  • options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class, ignored if renderer is instance)

  • progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering

  • strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False, exceptions are logged and execution continues.

get_diagnostics() → dict[str, Any]

Get diagnostic information about the pipeline configuration.

This method returns structured information about the pipeline’s configuration, useful for debugging, visualization, and documentation.

Returns:

Dictionary containing:

  • transforms: List of transform names in execution order

  • hooks: Dictionary of hook targets with their registered hooks and priorities

  • renderer: Renderer class name

  • options: Renderer options class name (if available)

Return type:

dict[str, Any]

Examples

>>> pipeline = Pipeline(
...     transforms=['remove-images', 'heading-offset'],
...     hooks={'image': [my_hook], 'pre_render': [validator]},
...     renderer='markdown'
... )
>>> diag = pipeline.get_diagnostics()
>>> print(diag['transforms'])
['RemoveImagesTransform', 'HeadingOffsetTransform']
>>> print(diag['hooks'])
{'image': [{'priority': 0, 'function': 'my_hook'}], ...}
execute(document: Document) → str | bytes

Execute complete pipeline.

This method runs the full transformation and rendering pipeline:

  1. Execute post_ast hooks

  2. Apply transforms (with pre/post transform hooks)

  3. Apply element hooks

  4. Execute pre_render hooks

  5. Render to output format

  6. Execute post_render hooks

If a progress_callback is configured, progress events are emitted at each stage of the pipeline.

Parameters:

document (Document) – Document to process

Returns:

Rendered output (type depends on renderer)

Return type:

str or bytes

Examples

>>> pipeline = Pipeline(transforms=['remove-images'])
>>> output = pipeline.execute(document)
class all2md.transforms.HookAwareVisitor

Bases: NodeTransformer

Visitor that applies element hooks during tree traversal.

This visitor extends NodeTransformer to execute registered element hooks for each node type during traversal. It maintains the node path in the context for hooks that need to know the tree structure.

Parameters:
  • hook_manager (HookManager) – Manager containing registered hooks

  • context (HookContext) – Context to pass to hooks

Examples

>>> hook_manager = HookManager()
>>> hook_manager.register_hook('image', my_image_hook)
>>> context = HookContext(document=doc)
>>> visitor = HookAwareVisitor(hook_manager, context)
>>> processed_doc = visitor.transform(doc)

__init__(hook_manager: HookManager, context: HookContext)

Initialize visitor with hook manager and context.

transform(node: Node) → Node | None

Transform node and apply element hooks.

The node is pushed onto node_path before processing and remains there during child traversal, ensuring descendants can see full ancestry. If a hook replaces the node, the path is updated so descendants see the new node in their ancestry.

Parameters:

node (Node) – Node to transform

Returns:

Transformed node, or None if removed by hook

Return type:

Node or None
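
The path handling described above can be illustrated with a small standalone sketch. It does not use all2md: the Node class, the walk function, and the hook signature (node, node_path) are hypothetical stand-ins chosen only to show the push, replace, and pop behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for an AST node; not all2md's Node class.
    kind: str
    children: list = field(default_factory=list)


def walk(node, hooks, node_path):
    """Apply hooks to node and its descendants, tracking ancestry in node_path."""
    node_path.append(node)  # push before processing so descendants see it
    try:
        for hook in hooks.get(node.kind, []):
            replacement = hook(node, node_path)
            if replacement is None:
                return None  # the hook removed the node
            node = replacement
            node_path[-1] = node  # keep the path in sync with the replacement
        kept = []
        for child in node.children:
            result = walk(child, hooks, node_path)
            if result is not None:
                kept.append(result)
        node.children = kept
        return node
    finally:
        node_path.pop()  # always restore the path on the way out


ancestry = []

def record_image(node, path):
    ancestry.append([n.kind for n in path])
    return node

def swap_paragraph(node, path):
    return Node("block_quote", node.children)

tree = Node("document", [Node("paragraph", [Node("image")])])
path = []
result = walk(tree, {"paragraph": [swap_paragraph], "image": [record_image]}, path)
print(ancestry)  # [['document', 'block_quote', 'image']]
```

In this sketch the image hook sees block_quote rather than paragraph in its ancestry, because the replacement was written into the path before the children were visited.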

all2md.transforms.apply(document: Document, transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False) → Document

Apply transforms and hooks to document without rendering.

This function provides AST-only processing by applying transforms and hooks to a document without the rendering stage. It reuses Pipeline internals to maintain consistent hook execution order.

This is useful for developers who want to:

  • Process AST structures programmatically

  • Chain multiple transformation passes

  • Inspect/modify documents before rendering

  • Build custom rendering pipelines

Parameters:
  • document (Document) – AST document to process

  • transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances

  • hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables. Hook targets can be pipeline stages (‘post_ast’, ‘pre_transform’, ‘post_transform’, ‘pre_render’) or node types (‘image’, ‘link’, etc.)

  • progress_callback (ProgressCallback, optional) – Optional callback for progress updates during processing

  • strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.

Returns:

Processed document with transforms and hooks applied

Return type:

Document

Raises:
  • TypeError – If transform is not a string or NodeTransformer

  • ValueError – If transform name is not found or a hook removes the document node

Notes

The following hooks are executed in order:

  1. post_ast - After AST creation (document just came from conversion)

  2. pre_transform - Before each transform

  3. post_transform - After each transform

  4. pre_render - Before element hooks (for document-level validation)

  5. Element hooks - During tree traversal (image, link, heading, etc.)

The post_render hook is NOT executed since no rendering occurs.
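
The order listed in these notes can be modelled with a short standalone sketch. It is a toy, not all2md's implementation: apply_sketch, the dict used as a context, and the flat list of (node_type, value) pairs standing in for a document are all assumptions made for illustration.

```python
def apply_sketch(doc, transforms=(), hooks=None):
    """Toy model of the hook order listed above; not all2md's implementation."""
    hooks = hooks or {}
    context = {"document": doc}  # hypothetical stand-in for HookContext

    def fire(target, value):
        for hook in hooks.get(target, []):
            value = hook(value, context)
        return value

    doc = fire("post_ast", doc)
    for transform in transforms:
        doc = fire("pre_transform", doc)
        doc = transform(doc)
        doc = fire("post_transform", doc)
    doc = fire("pre_render", doc)

    # Element hooks: the toy document is a flat list of (node_type, value) pairs.
    result = []
    for node in doc:
        for hook in hooks.get(node[0], []):
            node = hook(node, context)
            if node is None:  # a hook may remove the node
                break
        if node is not None:
            result.append(node)
    return result


trace = []

def tracer(name):
    def hook(value, context):
        trace.append(name)
        return value
    return hook

doc = [("heading", "Intro"), ("image", "logo.png")]
hooks = {name: [tracer(name)] for name in
         ("post_ast", "pre_transform", "post_transform", "pre_render", "image")}
processed = apply_sketch(doc, transforms=[lambda d: d], hooks=hooks)
print(trace)
# ['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'image']
```

No post_render stage appears in the sketch, mirroring the note above that apply() never renders.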

Examples

Apply transforms only:

>>> from all2md import to_ast
>>> from all2md.transforms import apply
>>> doc = to_ast("document.pdf")
>>> processed = apply(doc, transforms=['remove-images'])

Apply hooks only:

>>> def log_image(node, context):
...     print(f"Found image: {node.url}")
...     return node
>>> processed = apply(doc, hooks={'image': [log_image]})

Apply both transforms and hooks:

>>> from all2md.transforms import HeadingOffsetTransform
>>> processed = apply(
...     doc,
...     transforms=[HeadingOffsetTransform(offset=1), 'remove-images'],
...     hooks={
...         'pre_render': [validate_document],
...         'link': [rewrite_links]
...     }
... )

Chain multiple processing passes:

>>> doc1 = apply(doc, transforms=['heading-offset'])
>>> doc2 = apply(doc1, transforms=['remove-images'])
>>> markdown = render(doc2)

With strict hook mode:

>>> processed = apply(
...     doc,
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort
... )
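
The difference between the two modes can be sketched in isolation. The run_hooks helper below is hypothetical and only models the documented contract (re-raise when strict, log and continue otherwise); in particular, leaving the node unchanged after a failed hook is an assumption of this sketch.

```python
import logging

logger = logging.getLogger("hook_sketch")


def run_hooks(node, hook_list, strict_hooks=False):
    """Run hooks in order; hypothetical model of strict vs. lenient handling."""
    for hook in hook_list:
        try:
            node = hook(node)
        except Exception:
            if strict_hooks:
                raise  # strict mode: abort the pipeline
            # lenient mode: log and keep going with the node unchanged (assumption)
            logger.exception("hook %r failed; continuing", getattr(hook, "__name__", hook))
    return node


def broken(node):
    raise ValueError("bad image")

def upper(node):
    return node.upper()

print(run_hooks("img", [broken, upper]))  # IMG
```

With strict_hooks=True the same call would raise the ValueError from the first hook instead of reaching the second.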

With progress tracking:

>>> def progress_handler(event):
...     print(f"{event.event_type}: {event.message}")
>>> processed = apply(doc, transforms=['remove-images'], progress_callback=progress_handler)
all2md.transforms.render(document: Document, transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, renderer: str | type | Any | None = None, options: BaseRendererOptions | MarkdownRendererOptions | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False, **kwargs: Any) → str | bytes

Render document with transforms and hooks using specified renderer.

This is the high-level entry point for the transformation pipeline. It creates a Pipeline instance and executes it to produce rendered output.

Parameters:
  • document (Document) – AST document to render

  • transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances

  • hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables. Hook targets can be pipeline stages (‘pre_render’, ‘post_render’, etc.) or node types (‘image’, ‘link’, ‘heading’, etc.)

  • renderer (str, type, or renderer instance, optional) – Renderer to use. Defaults to MarkdownRenderer. Can be:

    • Format name string (e.g., “markdown”), looked up via registry

    • Renderer class (e.g., MarkdownRenderer)

    • Renderer instance (e.g., MarkdownRenderer())

  • options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class, ignored if renderer is instance)

  • progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering

  • strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.

  • **kwargs – Additional keyword arguments passed to MarkdownRendererOptions if options is not provided and renderer is markdown

Returns:

Rendered output (type depends on renderer)

Return type:

str or bytes

Raises:
  • TypeError – If transform is not a string or NodeTransformer

  • ValueError – If transform name is not found or a hook removes a required node

Examples

Basic rendering to markdown:

>>> from all2md import to_ast
>>> from all2md.transforms import render
>>> doc = to_ast("document.pdf")
>>> markdown = render(doc)

With transforms by name:

>>> markdown = render(doc, transforms=['remove-images'])

With custom renderer:

>>> from all2md.renderers.markdown import MarkdownRenderer
>>> output = render(
...     doc,
...     renderer=MarkdownRenderer(options=MarkdownRendererOptions(flavor='commonmark'))
... )

With hooks:

>>> def log_image(node, context):
...     print(f"Found image: {node.url}")
...     return node
>>> markdown = render(doc, hooks={'image': [log_image]})

Combined transforms and hooks:

>>> markdown = render(
...     doc,
...     transforms=['heading-offset', 'remove-images'],
...     hooks={
...         'pre_render': [validate_document],
...         'link': [rewrite_links],
...         'post_render': [add_footer]
...     },
...     options=MarkdownRendererOptions(flavor='commonmark')
... )

With MarkdownRendererOptions kwargs:

>>> markdown = render(doc, flavor='gfm', emphasis_symbol='_')

With strict hook mode:

>>> markdown = render(
...     doc,
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort
... )
class all2md.transforms.RemoveImagesTransform

Bases: NodeTransformer

Remove all Image nodes from the AST.

This transform removes every Image node it encounters, useful for creating text-only versions of documents or reducing document size.

Examples

>>> transform = RemoveImagesTransform()
>>> doc_without_images = transform.transform(document)
visit_image(node: Image) → None

Remove image by returning None.

Parameters:

node (Image) – Image node to remove

Returns:

Always returns None to remove the node

Return type:

None

class all2md.transforms.RemoveNodesTransform

Bases: NodeTransformer

Remove nodes of specified types from the AST.

This is a generic transform that can remove any combination of node types. Useful for stripping specific elements like tables, code blocks, or any other node type.

Parameters:

node_types (list[str]) – List of node type names to remove (e.g., [‘image’, ‘table’, ‘code_block’])

Examples

Remove images and tables:

>>> transform = RemoveNodesTransform(node_types=['image', 'table'])
>>> cleaned_doc = transform.transform(document)

__init__(node_types: list[str])

Initialize with list of node types to remove.

Parameters:

node_types (list[str]) – Node type names to remove

Raises:

ValueError – If ‘document’ is in node_types (cannot remove root node), or if any node_type is unknown (typo detection)

transform(node: Node) → Node | None

Transform node, removing it if it matches specified types.

Parameters:

node (Node) – Node to potentially remove

Returns:

None if node should be removed, otherwise transformed node

Return type:

Node or None

class all2md.transforms.HeadingOffsetTransform

Bases: NodeTransformer

Shift heading levels by a specified offset.

This transform adjusts all heading levels in the document by adding an offset value. Levels are clamped to the valid range of 1-6.

Parameters:

offset (int, default = 1) – Number of levels to shift (positive to increase, negative to decrease)

Examples

Increase all heading levels by 1 (H1 becomes H2):

>>> transform = HeadingOffsetTransform(offset=1)
>>> new_doc = transform.transform(document)

Decrease all heading levels by 1 (H2 becomes H1):

>>> transform = HeadingOffsetTransform(offset=-1)
>>> new_doc = transform.transform(document)

__init__(offset: int = 1)

Initialize with heading level offset.

Parameters:

offset (int) – Heading level adjustment

visit_heading(node: Heading) → Heading

Adjust heading level.

Parameters:

node (Heading) – Heading node to adjust

Returns:

Heading with adjusted level

Return type:

Heading
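
The clamping rule stated for this transform (levels stay within 1-6) can be written as a one-line helper. The function name shift_heading_level is hypothetical; only the arithmetic is taken from the description above.

```python
def shift_heading_level(level: int, offset: int = 1) -> int:
    """Add offset to a heading level and clamp the result to the range 1-6."""
    return max(1, min(6, level + offset))


print(shift_heading_level(1, 1))   # 2
print(shift_heading_level(6, 1))   # 6 (clamped at the top)
print(shift_heading_level(1, -1))  # 1 (clamped at the bottom)
```

Because of the clamp, a shift is not always reversible: with offset=1, both H5 and H6 end up as H6.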

class all2md.transforms.TitlePromotionTransform

Bases: NodeTransformer

Promote a leading H1 to a document title and shift subsequent headings.

When converting Markdown to Word, a leading # Heading is typically the document title rather than a “Heading 1”. This transform detects a leading H1 (skipping empty / whitespace-only paragraphs before it), marks it with metadata["is_title"] = True, and promotes all subsequent headings by one level (H2 → H1, H3 → H2, etc.) so they style properly under the title.

If the first real content node is not an H1, the document passes through unchanged.

Examples

>>> transform = TitlePromotionTransform()
>>> new_doc = transform.transform(document)
transform(node: Node) → Node

Apply title promotion to the document.

Parameters:

node (Node) – Root node (expected to be a Document)

Returns:

Document with title metadata and promoted headings

Return type:

Node
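
The promotion rule can be sketched on a simplified block list. This is not the transform's code: blocks are plain dicts rather than AST nodes, promote_title is a hypothetical name, and clamping a later H1 at level 1 is an assumption, since the description above only covers H2 and deeper.

```python
def promote_title(blocks):
    """Mark a leading H1 as the title and promote the headings after it."""
    first = None
    for i, block in enumerate(blocks):
        if block["type"] == "paragraph" and not block["text"].strip():
            continue  # skip empty / whitespace-only paragraphs
        first = i
        break
    is_h1 = (first is not None
             and blocks[first]["type"] == "heading"
             and blocks[first]["level"] == 1)
    if not is_h1:
        return blocks  # first real content is not an H1: pass through unchanged

    promoted = []
    for i, block in enumerate(blocks):
        block = dict(block)  # copy so the input is left untouched
        if i == first:
            block["metadata"] = {"is_title": True}
        elif i > first and block["type"] == "heading":
            block["level"] = max(1, block["level"] - 1)  # clamp is an assumption
        promoted.append(block)
    return promoted


blocks = [
    {"type": "paragraph", "text": "  "},
    {"type": "heading", "level": 1, "text": "Title"},
    {"type": "heading", "level": 2, "text": "Intro"},
    {"type": "heading", "level": 3, "text": "Details"},
]
out = promote_title(blocks)
print([b.get("level") for b in out])  # [None, 1, 1, 2]
```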

class all2md.transforms.LinkRewriterTransform

Bases: NodeTransformer

Rewrite link URLs using regex pattern matching.

This transform allows flexible URL rewriting using regular expressions. Useful for converting relative links to absolute, updating base URLs, or modifying link schemes.

Parameters:
  • pattern (str) – Regex pattern to match in URLs

  • replacement (str) – Replacement string (can include regex groups like \1, \2)

Raises:

SecurityError – If the pattern contains dangerous constructs that could lead to ReDoS (Regular Expression Denial of Service) attacks

Examples

Convert relative links to absolute:

>>> transform = LinkRewriterTransform(
...     pattern=r'^/docs/',
...     replacement='https://example.com/docs/'
... )
>>> new_doc = transform.transform(document)
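
The effect of a pattern and replacement pair on a single URL can be previewed with the standard library alone. The sketch below assumes the rewrite behaves like re.sub, which the description suggests but does not state; rewrite_url is a hypothetical helper, not part of all2md.

```python
import re


def rewrite_url(url: str, pattern: str, replacement: str) -> str:
    """Rewrite one URL with re.sub (assumed to mirror the transform's behavior)."""
    return re.sub(pattern, replacement, url)


print(rewrite_url("/docs/guide.html", r"^/docs/", "https://example.com/docs/"))
# https://example.com/docs/guide.html

# Backreferences such as \1 reuse captured groups from the pattern.
print(rewrite_url("http://old.example.com/a/b",
                  r"^http://old\.example\.com/(.*)$",
                  r"https://new.example.com/\1"))
# https://new.example.com/a/b
```

A URL that does not match the pattern is returned unchanged, so unrelated links are left alone.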

Notes

For security reasons, this transform validates user-supplied regex patterns to prevent ReDoS attacks. Patterns with nested quantifiers or excessive backtracking potential are rejected. See validate_user_regex_pattern() for details on what patterns are considered safe.

__init__(pattern: str, replacement: str)

Initialize with pattern and replacement.

Parameters:
  • pattern (str) – Regex pattern

  • replacement (str) – Replacement string

Raises:

SecurityError – If pattern contains dangerous constructs

visit_link(node: Link) → Link

Rewrite link URL if it matches pattern.

Parameters:

node (Link) – Link node to potentially rewrite

Returns:

Link with potentially rewritten URL

Return type:

Link

class all2md.transforms.TextReplacerTransform

Bases: NodeTransformer

Find and replace text in Text nodes.

This transform performs simple text replacement across all Text nodes in the document. For regex-based replacement, use a custom transform.

Parameters:
  • find (str) – Text to find

  • replace (str) – Replacement text

Examples

Replace all instances of “TODO”:

>>> transform = TextReplacerTransform(find="TODO", replace="DONE")
>>> new_doc = transform.transform(document)

__init__(find: str, replace: str)

Initialize with find and replace strings.

Parameters:
  • find (str) – Text to find

  • replace (str) – Replacement text

visit_text(node: Text) → Text

Replace text in node content.

Parameters:

node (Text) – Text node to process

Returns:

Text node with replacements applied

Return type:

Text

class all2md.transforms.AddHeadingIdsTransform

Bases: NodeTransformer

Generate and add unique IDs to heading nodes.

This transform creates slugified IDs from heading text and adds them to the heading metadata. These IDs can be used by renderers to create HTML anchors for linkable sections.

Parameters:
  • id_prefix (str, default = "") – Prefix to add to all generated IDs

  • separator (str, default = "-") – Separator for multi-word slugs and duplicate handling

Examples

Basic usage:

>>> transform = AddHeadingIdsTransform()
>>> new_doc = transform.transform(document)
>>> # "My Heading" -> metadata['id'] = "my-heading"

With prefix:

>>> transform = AddHeadingIdsTransform(id_prefix="doc-")
>>> new_doc = transform.transform(document)
>>> # "My Heading" -> metadata['id'] = "doc-my-heading"

__init__(id_prefix: str = '', separator: str = '-')

Initialize with prefix and separator.

Parameters:
  • id_prefix (str) – Prefix for IDs

  • separator (str) – Word separator

visit_heading(node: Heading) → Heading

Add unique ID to heading.

Parameters:

node (Heading) – Heading to process

Returns:

Heading with ID in metadata

Return type:

Heading
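
One way such IDs could be generated is sketched below. The slug rule (lowercase ASCII letters and digits only) and the numeric suffix format for duplicates are assumptions made for illustration; the documentation above only shows that "My Heading" becomes "my-heading". HeadingIdGenerator and slugify are hypothetical names.

```python
import re


def slugify(text: str, separator: str = "-") -> str:
    """Lowercase the text and join its alphanumeric runs with the separator."""
    return separator.join(re.findall(r"[a-z0-9]+", text.lower()))


class HeadingIdGenerator:
    """Hypothetical helper that keeps generated IDs unique within a document."""

    def __init__(self, id_prefix: str = "", separator: str = "-"):
        self.id_prefix = id_prefix
        self.separator = separator
        self.seen = {}  # base ID -> number of times it has been requested

    def make_id(self, text: str) -> str:
        base = self.id_prefix + slugify(text, self.separator)
        count = self.seen.get(base, 0) + 1
        self.seen[base] = count
        # Suffix format for duplicates is an assumption of this sketch.
        return base if count == 1 else f"{base}{self.separator}{count}"


gen = HeadingIdGenerator()
print(gen.make_id("My Heading"))  # my-heading
print(gen.make_id("My Heading"))  # my-heading-2
print(HeadingIdGenerator(id_prefix="doc-").make_id("My Heading"))  # doc-my-heading
```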

class all2md.transforms.RemoveBoilerplateTextTransform

Bases: NodeTransformer

Remove paragraphs matching common boilerplate patterns.

This transform removes paragraphs whose text matches predefined patterns like “CONFIDENTIAL”, “Page X of Y”, etc. Useful for cleaning up corporate documents and reports.

Parameters:
  • patterns (list[str], optional) – List of regex patterns to match (default: common boilerplate)

  • skip_if_truncated (bool, default = True) – If True, skip pattern matching when text exceeds MAX_TEXT_LENGTH_FOR_REGEX to avoid false positives with end-anchored patterns ($). If False, match against truncated text (may produce incorrect results with anchors).

Raises:

SecurityError – If any user-supplied pattern contains dangerous constructs that could lead to ReDoS (Regular Expression Denial of Service) attacks

Examples

Use default patterns:

>>> transform = RemoveBoilerplateTextTransform()
>>> cleaned_doc = transform.transform(document)

Custom patterns with anchoring:

>>> transform = RemoveBoilerplateTextTransform(
...     patterns=[r"^DRAFT$", r"^INTERNAL ONLY$", r"^Page \d+ of \d+$"]
... )
>>> cleaned_doc = transform.transform(document)

Allow matching truncated text (not recommended):

>>> transform = RemoveBoilerplateTextTransform(skip_if_truncated=False)
>>> cleaned_doc = transform.transform(document)

Notes

Pattern Matching Semantics: This transform uses Python’s re.match(), which implicitly anchors at the start of the string (equivalent to adding ^ at the beginning). For exact matching of entire paragraphs, patterns should include an end anchor ($). For example:

  • r"CONFIDENTIAL" - Matches paragraphs starting with “CONFIDENTIAL”

  • r"CONFIDENTIAL$" - Matches paragraphs that are exactly “CONFIDENTIAL” or start with “CONFIDENTIAL” followed by only whitespace

  • r"^CONFIDENTIAL$" - Explicitly anchored (redundant ^, but clearer)

If you need to match patterns anywhere in the text (not just at the start), use re.search() semantics by implementing a custom transform.
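
The difference between re.match() and re.search() can be checked directly with the standard library; the sample text is made up for illustration.

```python
import re

text = "CONFIDENTIAL - internal draft"

# re.match() only tries the pattern at the start of the string.
starts_with = re.match(r"CONFIDENTIAL", text) is not None       # True
whole_para = re.match(r"CONFIDENTIAL$", text) is not None       # False: extra text follows
exact = re.match(r"CONFIDENTIAL$", "CONFIDENTIAL") is not None   # True

# A pattern that occurs later in the text is found by search(), not match().
mid_match = re.match(r"draft", text) is not None     # False
mid_search = re.search(r"draft", text) is not None   # True

print(starts_with, whole_para, exact, mid_match, mid_search)
# True False True False True
```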

Security: For security reasons, this transform validates user-supplied regex patterns to prevent ReDoS attacks. Default patterns are pre-validated and trusted. Patterns with nested quantifiers or excessive backtracking potential are rejected. See validate_user_regex_pattern() for details.

Truncation Behavior: Text longer than MAX_TEXT_LENGTH_FOR_REGEX (10000 characters) is truncated before matching for ReDoS protection. With skip_if_truncated=True (default), such paragraphs are preserved to avoid false positives from patterns using end anchors ($). This is safer but may miss some boilerplate. With skip_if_truncated=False, matching proceeds on truncated text, which may incorrectly match or not match patterns with anchors.
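
The trade-off between the two settings can be seen in a small model. The is_boilerplate function is a hypothetical stand-in for the transform's matching step, assuming simple prefix truncation; only the 10000-character limit and the skip_if_truncated semantics come from the notes above.

```python
import re

MAX_TEXT_LENGTH_FOR_REGEX = 10000  # limit stated in the notes above


def is_boilerplate(text, patterns, skip_if_truncated=True):
    """Hypothetical model of the matching step, including truncation handling."""
    if len(text) > MAX_TEXT_LENGTH_FOR_REGEX:
        if skip_if_truncated:
            return False  # keep the paragraph rather than risk a false positive
        text = text[:MAX_TEXT_LENGTH_FOR_REGEX]  # match against the prefix only
    return any(re.match(pattern, text) for pattern in patterns)


# 10001 characters: only the last one breaks the end-anchored pattern.
long_text = "A" * MAX_TEXT_LENGTH_FOR_REGEX + "B"

print(is_boilerplate(long_text, [r"A+$"]))                           # False
print(is_boilerplate(long_text, [r"A+$"], skip_if_truncated=False))  # True
```

The second call reports a match only because the trailing "B" was cut off, which is the kind of false positive the default setting avoids.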

__init__(patterns: list[str] | None = None, skip_if_truncated: bool = True)

Initialize with patterns.

Parameters:
  • patterns (list[str] or None) – Regex patterns to match (None uses defaults)

  • skip_if_truncated (bool) – Skip matching when text is truncated (safer default)

Raises:

SecurityError – If any user-supplied pattern contains dangerous constructs

visit_paragraph(node: Paragraph) → Paragraph | None

Remove paragraph if it matches boilerplate pattern.

Parameters:

node (Paragraph) – Paragraph to check

Returns:

None if matches boilerplate, otherwise paragraph

Return type:

Paragraph or None

class all2md.transforms.AddConversionTimestampTransform

Bases: NodeTransformer

Add conversion timestamp to document metadata.

This transform adds a timezone-aware UTC timestamp to the document metadata indicating when the conversion occurred. Useful for tracking document versions and conversion history. All timestamps are generated in UTC to ensure consistency across different time zones.

Parameters:
  • field_name (str, default = "conversion_timestamp") – Metadata field name for the timestamp

  • timestamp_format (str, default = "iso") – Timestamp format: “iso” for ISO 8601 with timezone, “unix” for Unix timestamp, or any strftime format string

  • timespec (str, default = "seconds") – Time precision for ISO format timestamps. Only applies when timestamp_format="iso"; ignored for other formats. Valid values are:

    • "auto": Automatic precision

    • "hours": Hours precision

    • "minutes": Minutes precision

    • "seconds": Seconds precision (default, reduces noisy diffs)

    • "milliseconds": Milliseconds precision

    • "microseconds": Microseconds precision

Examples

Add ISO 8601 timestamp with second precision (default):

>>> transform = AddConversionTimestampTransform()
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "2025-01-01T12:00:00+00:00"

Add ISO 8601 timestamp with microsecond precision:

>>> transform = AddConversionTimestampTransform(timespec="microseconds")
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "2025-01-01T12:00:00.123456+00:00"

Add Unix timestamp:

>>> transform = AddConversionTimestampTransform(timestamp_format="unix")
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "1735732800"

Custom strftime format:

>>> transform = AddConversionTimestampTransform(
...     field_name="converted_at",
...     timestamp_format="%Y-%m-%d %H:%M:%S UTC"
... )
>>> new_doc = transform.transform(document)
>>> # metadata['converted_at'] = "2025-01-01 12:00:00 UTC"

Notes

All timestamps are generated in UTC (Coordinated Universal Time) using datetime.now(timezone.utc). This ensures consistent timestamps regardless of the server’s local timezone.

The default timespec="seconds" is recommended to reduce noisy git diffs when regenerating documents, as subsecond precision is rarely needed for document conversion timestamps.
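
The three format modes map onto standard datetime calls. The format_timestamp helper below is a hypothetical sketch of that mapping, not the transform's code; a fixed UTC datetime is used instead of datetime.now(timezone.utc) so the outputs are reproducible.

```python
from datetime import datetime, timezone


def format_timestamp(now, timestamp_format="iso", timespec="seconds"):
    """Format a timezone-aware datetime as ISO 8601, Unix seconds, or strftime."""
    if timestamp_format == "iso":
        return now.isoformat(timespec=timespec)
    if timestamp_format == "unix":
        return str(int(now.timestamp()))
    return now.strftime(timestamp_format)  # any other value is a strftime pattern


fixed = datetime(2025, 1, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)

print(format_timestamp(fixed))                           # 2025-01-01T12:00:00+00:00
print(format_timestamp(fixed, timespec="microseconds"))  # 2025-01-01T12:00:00.123456+00:00
print(format_timestamp(fixed, "unix"))                   # 1735732800
print(format_timestamp(fixed, "%Y-%m-%d %H:%M:%S UTC"))  # 2025-01-01 12:00:00 UTC
```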

__init__(field_name: str = 'conversion_timestamp', timestamp_format: str = 'iso', timespec: str = 'seconds')

Initialize with field name, format, and time precision.

Parameters:
  • field_name (str) – Metadata field name

  • timestamp_format (str) – Timestamp format

  • timespec (str) – Time precision for ISO format (default: “seconds”)

visit_document(node: Document) → Document

Add timestamp to document metadata.

Parameters:

node (Document) – Document to process

Returns:

Document with timestamp in metadata

Return type:

Document

class all2md.transforms.CalculateWordCountTransform

Bases: NodeTransformer

Calculate word and character counts and add to metadata.

This transform traverses the entire document, extracts all text, and calculates word and character counts. The counts are added to the document metadata.

Parameters:
  • word_field (str, default = "word_count") – Metadata field name for word count

  • char_field (str, default = "char_count") – Metadata field name for character count

Examples

Basic usage:

>>> transform = CalculateWordCountTransform()
>>> new_doc = transform.transform(document)
>>> # metadata['word_count'] = 150
>>> # metadata['char_count'] = 890

Custom field names:

>>> transform = CalculateWordCountTransform(
...     word_field="words",
...     char_field="characters"
... )
>>> new_doc = transform.transform(document)

Notes

Character Count Behavior: The char_count metric represents the length of normalized text extracted from the AST, not the original document’s character count. During text extraction, text fragments from separate AST nodes are joined with spaces, which may introduce synthetic spacing not present in the original document. For example, if the AST contains two adjacent Text nodes Text("hello") and Text("world"), the extracted text will be "hello world" (11 characters including the inserted space), even though the original text nodes only contain 10 characters total.

This normalized approach provides consistent metrics across different AST structures, though it may not exactly match the original document’s byte count. Word count is calculated by splitting the normalized text on whitespace, which is generally more robust to these variations.
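
The hello/world example above can be reproduced in a few lines. The function count_words_and_chars is a hypothetical sketch that takes the text fragments as a plain list of strings rather than walking an AST.

```python
def count_words_and_chars(fragments):
    """Join text fragments with spaces, then count words and characters."""
    text = " ".join(fragments)
    return len(text.split()), len(text)


fragments = ["hello", "world"]
words, chars = count_words_and_chars(fragments)
print(words, chars)                      # 2 11
print(sum(len(f) for f in fragments))    # 10 characters in the raw fragments
```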

__init__(word_field: str = 'word_count', char_field: str = 'char_count')

Initialize with field names.

Parameters:
  • word_field (str) – Field name for word count

  • char_field (str) – Field name for character count

visit_document(node: Document) → Document

Calculate counts and add to metadata.

Parameters:

node (Document) – Document to analyze

Returns:

Document with counts in metadata

Return type:

Document

class all2md.transforms.AddAttachmentFootnotesTransform

Bases: NodeTransformer

Add footnote definitions for attachment references.

When attachments are processed with alt_text_mode="footnote", they generate footnote-style references like ![image][^label] but no corresponding definitions. This transform scans the AST for such references and adds FootnoteDefinition nodes with source information.

Parameters:
  • section_title (str or None, default = "Attachments") – Title for the footnote section heading. If None, no heading is added.

  • add_definitions_for_images (bool, default = True) – Add definitions for image footnote references

  • add_definitions_for_links (bool, default = True) – Add definitions for link footnote references

Examples

Add footnote definitions after conversion:

>>> transform = AddAttachmentFootnotesTransform()
>>> doc_with_footnotes = transform.transform(document)

Custom section title:

>>> transform = AddAttachmentFootnotesTransform(section_title="Image Sources")
>>> doc_with_footnotes = transform.transform(document)

Notes

This transform works by:

  1. Collecting all Image and Link nodes with empty URLs (indicates footnote mode)

  2. Extracting footnote labels from alt text or title

  3. Handling duplicate labels by appending numeric suffixes (-2, -3, etc.)

  4. Creating FootnoteDefinition nodes with source information

  5. Appending definitions to the end of the document

Duplicate labels are resolved using a counter mechanism similar to heading ID generation. When a label appears multiple times, subsequent occurrences get a numeric suffix to ensure unique footnote identifiers.
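
The counter mechanism for duplicate labels can be sketched as follows. The unique_labels function is hypothetical; only the suffix scheme (-2, -3, and so on for repeated labels) is taken from the notes above.

```python
def unique_labels(labels):
    """Append -2, -3, ... to repeated labels so every footnote ID is unique."""
    counts = {}
    result = []
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
        n = counts[label]
        result.append(label if n == 1 else f"{label}-{n}")
    return result


print(unique_labels(["fig", "fig", "chart", "fig"]))
# ['fig', 'fig-2', 'chart', 'fig-3']
```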

__init__(section_title: str | None = 'Attachments', add_definitions_for_images: bool = True, add_definitions_for_links: bool = True)

Initialize transform with options.

Parameters:
  • section_title (str or None) – Heading for footnotes section

  • add_definitions_for_images (bool) – Whether to process image footnotes

  • add_definitions_for_links (bool) – Whether to process link footnotes

visit_document(node: Document) → Document

Process document and add footnote definitions.

Parameters:

node (Document) – Document to process

Returns:

Document with footnote definitions added

Return type:

Document

class all2md.transforms.GenerateTocTransform

Bases: NodeTransformer

Generate a table of contents from document headings.

This transform extracts headings from the document and generates a nested list representing the table of contents. The TOC can be placed at the top or bottom of the document.

Parameters:
  • title (str, default = "Table of Contents") – Title for the TOC section

  • max_depth (int, default = 3) – Maximum heading level to include (1-6)

  • position ({"top", "bottom"}, default = "top") – Position to insert the TOC

  • add_links (bool, default = True) – Whether to create links to headings (requires heading IDs)

  • separator (str, default = "-") – Separator for generating heading IDs when not present

  • set_ids_if_missing (bool, default = False) – If True, inject generated IDs into heading metadata when missing. This ensures renderers create anchors matching the TOC links. If False (default), IDs are only used for TOC links.

Examples

Basic usage:

>>> transform = GenerateTocTransform()
>>> doc_with_toc = transform.transform(document)

Custom depth and position:

>>> transform = GenerateTocTransform(
...     title="Contents",
...     max_depth=2,
...     position="bottom"
... )
>>> doc_with_toc = transform.transform(document)

Inject IDs into headings:

>>> transform = GenerateTocTransform(set_ids_if_missing=True)
>>> doc_with_toc = transform.transform(document)
>>> # Headings now have 'id' in metadata for renderer anchors

Notes

This transform works best when combined with AddHeadingIdsTransform, which generates unique IDs for headings that can be used for navigation. If headings don’t have IDs, the transform will generate slugified IDs on-the-fly for link targets.

ID Injection: With set_ids_if_missing=True, generated IDs are injected into heading metadata so renderers can create matching anchors. This is recommended when not using AddHeadingIdsTransform. Alternatively, run AddHeadingIdsTransform before GenerateTocTransform to ensure all headings have IDs upfront.
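
How headings turn into a nested list can be illustrated with a text-only sketch. The transform itself produces AST list nodes; build_toc_lines below is a hypothetical helper that emits Markdown list lines instead, and the (level, text, id) tuples are an assumed input shape.

```python
def build_toc_lines(headings, max_depth=3, add_links=True):
    """Turn (level, text, heading_id) tuples into indented Markdown list lines."""
    lines = []
    for level, text, heading_id in headings:
        if level > max_depth:
            continue  # headings deeper than max_depth are left out
        indent = "  " * (level - 1)
        label = f"[{text}](#{heading_id})" if add_links else text
        lines.append(f"{indent}- {label}")
    return lines


headings = [(1, "Intro", "intro"), (2, "Setup", "setup"), (4, "Deep", "deep")]
for line in build_toc_lines(headings):
    print(line)
# - [Intro](#intro)
#   - [Setup](#setup)
```

The links only resolve if the rendered headings carry the same IDs, which is why the notes above recommend AddHeadingIdsTransform or set_ids_if_missing=True.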

__init__(title: str = 'Table of Contents', max_depth: int = 3, position: str = 'top', add_links: bool = True, separator: str = '-', set_ids_if_missing: bool = False)

Initialize with TOC generation options.

Parameters:
  • title (str) – TOC section title

  • max_depth (int) – Maximum heading level (1-6)

  • position (str) – Position for TOC (“top” or “bottom”)

  • add_links (bool) – Whether to generate links

  • separator (str) – Separator for ID generation

  • set_ids_if_missing (bool) – Inject generated IDs into heading metadata

Raises:

ValueError – If max_depth is not between 1 and 6, or position is invalid

visit_document(node: Document) → Document

Generate TOC and add to document.

Parameters:

node (Document) – Document to process

Returns:

Document with TOC added

Return type:

Document

For transforms module documentation organized by functionality, see Transforms.