all2md.transforms
Transform system for AST manipulation.
This package provides a plugin-based transformation system for manipulating AST structures before rendering. It includes:
- Transform registry for plugin discovery
- Hook system for pipeline interception
- Metadata classes for transform description
- Built-in transforms for common operations
The transform system uses Python entry points for plugin discovery, allowing third-party packages to register custom transforms.
Examples
Use a transform by name:
>>> from all2md import to_ast
>>> from all2md.transforms import render
>>> doc = to_ast("document.pdf")
>>> markdown = render(doc, transforms=['remove-images'])
Use a transform instance with parameters:
>>> from all2md.transforms import render, HeadingOffsetTransform
>>> markdown = render(
...     doc,
...     transforms=[HeadingOffsetTransform(offset=1)]
... )
Register a custom transform:
>>> from all2md.transforms import transform_registry, TransformMetadata
>>> from all2md.ast.transforms import NodeTransformer
>>>
>>> class MyTransform(NodeTransformer):
...     pass
>>>
>>> metadata = TransformMetadata(
...     name="my-transform",
...     description="My custom transform",
...     transformer_class=MyTransform
... )
>>> transform_registry.register(metadata)
Use hooks for element-specific processing:
>>> def log_images(node, context):
...     print(f"Image: {node.url}")
...     return node
>>>
>>> from all2md.transforms import HookManager
>>> hooks = {'image': [log_images]}
>>> markdown = render(doc, hooks=hooks)
- class all2md.transforms.TransformMetadata
Bases: object
Metadata for a transform.
This class describes a transform for registration, discovery, and CLI integration. It follows the same pattern as ConverterMetadata for consistency.
- Parameters:
name (str) – Unique identifier for the transform (e.g., “remove-images”)
description (str) – Human-readable description of what the transform does
transformer_class (type[NodeTransformer]) – The transform class (must inherit from NodeTransformer)
parameters (dict[str, ParameterSpec], default = empty dict) – Parameters accepted by the transform constructor
priority (int, default = 100) – Execution priority (lower runs first). Used for dependency ordering
dependencies (list[str], default = empty list) – Names of transforms that must run before this one
version (str, default = "1.0.0") – Transform version (semantic versioning)
author (str, optional) – Transform author or maintainer
tags (list[str], default = empty list) – Tags for categorization (e.g., [“images”, “cleanup”])
Examples
- Basic transform metadata:
>>> metadata = TransformMetadata(
...     name="remove-images",
...     description="Remove all image nodes from the AST",
...     transformer_class=RemoveImagesTransform
... )
- Transform with parameters:
>>> metadata = TransformMetadata(
...     name="heading-offset",
...     description="Shift heading levels by an offset",
...     transformer_class=HeadingOffsetTransform,
...     parameters={
...         'offset': ParameterSpec(
...             type=int,
...             default=1,
...             help="Number of levels to shift (positive or negative)"
...         )
...     }
... )
- Transform with dependencies:
>>> metadata = TransformMetadata(
...     name="sanitize-links",
...     description="Sanitize and validate all links",
...     transformer_class=SanitizeLinksTransform,
...     dependencies=["extract-metadata"],
...     priority=200
... )
- name: str
- description: str
- transformer_class: Type[NodeTransformer]
- parameters: dict[str, ParameterSpec]
- priority: int = 100
- dependencies: list[str]
- version: str = '1.0.0'
- author: str | None = None
- tags: list[str]
- create_instance(strict: bool = False, **kwargs: Any) NodeTransformer
Create an instance of the transform with given parameters.
- Parameters:
strict (bool, default = False) – If True, log warnings for unknown parameters to aid debugging
**kwargs – Parameters to pass to the transform constructor
- Returns:
Transform instance
- Return type:
NodeTransformer
- Raises:
ValueError – If required parameters are missing or validation fails
Examples
>>> metadata = TransformMetadata(
...     name="test",
...     description="Test transform",
...     transformer_class=MyTransform,
...     parameters={'threshold': ParameterSpec(type=int, default=10)}
... )
>>> instance = metadata.create_instance(threshold=20)
- get_parameter_names() list[str]
Get list of parameter names.
- Returns:
Parameter names
- Return type:
list[str]
- has_parameter(name: str) bool
Check if transform has a parameter.
- Parameters:
name (str) – Parameter name
- Returns:
True if parameter exists
- Return type:
bool
- __init__(name: str, description: str, transformer_class: Type[NodeTransformer], parameters: dict[str, ParameterSpec] = <factory>, priority: int = 100, dependencies: list[str] = <factory>, version: str = '1.0.0', author: str | None = None, tags: list[str] = <factory>) None
- class all2md.transforms.ParameterSpec
Bases: object
Specification for a transform parameter.
This class describes a single parameter accepted by a transform, including type information, default values, and metadata for CLI generation.
- Parameters:
type (type) – Python type of the parameter (e.g., int, str, bool)
default (Any, optional) – Default value if parameter is not provided
help (str, optional) – Help text describing the parameter (used in CLI --help)
cli_flag (str, optional) – Custom CLI flag name (e.g., '--my-param'). If None, auto-generated from parameter name
required (bool, default = False) – Whether this parameter is required
choices (list, optional) – List of valid choices for this parameter
validator (callable, optional) – Custom validation function: takes value, returns bool or raises ValueError
element_type (type, optional) – For list parameters, the expected type of list elements (e.g., str, int)
expose (bool, optional) – Whether to expose this parameter on the CLI when no explicit cli_flag is provided. None defers to global defaults (currently False).
Examples
- Simple parameter:
>>> param = ParameterSpec(type=int, default=10, help="Threshold value")
- Parameter with choices:
>>> param = ParameterSpec(
...     type=str,
...     default="auto",
...     choices=["auto", "manual", "disabled"],
...     help="Processing mode"
... )
- Required parameter with validation:
>>> def validate_positive(value):
...     if value <= 0:
...         raise ValueError("Must be positive")
...     return True
>>> param = ParameterSpec(
...     type=int,
...     required=True,
...     validator=validate_positive,
...     help="Positive integer"
... )
- List parameter with element type validation:
>>> param = ParameterSpec(
...     type=list,
...     element_type=str,
...     default=["image", "table"],
...     help="Node types to remove"
... )
- type: Type
- default: Any = None
- help: str = ''
- cli_flag: str | None = None
- required: bool = False
- choices: list[Any] | None = None
- validator: Callable[[Any], bool] | None = None
- element_type: Type | None = None
- expose: bool | None = None
- DEFAULT_EXPOSE: ClassVar[bool] = False
- validate(value: Any) bool
Validate a parameter value.
- Parameters:
value (Any) – Value to validate. For list types, tuples are accepted and coerced to lists automatically.
- Returns:
True if valid
- Return type:
bool
- Raises:
ValueError – If value is invalid
Notes
When validating list parameters, this method accepts both list and tuple types. Tuples are automatically coerced to lists to accommodate CLI parsers that often yield tuples. The coercion is transparent to the caller.
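The checks described above (type, tuple-to-list coercion, element type, choices, custom validator) can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: ParamSketch and its expected_type field are hypothetical names, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ParamSketch:
    """Hypothetical stand-in for ParameterSpec; not the library class."""

    expected_type: type
    choices: Optional[list] = None
    validator: Optional[Callable[[Any], bool]] = None
    element_type: Optional[type] = None

    def validate(self, value: Any) -> bool:
        # Accept tuples where a list is expected (CLI parsers often yield tuples).
        if self.expected_type is list and isinstance(value, tuple):
            value = list(value)
        if not isinstance(value, self.expected_type):
            raise ValueError(f"expected {self.expected_type.__name__}")
        # For list parameters, check every element against element_type.
        if self.expected_type is list and self.element_type is not None:
            for item in value:
                if not isinstance(item, self.element_type):
                    raise ValueError(f"bad list element: {item!r}")
        if self.choices is not None and value not in self.choices:
            raise ValueError(f"{value!r} is not one of {self.choices}")
        # A validator may return False or raise ValueError itself.
        if self.validator is not None and self.validator(value) is False:
            raise ValueError(f"custom validator rejected {value!r}")
        return True


spec = ParamSketch(expected_type=list, element_type=str)
print(spec.validate(("image", "table")))  # True
```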
- get_cli_flag(param_name: str) str
Get CLI flag name for this parameter.
- Parameters:
param_name (str) – Parameter name from the transform
- Returns:
CLI flag (e.g., '--threshold')
- Return type:
str
- should_expose(default: bool | None = None) bool
Determine whether this parameter should surface in the CLI.
- get_dest_name(param_name: str, transform_name: str) str
Get argparse dest name for this parameter.
This provides a consistent naming convention for transform parameters in the argparse namespace, avoiding conflicts between transforms.
- Parameters:
param_name (str) – Parameter name from the transform
transform_name (str) – Name of the transform
- Returns:
Destination name for argparse (e.g., 'heading_offset_offset')
- Return type:
str
Notes
The dest name is constructed to avoid collisions:
- Format: f'{transform_name}_{param_name}'
- Hyphens converted to underscores for valid Python identifiers
- Example: 'heading-offset' transform, 'offset' param -> 'heading_offset_offset'
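The naming rule in these notes fits in one line. The helper below is a hypothetical free function written only to illustrate the stated format; it is not the method itself.

```python
def dest_name(transform_name: str, param_name: str) -> str:
    """Join the two names and replace hyphens so the result is a valid identifier."""
    return f"{transform_name}_{param_name}".replace("-", "_")


print(dest_name("heading-offset", "offset"))  # heading_offset_offset
```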
- get_argparse_kwargs(param_name: str, transform_name: str) dict
Generate argparse kwargs for this parameter.
This centralizes the logic for converting ParameterSpec to argparse add_argument() kwargs, ensuring consistency between CLI argument definition and parameter extraction.
- Parameters:
param_name (str) – Parameter name from the transform
transform_name (str) – Name of the transform (for help text)
- Returns:
Keyword arguments for argparse.ArgumentParser.add_argument()
- Return type:
dict
Notes
The returned dict includes:
- 'action': Tracking action class (TrackingStoreAction, etc.)
- 'type': Python type for conversion (if applicable)
- 'default': Default value (if applicable)
- 'help': Help text
- 'choices': Valid choices (if specified)
- 'nargs': Argument count (for list types)
- 'dest': Destination name in namespace
Examples
>>> param = ParameterSpec(type=int, default=10, help="Threshold")
>>> kwargs = param.get_argparse_kwargs('threshold', 'my-transform')
>>> # Returns: {'action': TrackingStoreAction, 'type': int,
>>> #           'default': 10, 'help': 'Threshold', 'dest': 'my_transform_threshold'}
- extract_value(namespace: Any, dest: str) tuple[Any, bool]
Extract parameter value from parsed argparse namespace.
This handles extracting the value and determining if it was explicitly provided by the user (vs. being a default value).
- Parameters:
namespace (argparse.Namespace) – Parsed command line arguments
dest (str) – Destination name in the namespace (from get_dest_name())
- Returns:
Tuple of (value, was_provided) where:
- value: The parameter value (or None if not provided)
- was_provided: True if user explicitly provided this value
- Return type:
tuple[Any, bool]
Notes
This method checks the _provided_args set in the namespace to determine if a value was explicitly provided by the user. Only explicitly provided values should be passed to transform constructors.
Examples
>>> import argparse
>>> namespace = argparse.Namespace(
...     my_transform_threshold=20,
...     _provided_args={'my_transform_threshold'}
... )
>>> param = ParameterSpec(type=int, default=10)
>>> value, provided = param.extract_value(namespace, 'my_transform_threshold')
>>> # Returns: (20, True)
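To show how a _provided_args set might get populated in the first place, here is a self-contained sketch built only on the standard argparse module. TrackingStore and the free function extract_value are hypothetical stand-ins; the library's own tracking actions may differ in detail.

```python
import argparse
from typing import Any, Tuple


class TrackingStore(argparse.Action):
    """Hypothetical action: stores the value and records that the flag was given."""

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)
        provided = getattr(namespace, "_provided_args", None)
        if provided is None:
            provided = set()
            setattr(namespace, "_provided_args", provided)
        provided.add(self.dest)


def extract_value(namespace: argparse.Namespace, dest: str) -> Tuple[Any, bool]:
    """Return (value, was_provided); value is None when the flag was omitted."""
    provided = dest in getattr(namespace, "_provided_args", set())
    return (getattr(namespace, dest, None) if provided else None, provided)


parser = argparse.ArgumentParser()
parser.add_argument("--threshold", dest="my_transform_threshold",
                    action=TrackingStore, type=int, default=10)
parser.add_argument("--mode", dest="my_transform_mode",
                    action=TrackingStore, default="auto")

args = parser.parse_args(["--threshold", "20"])
print(extract_value(args, "my_transform_threshold"))  # (20, True)
print(extract_value(args, "my_transform_mode"))       # (None, False)
```

Because argparse only invokes an action when its flag appears on the command line, defaults never enter the set, which is what lets the caller tell an explicit value from a default.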
- __init__(type: Type, default: Any = None, help: str = '', cli_flag: str | None = None, required: bool = False, choices: list[Any] | None = None, validator: Callable[[Any], bool] | None = None, element_type: Type | None = None, expose: bool | None = None) None
- class all2md.transforms.TransformRegistry
Bases: object
Registry for managing AST transforms.
This singleton class provides a central registry for all transforms, handling:
- Transform registration and discovery
- Entry point plugin loading
- Dependency resolution
- Lazy instantiation
The registry automatically discovers transforms via the all2md.transforms entry point group on first access.
Notes
The preferred way to access the registry is to import the global transform_registry instance rather than instantiating this class directly. Instantiation also works because of the singleton pattern, but importing transform_registry is more explicit.
Examples
- Use the global registry instance (preferred):
>>> from all2md.transforms import transform_registry
>>> transform_registry.register(metadata)
- Get a transform instance:
>>> from all2md.transforms import transform_registry
>>> transformer = transform_registry.get_transform("remove-images")
- List all available transforms:
>>> from all2md.transforms import transform_registry
>>> transforms = transform_registry.list_transforms()
- static __new__(cls) TransformRegistry
Create or return singleton instance.
- register(metadata: TransformMetadata) None
Register a transform with its metadata.
- Parameters:
metadata (TransformMetadata) – Transform metadata to register
Notes
If a transform with the same name is already registered, it will be overwritten and a warning will be logged.
Examples
>>> metadata = TransformMetadata(
...     name="my-transform",
...     description="My custom transform",
...     transformer_class=MyTransform
... )
>>> transform_registry = TransformRegistry()
>>> transform_registry.register(metadata)
- unregister(name: str) bool
Unregister a transform.
- Parameters:
name (str) – Transform name to unregister
- Returns:
True if transform was unregistered, False if not found
- Return type:
bool
- get_metadata(name: str) TransformMetadata
Get metadata for a transform.
- Parameters:
name (str) – Transform name
- Returns:
Transform metadata
- Return type:
- Raises:
KeyError – If transform is not registered
- get_transform(name: str, **kwargs: Any) NodeTransformer
Get a transform instance by name.
- Parameters:
name (str) – Transform name
**kwargs – Parameters to pass to transform constructor
- Returns:
Transform instance
- Return type:
- Raises:
KeyError – If transform is not registered
ValueError – If parameters are invalid
Examples
>>> transform_registry = TransformRegistry()
>>> transformer = transform_registry.get_transform("heading-offset", offset=2)
- has_transform(name: str) bool
Check if a transform is registered.
- Parameters:
name (str) – Transform name
- Returns:
True if transform is registered
- Return type:
bool
- list_transforms(tags: list[str] | None = None) list[str]
List all registered transform names.
- Parameters:
tags (list[str], optional) – Filter by tags. If provided, only transforms with at least one matching tag are returned
- Returns:
List of transform names, sorted alphabetically
- Return type:
list[str]
Examples
- List all transforms:
>>> names = transform_registry.list_transforms()
- List transforms with specific tags:
>>> image_transforms = transform_registry.list_transforms(tags=["images"])
- discover_plugins() int
Discover and register transforms from entry points.
This method scans for plugins using the all2md.transforms entry point group and registers all discovered transforms.
- Returns:
Number of transforms discovered and registered
- Return type:
int
Examples
>>> transform_registry = TransformRegistry()
>>> count = transform_registry.discover_plugins()
>>> print(f"Discovered {count} transforms")
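To see what the entry point scan looks like at the standard-library level, the sketch below lists the names advertised under the all2md.transforms group without loading them. The function name is hypothetical, and what the registry does with each loaded entry point is not shown here.

```python
from importlib.metadata import entry_points


def list_transform_entry_points(group: str = "all2md.transforms") -> list:
    """Return the sorted names of entry points in `group`, without importing them."""
    eps = entry_points()
    if hasattr(eps, "select"):       # Python 3.10 and later
        selected = eps.select(group=group)
    else:                            # Python 3.8/3.9 return a dict of lists
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Prints an empty list when no installed package registers a transform.
print(list_transform_entry_points())
```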
- resolve_dependencies(transform_names: list[str]) list[str]
Resolve transform dependencies and return execution order.
This method performs topological sorting using Kahn’s algorithm to determine the correct execution order based on dependencies and priorities. Priority is used as a tiebreaker among transforms with no pending dependencies.
- Parameters:
transform_names (list[str]) – List of transform names to order
- Returns:
Transform names in execution order (dependencies first)
- Return type:
list[str]
- Raises:
ValueError – If circular dependencies are detected or a dependency is not found
Examples
>>> transform_registry = TransformRegistry()
>>> ordered = transform_registry.resolve_dependencies([
...     "sanitize-links",  # depends on "extract-metadata"
...     "extract-metadata"
... ])
>>> print(ordered)
['extract-metadata', 'sanitize-links']
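The ordering rule can be sketched as Kahn's algorithm with a heap of ready transforms keyed by priority. This is an illustration, not the registry's code: resolve_order and the specs mapping are hypothetical, the sketch assumes every dependency is also in the requested list, and the priority of 50 for remove-images is made up for the example.

```python
import heapq
from typing import Dict, List, Tuple


def resolve_order(specs: Dict[str, Tuple[int, List[str]]], names: List[str]) -> List[str]:
    """Topologically sort `names`; `specs` maps name -> (priority, dependencies)."""
    pending: Dict[str, int] = {}
    dependents: Dict[str, List[str]] = {name: [] for name in names}
    for name in names:
        if name not in specs:
            raise ValueError(f"unknown transform: {name}")
        deps = specs[name][1]
        for dep in deps:
            if dep not in dependents:
                raise ValueError(f"dependency not found: {dep}")
            dependents[dep].append(name)
        pending[name] = len(deps)

    # Ready transforms sit in a heap keyed by (priority, name): lower runs first.
    ready = [(specs[name][0], name) for name in names if pending[name] == 0]
    heapq.heapify(ready)
    ordered: List[str] = []
    while ready:
        _, name = heapq.heappop(ready)
        ordered.append(name)
        for child in dependents[name]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (specs[child][0], child))

    # Anything left unprocessed is part of a cycle.
    if len(ordered) != len(names):
        raise ValueError("circular dependency detected")
    return ordered


specs = {
    "extract-metadata": (100, []),
    "sanitize-links": (200, ["extract-metadata"]),
    "remove-images": (50, []),
}
print(resolve_order(specs, ["sanitize-links", "extract-metadata", "remove-images"]))
# ['remove-images', 'extract-metadata', 'sanitize-links']
```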
- clear() None
Clear all registered transforms.
This is primarily useful for testing.
- class all2md.transforms.HookManager
Bases: object
Manager for registering and executing hooks.
This class provides a central registry for hooks at various pipeline stages and for specific node types.
- Parameters:
strict (bool, default = False) – If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.
Examples
- Create a hook manager:
>>> manager = HookManager()
- Create a strict hook manager:
>>> manager = HookManager(strict=True)
- Register a pipeline hook:
>>> def pre_render_hook(doc, context):
...     print("About to render")
...     return doc
>>> manager.register_hook('pre_render', pre_render_hook)
- Register a node hook:
>>> def image_hook(node, context):
...     print(f"Processing image: {node.url}")
...     return node
>>> manager.register_hook('image', image_hook)
- Execute hooks:
>>> context = HookContext(document=my_doc)
>>> result = manager.execute_hooks('pre_render', my_doc, context)
Notes
In strict mode (strict=True), any exception raised by a hook will be re-raised and abort the pipeline. This is useful for debugging or when hook failures should be treated as critical errors.
In non-strict mode (strict=False, the default), exceptions are logged with full traceback but execution continues with subsequent hooks. This provides a fail-safe default that prevents a single problematic hook from breaking the entire pipeline.
Thread Safety
WARNING: HookManager instances are NOT thread-safe. Hook registration and execution use shared mutable state without synchronization.
For safe concurrent usage:
- Create a separate HookManager instance per thread/pipeline (recommended)
- Each Pipeline instance creates its own HookManager (default behavior)
- If sharing across threads, wrap access with external locks (e.g., threading.Lock)
- __init__(strict: bool = False) None
Initialize the hook manager.
- Parameters:
strict (bool, default = False) – Enable strict mode for hook exception handling
- register_hook(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], hook: Callable[[Any, HookContext], Any], priority: int = 100) None
Register a hook for a target.
- Parameters:
target (HookTarget) – Hook point or node type to hook into
hook (callable) – Hook function with signature: (obj, context) -> obj
priority (int, default = 100) – Execution priority (lower runs first)
Notes
Hooks for the same target are executed in priority order (lower first). If priorities are equal, hooks run in registration order.
Sorting is deferred until execution time for better performance when registering many hooks.
Examples
>>> manager = HookManager()
>>> manager.register_hook('image', my_image_hook, priority=50)
- unregister_hook(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], hook: Callable[[Any, HookContext], Any]) bool
Unregister a hook.
- Parameters:
target (HookTarget) – Hook point or node type
hook (callable) – Hook function to remove
- Returns:
True if hook was found and removed
- Return type:
bool
- execute_hooks(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], obj: Any, context: HookContext) Any
Execute all hooks for a target.
Hooks are executed in priority order. Each hook receives the result from the previous hook. If a hook returns None, the object is removed (for node hooks).
- Parameters:
target (HookTarget) – Hook point or node type
obj (Any) – Object to process (Document or Node)
context (HookContext) – Hook context
- Returns:
Processed object (or None if removed by a hook)
- Return type:
Any
- Raises:
Exception – Any exception from hooks if strict mode is enabled
Examples
>>> context = HookContext(document=doc)
>>> result = manager.execute_hooks('image', image_node, context)
Notes
In strict mode, exceptions from hooks are re-raised and abort execution. In non-strict mode (default), exceptions are logged and execution continues.
Hooks are sorted by priority at execution time for better registration performance when many hooks are registered.
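The execution rules above (priority order, result chaining, removal on None, strict versus non-strict error handling) can be sketched in a few lines. run_hooks is a hypothetical free function, a plain dict stands in for HookContext, and the choice to pass the unchanged object on after a failed hook is an assumption.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("hook_sketch")

Hook = Callable[[Any, dict], Any]


def run_hooks(hooks: List[Tuple[int, Hook]], obj: Any, context: dict,
              strict: bool = False) -> Any:
    """Run (priority, hook) pairs in priority order, chaining each result."""
    # sorted() is stable, so equal priorities keep registration order.
    for _, hook in sorted(hooks, key=lambda entry: entry[0]):
        try:
            result = hook(obj, context)
        except Exception:
            if strict:
                raise
            logger.exception("hook %r failed; continuing", hook)
            continue  # assumption: the object is passed on unchanged
        if result is None:
            return None  # a hook returning None removes the object
        obj = result
    return obj


def upper(text, context):
    return text.upper()


def broken(text, context):
    raise RuntimeError("boom")


def exclaim(text, context):
    return text + "!"


hooks = [(100, exclaim), (10, upper), (50, broken)]
print(run_hooks(hooks, "hello", {}))  # HELLO!
```

Calling the same function with strict=True re-raises the RuntimeError from the broken hook instead of logging it.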
- has_hooks(target: Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description']) bool
Check if any hooks are registered for a target.
- Parameters:
target (HookTarget) – Hook point or node type
- Returns:
True if hooks are registered
- Return type:
bool
- static get_node_type(node: Node) Literal['document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'] | None
Get the node type string for a node instance.
This static method supports subclasses by using isinstance checks rather than exact type matching. If a node is a subclass of a known type, it will be identified by its parent type.
- Parameters:
node (Node) – AST node
- Returns:
Node type string (e.g., ‘heading’, ‘image’), or None if unknown
- Return type:
NodeType or None
Notes
The method iterates through known node types and returns the first match using isinstance checks. This allows custom subclasses to be recognized by their base type. For example, a custom MyImage(Image) subclass will be identified as type ‘image’.
Performance: Uses module-level _NODE_TYPE_MAP constant to avoid reconstructing the mapping on every call (hot path optimization).
This is a static method because it doesn’t depend on instance state, only on the module-level _NODE_TYPE_MAP constant. This allows it to be called without instantiating HookManager.
Examples
>>> from all2md.ast.nodes import Image
>>> img = Image(url="test.png", alt_text="Test")
>>> node_type = HookManager.get_node_type(img)
>>> print(node_type)
image
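The isinstance-based lookup can be illustrated with stand-in classes. Everything below is hypothetical (the real node classes and the contents of _NODE_TYPE_MAP live in the library); the sketch only shows why a subclass resolves to its parent's type string.

```python
from typing import Optional


class Node:
    pass


class Image(Node):
    pass


class Heading(Node):
    pass


class MyImage(Image):
    """Custom subclass; should be reported as 'image'."""


# Built once at import time rather than on every call.
_NODE_TYPE_MAP = {Image: "image", Heading: "heading"}


def get_node_type(node: Node) -> Optional[str]:
    """Return the first type string whose class matches via isinstance."""
    for node_class, type_name in _NODE_TYPE_MAP.items():
        if isinstance(node, node_class):
            return type_name
    return None


print(get_node_type(MyImage()))  # image
print(get_node_type(Node()))     # None
```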
- list_hooks() dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[tuple[int, Callable[[Any, HookContext], Any]]]]
List all registered hooks with their priorities.
This method provides a public API for enumerating hooks without exposing the internal _hooks dictionary structure.
- Returns:
Dictionary mapping hook targets to lists of (priority, hook) tuples. The returned dictionary is a shallow copy to prevent external modifications to internal state.
- Return type:
dict[HookTarget, list[tuple[int, HookCallable]]]
Examples
>>> manager = HookManager()
>>> manager.register_hook('image', my_hook, priority=50)
>>> hooks = manager.list_hooks()
>>> print(hooks)
{'image': [(50, <function my_hook>)]}
- clear() None
Clear all registered hooks.
This is primarily useful for testing.
- class all2md.transforms.HookContext
Bases: object
Context passed to hook functions.
This class provides hooks with access to document state, metadata, and a shared data dictionary for passing information between hooks and transforms.
- Parameters:
document (Document) – The current document being processed
metadata (dict, default = empty dict) – Document metadata from the source format
shared (dict, default = empty dict) – Shared mutable dictionary for passing data between hooks/transforms
transform_name (str, optional) – Name of the current transform (for transform hooks)
node_path (list[Node], default = empty list) – Path from document root to current node (for node hooks). WARNING: This list is mutated during tree traversal. Not thread-safe.
Examples
Access context in a hook:
>>> def my_hook(node: Image, context: HookContext) -> Image:
...     # Store image count in shared state
...     context.shared['image_count'] = context.shared.get('image_count', 0) + 1
...
...     # Access document metadata
...     if 'author' in context.metadata:
...         print(f"Document by: {context.metadata['author']}")
...
...     return node
- document: Document
- metadata: dict[str, Any]
- shared: dict[str, Any]
- transform_name: str | None = None
- node_path: list[Node]
- get_shared(key: str, default: Any = None) Any
Get a value from shared state.
- Parameters:
key (str) – Key to retrieve
default (Any, optional) – Default value if key not found
- Returns:
Value from shared state or default
- Return type:
Any
- set_shared(key: str, value: Any) None
Set a value in shared state.
- Parameters:
key (str) – Key to set
value (Any) – Value to store
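As an illustration of how the shared dictionary carries state between hook calls, here is a small stand-in with the same fields and accessor names. ContextSketch is hypothetical and written only for this example; it is not the library class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ContextSketch:
    """Hypothetical stand-in mirroring the HookContext fields listed above."""

    document: Any
    metadata: Dict[str, Any] = field(default_factory=dict)
    shared: Dict[str, Any] = field(default_factory=dict)
    transform_name: Optional[str] = None
    node_path: List[Any] = field(default_factory=list)

    def get_shared(self, key: str, default: Any = None) -> Any:
        return self.shared.get(key, default)

    def set_shared(self, key: str, value: Any) -> None:
        self.shared[key] = value


def count_images(node, context):
    # Each call increments a counter that later hooks can read.
    context.set_shared("image_count", context.get_shared("image_count", 0) + 1)
    return node


ctx = ContextSketch(document=None)
for image in ["a.png", "b.png", "c.png"]:
    count_images(image, ctx)
print(ctx.get_shared("image_count"))  # 3
```

Using default_factory gives every context its own dictionaries, so state does not leak between two contexts.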
- class all2md.transforms.Pipeline
Bases: object
Pipeline for transforming and rendering AST documents.
This class orchestrates the complete transformation and rendering pipeline, including transform resolution, hook execution, and rendering to output format.
- Parameters:
transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances. Names are resolved via TransformRegistry
hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables
renderer (str, type, or renderer instance, optional) – Renderer to use for output. Can be:
- Format name string (e.g., “markdown”), looked up via registry
- Renderer class (e.g., MarkdownRenderer)
- Renderer instance (e.g., MarkdownRenderer())
Defaults to MarkdownRenderer with default options
options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class, ignored if instance)
progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering
strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.
Examples
Create pipeline with default markdown renderer:
>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     hooks={'pre_render': [validate]}
... )
>>> output = pipeline.execute(document)
With custom renderer:
>>> from all2md.renderers.markdown import MarkdownRenderer
>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     renderer=MarkdownRenderer(options=MarkdownRendererOptions(flavor='commonmark'))
... )
>>> output = pipeline.execute(document)
With strict hook mode:
>>> pipeline = Pipeline(
...     transforms=['remove-images'],
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort pipeline
... )
>>> output = pipeline.execute(document)
- __init__(transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, renderer: str | type | Any | bool | None = None, options: BaseRendererOptions | MarkdownRendererOptions | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False)
Initialize pipeline with transforms, hooks, renderer, and options.
- Parameters:
transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances
hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables
renderer (str, type, renderer instance, False, or None) –
str: Format name to look up via registry (e.g., “markdown”)
type: Renderer class to instantiate
instance: Pre-configured renderer to use
False: Skip renderer setup (for AST-only processing)
None: Use default MarkdownRenderer (default)
options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class, ignored if renderer is instance)
progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering
strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False, exceptions are logged and execution continues.
- get_diagnostics() dict[str, Any]
Get diagnostic information about the pipeline configuration.
This method returns structured information about the pipeline’s configuration, useful for debugging, visualization, and documentation.
- Returns:
Dictionary containing:
- transforms: List of transform names in execution order
- hooks: Dictionary of hook targets with their registered hooks and priorities
- renderer: Renderer class name
- options: Renderer options class name (if available)
- Return type:
dict[str, Any]
Examples
>>> pipeline = Pipeline(
...     transforms=['remove-images', 'heading-offset'],
...     hooks={'image': [my_hook], 'pre_render': [validator]},
...     renderer='markdown'
... )
>>> diag = pipeline.get_diagnostics()
>>> print(diag['transforms'])
['RemoveImagesTransform', 'HeadingOffsetTransform']
>>> print(diag['hooks'])
{'image': [{'priority': 0, 'function': 'my_hook'}], ...}
- execute(document: Document) str | bytes
Execute complete pipeline.
This method runs the full transformation and rendering pipeline:
1. Execute post_ast hooks
2. Apply transforms (with pre/post transform hooks)
3. Apply element hooks
4. Execute pre_render hooks
5. Render to output format
6. Execute post_render hooks
If a progress_callback is configured, progress events are emitted at each stage of the pipeline.
- Parameters:
document (Document) – Document to process
- Returns:
Rendered output (type depends on renderer)
- Return type:
str or bytes
Examples
>>> pipeline = Pipeline(transforms=['remove-images'])
>>> output = pipeline.execute(document)
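The stage ordering listed for execute() can be traced with a small stand-alone sketch. It is a simplification under stated assumptions: run_pipeline and tracer are hypothetical helpers, a dict replaces HookContext, transforms are plain callables, and the element-hook traversal (step 3) is left out.

```python
from typing import Any, Callable

Hook = Callable[[Any, dict], Any]


def run_pipeline(document, transforms, hooks: dict[str, list[Hook]], render):
    """Simplified stage ordering; element hooks (step 3) are left out."""
    context: dict = {}

    def fire(stage: str, value: Any) -> Any:
        # Run every hook registered for this stage, threading the value through.
        for hook in hooks.get(stage, []):
            value = hook(value, context)
        return value

    document = fire("post_ast", document)
    for transform in transforms:
        document = fire("pre_transform", document)
        document = transform(document)
        document = fire("post_transform", document)
    document = fire("pre_render", document)
    output = render(document)
    return fire("post_render", output)


calls: list[str] = []


def tracer(stage: str) -> Hook:
    """Build a hook that records its stage name and passes the value through."""
    def hook(value, context):
        calls.append(stage)
        return value
    return hook


stages = ["post_ast", "pre_transform", "post_transform", "pre_render", "post_render"]
output = run_pipeline(
    "Hello",
    transforms=[str.upper],
    hooks={stage: [tracer(stage)] for stage in stages},
    render=lambda doc: f"# {doc}",
)
```

With a single transform, calls ends up in the same order as stages; each additional transform would add another pre_transform/post_transform pair.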
- class all2md.transforms.HookAwareVisitor
Bases:
NodeTransformer
Visitor that applies element hooks during tree traversal.
This visitor extends NodeTransformer to execute registered element hooks for each node type during traversal. It maintains the node path in the context for hooks that need to know the tree structure.
- Parameters:
hook_manager (HookManager) – Manager containing registered hooks
context (HookContext) – Context to pass to hooks
Examples
>>> hook_manager = HookManager()
>>> hook_manager.register_hook('image', my_image_hook)
>>> context = HookContext(document=doc)
>>> visitor = HookAwareVisitor(hook_manager, context)
>>> processed_doc = visitor.transform(doc)
- __init__(hook_manager: HookManager, context: HookContext)
Initialize visitor with hook manager and context.
- transform(node: Node) Node | None
Transform node and apply element hooks.
The node is pushed onto node_path before processing and remains there during child traversal, ensuring descendants can see full ancestry. If a hook replaces the node, the path is updated so descendants see the new node in their ancestry.
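The path bookkeeping described for transform() can be sketched with a toy tree. This is an illustration only: Node, walk, and the two hooks are hypothetical, the path is a plain list rather than a HookContext attribute, and treating a None return as "remove the node" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)


def walk(node, hooks, path):
    """Apply hooks for node.kind, keeping `path` equal to the current ancestry."""
    path.append(node)  # pushed before hooks run and before children are visited
    try:
        for hook in hooks.get(node.kind, []):
            result = hook(node, path)
            if result is None:
                return None  # assumption: returning None drops the node
            node = result
            path[-1] = node  # descendants must see the replacement, not the original
        kept = []
        for child in node.children:
            visited = walk(child, hooks, path)
            if visited is not None:
                kept.append(visited)
        node.children = kept
        return node
    finally:
        path.pop()  # always restore the path, even on early return


seen = []


def record_ancestry(node, path):
    seen.append([n.kind for n in path])
    return node


def quote_paragraph(node, path):
    # Replace the paragraph with a block quote that keeps the same children.
    return Node("block_quote", node.children)


doc = Node("document", [Node("paragraph", [Node("image")])])
walk(doc, {"image": [record_ancestry], "paragraph": [quote_paragraph]}, [])
```

Because the paragraph hook swaps in a block quote before the children are visited, the image hook records the replacement in its ancestry rather than the original paragraph.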
- all2md.transforms.apply(document: Document, transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False) Document
Apply transforms and hooks to document without rendering.
This function provides AST-only processing by applying transforms and hooks to a document without the rendering stage. It reuses Pipeline internals to maintain consistent hook execution order.
This is useful for developers who want to:
- Process AST structures programmatically
- Chain multiple transformation passes
- Inspect/modify documents before rendering
- Build custom rendering pipelines
- Parameters:
document (Document) – AST document to process
transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances
hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables. Hook targets can be pipeline stages (‘post_ast’, ‘pre_transform’, ‘post_transform’, ‘pre_render’) or node types (‘image’, ‘link’, etc.)
progress_callback (ProgressCallback, optional) – Optional callback for progress updates during processing
strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.
- Returns:
Processed document with transforms and hooks applied
- Return type:
- Raises:
TypeError – If transform is not a string or NodeTransformer
ValueError – If transform name is not found or a hook removes the document node
Notes
The following hooks are executed in order:
1. post_ast - After AST creation (document just came from conversion)
2. pre_transform - Before each transform
3. post_transform - After each transform
4. pre_render - Before element hooks (for document-level validation)
5. Element hooks - During tree traversal (image, link, heading, etc.)
The post_render hook is NOT executed since no rendering occurs.
Examples
Apply transforms only:
>>> from all2md import to_ast
>>> from all2md.transforms import apply
>>> doc = to_ast("document.pdf")
>>> processed = apply(doc, transforms=['remove-images'])
Apply hooks only:
>>> def log_image(node, context):
...     print(f"Found image: {node.url}")
...     return node
>>> processed = apply(doc, hooks={'image': [log_image]})
Apply both transforms and hooks:
>>> from all2md.transforms import HeadingOffsetTransform
>>> processed = apply(
...     doc,
...     transforms=[HeadingOffsetTransform(offset=1), 'remove-images'],
...     hooks={
...         'pre_render': [validate_document],
...         'link': [rewrite_links]
...     }
... )
Chain multiple processing passes:
>>> from all2md.transforms import render
>>> doc1 = apply(doc, transforms=['heading-offset'])
>>> doc2 = apply(doc1, transforms=['remove-images'])
>>> markdown = render(doc2)
With strict hook mode:
>>> processed = apply(
...     doc,
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort
... )
With progress tracking:
>>> def progress_handler(event):
...     print(f"{event.event_type}: {event.message}")
>>> processed = apply(doc, transforms=['remove-images'], progress_callback=progress_handler)
- all2md.transforms.render(document: Document, transforms: list[str | NodeTransformer] | None = None, hooks: dict[Literal['post_ast', 'pre_transform', 'post_transform', 'pre_render', 'post_render', 'document', 'heading', 'paragraph', 'code_block', 'block_quote', 'list', 'list_item', 'table', 'table_row', 'table_cell', 'thematic_break', 'html_block', 'text', 'emphasis', 'strong', 'code', 'link', 'image', 'line_break', 'strikethrough', 'underline', 'superscript', 'subscript', 'html_inline', 'footnote_reference', 'footnote_definition', 'math_inline', 'math_block', 'definition_list', 'definition_term', 'definition_description'], list[Callable[[Any, HookContext], Any]]] | None = None, renderer: str | type | Any | None = None, options: BaseRendererOptions | MarkdownRendererOptions | None = None, progress_callback: Callable[[ProgressEvent], None] | None = None, strict_hooks: bool = False, **kwargs: Any) str | bytes
Render document with transforms and hooks using specified renderer.
This is the high-level entry point for the transformation pipeline. It creates a Pipeline instance and executes it to produce rendered output.
- Parameters:
document (Document) – AST document to render
transforms (list, optional) – List of transforms to apply. Can be transform names (str) or NodeTransformer instances
hooks (dict, optional) – Dictionary mapping hook targets to lists of hook callables. Hook targets can be pipeline stages (‘pre_render’, ‘post_render’, etc.) or node types (‘image’, ‘link’, ‘heading’, etc.)
renderer (str, type, or renderer instance, optional) – Renderer to use. Can be:
Format name string (e.g., “markdown”), looked up via registry
Renderer class (e.g., MarkdownRenderer)
Renderer instance (e.g., MarkdownRenderer())
Defaults to MarkdownRenderer.
options (BaseRendererOptions or MarkdownRendererOptions, optional) – Options for rendering (used if renderer is string or class)
progress_callback (ProgressCallback, optional) – Optional callback for progress updates during rendering
strict_hooks (bool, default = False) – Enable strict mode for hook exception handling. If True, hook exceptions are re-raised and abort the pipeline. If False (default), exceptions are logged and execution continues.
**kwargs – Additional keyword arguments passed to MarkdownRendererOptions if options is not provided and renderer is markdown
- Returns:
Rendered output (type depends on renderer)
- Return type:
str or bytes
- Raises:
TypeError – If transform is not a string or NodeTransformer
ValueError – If transform name is not found or a hook removes a required node
Examples
- Basic rendering to markdown:
>>> from all2md import to_ast
>>> from all2md.transforms import render
>>> doc = to_ast("document.pdf")
>>> markdown = render(doc)
- With transforms by name:
>>> markdown = render(doc, transforms=['remove-images'])
- With custom renderer:
>>> from all2md.renderers.markdown import MarkdownRenderer
>>> output = render(
...     doc,
...     renderer=MarkdownRenderer(options=MarkdownRendererOptions(flavor='commonmark'))
... )
- With hooks:
>>> def log_image(node, context):
...     print(f"Found image: {node.url}")
...     return node
>>> markdown = render(doc, hooks={'image': [log_image]})
- Combined transforms and hooks:
>>> markdown = render(
...     doc,
...     transforms=['heading-offset', 'remove-images'],
...     hooks={
...         'pre_render': [validate_document],
...         'link': [rewrite_links],
...         'post_render': [add_footer]
...     },
...     options=MarkdownRendererOptions(flavor='commonmark')
... )
- With MarkdownRendererOptions kwargs:
>>> markdown = render(doc, flavor='gfm', emphasis_symbol='_')
- With strict hook mode:
>>> markdown = render(
...     doc,
...     hooks={'image': [validate_image]},
...     strict_hooks=True  # Hook failures will abort
... )
- class all2md.transforms.RemoveImagesTransform
Bases:
NodeTransformer
Remove all Image nodes from the AST.
This transform removes every Image node it encounters, useful for creating text-only versions of documents or reducing document size.
Examples
>>> transform = RemoveImagesTransform()
>>> doc_without_images = transform.transform(document)
- class all2md.transforms.RemoveNodesTransform
Bases:
NodeTransformer
Remove nodes of specified types from the AST.
This is a generic transform that can remove any combination of node types. Useful for stripping specific elements like tables, code blocks, or any other node type.
- Parameters:
node_types (list[str]) – List of node type names to remove (e.g., [‘image’, ‘table’, ‘code_block’])
Examples
Remove images and tables:
>>> transform = RemoveNodesTransform(node_types=['image', 'table'])
>>> cleaned_doc = transform.transform(document)
- __init__(node_types: list[str])
Initialize with list of node types to remove.
- Parameters:
node_types (list[str]) – Node type names to remove
- Raises:
ValueError – If ‘document’ is in node_types (cannot remove root node), or if any node_type is unknown (typo detection)
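The two ValueError conditions can be illustrated with a small validator. This is a sketch under assumptions: validate_node_types is a hypothetical helper, and KNOWN_NODE_TYPES is an abbreviated stand-in for whatever set of node type names the library actually checks against.

```python
# Abbreviated, illustrative set of node type names (not the library's full list).
KNOWN_NODE_TYPES = {"document", "heading", "paragraph", "image", "table", "code_block", "link"}


def validate_node_types(node_types):
    """Reject the root node and any unknown (likely misspelled) type name."""
    if "document" in node_types:
        raise ValueError("cannot remove the root 'document' node")
    unknown = sorted(set(node_types) - KNOWN_NODE_TYPES)
    if unknown:
        raise ValueError(f"unknown node types: {unknown}")
    return list(node_types)


valid = validate_node_types(["image", "table"])
```

Validating at construction time means a typo such as 'imgae' fails immediately instead of silently removing nothing.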
- class all2md.transforms.HeadingOffsetTransform
Bases:
NodeTransformer
Shift heading levels by a specified offset.
This transform adjusts all heading levels in the document by adding an offset value. Levels are clamped to the valid range of 1-6.
- Parameters:
offset (int, default = 1) – Number of levels to shift (positive to increase, negative to decrease)
Examples
Increase all heading levels by 1 (H1 becomes H2):
>>> transform = HeadingOffsetTransform(offset=1)
>>> new_doc = transform.transform(document)
Decrease all heading levels by 1 (H2 becomes H1):
>>> transform = HeadingOffsetTransform(offset=-1)
>>> new_doc = transform.transform(document)
- __init__(offset: int = 1)
Initialize with heading level offset.
- Parameters:
offset (int) – Heading level adjustment
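The clamping to levels 1-6 amounts to a bounded addition. A minimal sketch, where shift_level is a hypothetical helper rather than a method of the transform:

```python
def shift_level(level: int, offset: int = 1) -> int:
    """Add the offset, then clamp the result to the valid heading range 1-6."""
    return max(1, min(6, level + offset))


shifted = [shift_level(level, 1) for level in (1, 5, 6)]
```

Clamping means the shift is not reversible at the edges: an H6 pushed by +1 stays H6, and applying -1 afterwards yields H5.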
- class all2md.transforms.TitlePromotionTransform
Bases:
NodeTransformer
Promote a leading H1 to a document title and shift subsequent headings.
When converting Markdown to Word, a leading # Heading is typically the document title rather than a “Heading 1”. This transform detects a leading H1 (skipping empty / whitespace-only paragraphs before it), marks it with metadata["is_title"] = True, and promotes all subsequent headings by one level (H2 → H1, H3 → H2, etc.) so they style properly under the title.
If the first real content node is not an H1, the document passes through unchanged.
Examples
>>> transform = TitlePromotionTransform()
>>> new_doc = transform.transform(document)
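The promotion rule can be sketched on a flat list of blocks. Everything here is illustrative: Block and promote_title are hypothetical, the sketch mutates in place instead of building a new document, and leaving later H1 headings at level 1 is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    kind: str                 # "heading" or "paragraph"
    text: str = ""
    level: int = 0            # only meaningful for headings
    metadata: dict = field(default_factory=dict)


def promote_title(blocks):
    """Mark a leading H1 as the title and promote every other heading by one level."""
    # First "real" block: skip empty or whitespace-only paragraphs.
    first = next(
        (b for b in blocks if not (b.kind == "paragraph" and not b.text.strip())),
        None,
    )
    if first is None or first.kind != "heading" or first.level != 1:
        return blocks  # no leading H1: pass through unchanged
    first.metadata["is_title"] = True
    for block in blocks:
        if block is not first and block.kind == "heading":
            block.level = max(1, block.level - 1)  # assumption: never go below H1
    return blocks


blocks = promote_title([
    Block("paragraph", "   "),
    Block("heading", "Title", 1),
    Block("heading", "Intro", 2),
    Block("heading", "Detail", 3),
])
untouched = promote_title([Block("paragraph", "Preface"), Block("heading", "Intro", 2)])
```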
- class all2md.transforms.LinkRewriterTransform
Bases:
NodeTransformer
Rewrite link URLs using regex pattern matching.
This transform allows flexible URL rewriting using regular expressions. Useful for converting relative links to absolute, updating base URLs, or modifying link schemes.
- Parameters:
pattern (str) – Regex pattern to match in URLs
replacement (str) – Replacement string (can include regex groups like \1, \2)
- Raises:
SecurityError – If the pattern contains dangerous constructs that could lead to ReDoS (Regular Expression Denial of Service) attacks
Examples
Convert relative links to absolute:
>>> transform = LinkRewriterTransform(
...     pattern=r'^/docs/',
...     replacement='https://example.com/docs/'
... )
>>> new_doc = transform.transform(document)
Notes
For security reasons, this transform validates user-supplied regex patterns to prevent ReDoS attacks. Patterns with nested quantifiers or excessive backtracking potential are rejected. See validate_user_regex_pattern() for details on what patterns are considered safe.
- __init__(pattern: str, replacement: str)
Initialize with pattern and replacement.
- Parameters:
pattern (str) – Regex pattern
replacement (str) – Replacement string
- Raises:
SecurityError – If pattern contains dangerous constructs
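The effect of pattern and replacement on a single URL can be previewed with the standard re module. This sketch assumes the rewrite behaves like re.sub; rewrite_url is a hypothetical helper and does not perform the ReDoS validation described above.

```python
import re


def rewrite_url(url: str, pattern: str, replacement: str) -> str:
    """Apply a regex substitution to one URL (assumed to mirror re.sub semantics)."""
    return re.sub(pattern, replacement, url)


absolute = rewrite_url("/docs/intro.html", r"^/docs/", "https://example.com/docs/")
# Group references such as \1 work in the replacement string.
upgraded = rewrite_url("http://example.com/a", r"^http://(.+)$", r"https://\1")
unchanged = rewrite_url("/blog/post.html", r"^/docs/", "https://example.com/docs/")
```

URLs that do not match the pattern pass through untouched, as the last call shows.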
- class all2md.transforms.TextReplacerTransform
Bases:
NodeTransformer
Find and replace text in Text nodes.
This transform performs simple text replacement across all Text nodes in the document. For regex-based replacement, use a custom transform.
- Parameters:
find (str) – Text to find
replace (str) – Replacement text
Examples
Replace all instances of “TODO”:
>>> transform = TextReplacerTransform(find="TODO", replace="DONE")
>>> new_doc = transform.transform(document)
- __init__(find: str, replace: str)
Initialize with find and replace strings.
- Parameters:
find (str) – Text to find
replace (str) – Replacement text
- class all2md.transforms.AddHeadingIdsTransform
Bases:
NodeTransformer
Generate and add unique IDs to heading nodes.
This transform creates slugified IDs from heading text and adds them to the heading metadata. These IDs can be used by renderers to create HTML anchors for linkable sections.
- Parameters:
id_prefix (str, default = "") – Prefix to add to all generated IDs
separator (str, default = "-") – Separator for multi-word slugs and duplicate handling
Examples
Basic usage:
>>> transform = AddHeadingIdsTransform()
>>> new_doc = transform.transform(document)
>>> # "My Heading" -> metadata['id'] = "my-heading"
With prefix:
>>> transform = AddHeadingIdsTransform(id_prefix="doc-")
>>> new_doc = transform.transform(document)
>>> # "My Heading" -> metadata['id'] = "doc-my-heading"
- __init__(id_prefix: str = '', separator: str = '-')
Initialize with prefix and separator.
- Parameters:
id_prefix (str) – Prefix for IDs
separator (str) – Word separator
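How id_prefix and separator might combine into unique IDs can be sketched as follows. The slug rules (lowercase ASCII letters and digits only) and the numeric suffix format for duplicates are assumptions for illustration; slugify and assign_ids are hypothetical helpers, not the library's functions.

```python
import re


def slugify(text: str, separator: str = "-") -> str:
    """Lowercase the text and join its alphanumeric runs with the separator."""
    return separator.join(re.findall(r"[a-z0-9]+", text.lower()))


def assign_ids(headings, id_prefix: str = "", separator: str = "-"):
    """Return one ID per heading; repeated slugs get a numeric suffix (assumed format)."""
    seen: dict[str, int] = {}
    ids = []
    for text in headings:
        base = id_prefix + slugify(text, separator)
        seen[base] = seen.get(base, 0) + 1
        ids.append(base if seen[base] == 1 else f"{base}{separator}{seen[base]}")
    return ids


plain = assign_ids(["My Heading", "Setup", "My Heading"])
prefixed = assign_ids(["My Heading"], id_prefix="doc-")
```

This simple counter does not guard against a later heading whose own slug collides with a generated suffix; a production implementation would need to track every emitted ID.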
- class all2md.transforms.RemoveBoilerplateTextTransform
Bases:
NodeTransformer
Remove paragraphs matching common boilerplate patterns.
This transform removes paragraphs whose text matches predefined patterns like “CONFIDENTIAL”, “Page X of Y”, etc. Useful for cleaning up corporate documents and reports.
- Parameters:
patterns (list[str], optional) – List of regex patterns to match (default: common boilerplate)
skip_if_truncated (bool, default = True) – If True, skip pattern matching when text exceeds MAX_TEXT_LENGTH_FOR_REGEX to avoid false positives with end-anchored patterns ($). If False, match against truncated text (may produce incorrect results with anchors).
- Raises:
SecurityError – If any user-supplied pattern contains dangerous constructs that could lead to ReDoS (Regular Expression Denial of Service) attacks
Examples
Use default patterns:
>>> transform = RemoveBoilerplateTextTransform()
>>> cleaned_doc = transform.transform(document)
Custom patterns with anchoring:
>>> transform = RemoveBoilerplateTextTransform(
...     patterns=[r"^DRAFT$", r"^INTERNAL ONLY$", r"^Page \d+ of \d+$"]
... )
>>> cleaned_doc = transform.transform(document)
Allow matching truncated text (not recommended):
>>> transform = RemoveBoilerplateTextTransform(skip_if_truncated=False)
>>> cleaned_doc = transform.transform(document)
Notes
Pattern Matching Semantics: This transform uses Python’s re.match(), which implicitly anchors at the start of the string (equivalent to adding ^ at the beginning). For exact matching of entire paragraphs, patterns should include an end anchor ($). For example:
r"CONFIDENTIAL" - Matches paragraphs starting with “CONFIDENTIAL”
r"CONFIDENTIAL$" - Matches paragraphs that are exactly “CONFIDENTIAL” or start with “CONFIDENTIAL” followed by only whitespace
r"^CONFIDENTIAL$" - Explicitly anchored (redundant ^, but clearer)
If you need to match patterns anywhere in the text (not just at the start), use re.search() semantics by implementing a custom transform.
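The anchoring behaviour of re.match() versus re.search() can be checked directly with the standard library. The sample strings are made up for illustration.

```python
import re

text = "CONFIDENTIAL - internal use"

# re.match only tries the pattern at the start of the string.
prefix_hit = re.match(r"CONFIDENTIAL", text)        # matches: text starts with it
exact_miss = re.match(r"CONFIDENTIAL$", text)       # None: extra text follows
exact_hit = re.match(r"CONFIDENTIAL$", "CONFIDENTIAL")

# A pattern in the middle of the text is found by re.search, not re.match.
middle_match = re.match(r"CONFIDENTIAL", "This is CONFIDENTIAL")
middle_search = re.search(r"CONFIDENTIAL", "This is CONFIDENTIAL")
```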
Security: For security reasons, this transform validates user-supplied regex patterns to prevent ReDoS attacks. Default patterns are pre-validated and trusted. Patterns with nested quantifiers or excessive backtracking potential are rejected. See validate_user_regex_pattern() for details.
Truncation Behavior: Text longer than MAX_TEXT_LENGTH_FOR_REGEX (10000 characters) is truncated before matching for ReDoS protection. With skip_if_truncated=True (default), such paragraphs are preserved to avoid false positives from patterns using end anchors ($). This is safer but may miss some boilerplate. With skip_if_truncated=False, matching proceeds on truncated text, which may incorrectly match or not match patterns with anchors.
- __init__(patterns: list[str] | None = None, skip_if_truncated: bool = True)
Initialize with patterns.
- Parameters:
patterns (list[str] or None) – Regex patterns to match (None uses defaults)
skip_if_truncated (bool) – Skip matching when text is truncated (safer default)
- Raises:
SecurityError – If any user-supplied pattern contains dangerous constructs
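The trade-off behind skip_if_truncated can be demonstrated with a stand-alone sketch. is_boilerplate is a hypothetical helper that mirrors the documented behaviour (10000-character limit, re.match semantics); it is not the transform's actual code and skips pattern validation.

```python
import re

MAX_TEXT_LENGTH_FOR_REGEX = 10000  # limit quoted in the Notes


def is_boilerplate(text: str, patterns, skip_if_truncated: bool = True) -> bool:
    """Return True if any pattern matches at the start of the (possibly truncated) text."""
    if len(text) > MAX_TEXT_LENGTH_FOR_REGEX:
        if skip_if_truncated:
            return False  # preserve overlong paragraphs rather than risk a bad match
        text = text[:MAX_TEXT_LENGTH_FOR_REGEX]
    return any(re.match(pattern, text) for pattern in patterns)


# 10000 "A" characters followed by "b": the full text does not satisfy ^A+$.
long_text = "A" * 10000 + "b"
safe = is_boilerplate(long_text, [r"^A+$"])                             # skipped
false_positive = is_boilerplate(long_text, [r"^A+$"], skip_if_truncated=False)
```

The second call matches only because truncation cut off the trailing "b", which is the kind of end-anchor false positive the default setting avoids.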
- class all2md.transforms.AddConversionTimestampTransform
Bases:
NodeTransformer
Add conversion timestamp to document metadata.
This transform adds a timezone-aware UTC timestamp to the document metadata indicating when the conversion occurred. Useful for tracking document versions and conversion history. All timestamps are generated in UTC to ensure consistency across different time zones.
- Parameters:
field_name (str, default = "conversion_timestamp") – Metadata field name for the timestamp
timestamp_format (str, default = "iso") – Timestamp format: “iso” for ISO 8601 with timezone, “unix” for Unix timestamp, or any strftime format string
timespec (str, default = "seconds") – Time precision for ISO format timestamps. Valid values are:
“auto”: Automatic precision
“hours”: Hours precision
“minutes”: Minutes precision
“seconds”: Seconds precision (default, reduces noisy diffs)
“milliseconds”: Milliseconds precision
“microseconds”: Microseconds precision
Only applies when timestamp_format="iso". Ignored for other formats.
Examples
Add ISO 8601 timestamp with second precision (default):
>>> transform = AddConversionTimestampTransform()
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "2025-01-01T12:00:00+00:00"
Add ISO 8601 timestamp with microsecond precision:
>>> transform = AddConversionTimestampTransform(timespec="microseconds")
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "2025-01-01T12:00:00.123456+00:00"
Add Unix timestamp:
>>> transform = AddConversionTimestampTransform(timestamp_format="unix")
>>> new_doc = transform.transform(document)
>>> # metadata['conversion_timestamp'] = "1735732800"
Custom strftime format:
>>> transform = AddConversionTimestampTransform(
...     field_name="converted_at",
...     timestamp_format="%Y-%m-%d %H:%M:%S UTC"
... )
>>> new_doc = transform.transform(document)
>>> # metadata['converted_at'] = "2025-01-01 12:00:00 UTC"
Notes
All timestamps are generated in UTC (Coordinated Universal Time) using datetime.now(timezone.utc). This ensures consistent timestamps regardless of the server’s local timezone.
The default timespec="seconds" is recommended to reduce noisy git diffs when regenerating documents, as subsecond precision is rarely needed for document conversion timestamps.
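The three timestamp_format modes map onto standard datetime calls. The sketch below uses a fixed UTC instant so the results are reproducible; format_timestamp is a hypothetical helper, and returning the Unix timestamp as a string follows the example above rather than confirmed library behaviour.

```python
from datetime import datetime, timezone


def format_timestamp(moment: datetime, timestamp_format: str = "iso", timespec: str = "seconds") -> str:
    """Format a timezone-aware datetime as ISO 8601, Unix seconds, or via strftime."""
    if timestamp_format == "iso":
        return moment.isoformat(timespec=timespec)
    if timestamp_format == "unix":
        return str(int(moment.timestamp()))
    return moment.strftime(timestamp_format)


# Fixed instant instead of datetime.now(timezone.utc), to keep the output stable.
fixed = datetime(2025, 1, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)

iso_seconds = format_timestamp(fixed)
iso_micro = format_timestamp(fixed, timespec="microseconds")
unix = format_timestamp(fixed, timestamp_format="unix")
custom = format_timestamp(fixed, timestamp_format="%Y-%m-%d %H:%M:%S UTC")
```

Note that isoformat(timespec="seconds") truncates the fractional part rather than rounding it.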
- __init__(field_name: str = 'conversion_timestamp', timestamp_format: str = 'iso', timespec: str = 'seconds')
Initialize with field name, format, and time precision.
- Parameters:
field_name (str) – Metadata field name
timestamp_format (str) – Timestamp format
timespec (str) – Time precision for ISO format (default: “seconds”)
- class all2md.transforms.CalculateWordCountTransform
Bases:
NodeTransformer
Calculate word and character counts and add to metadata.
This transform traverses the entire document, extracts all text, and calculates word and character counts. The counts are added to the document metadata.
- Parameters:
word_field (str, default = "word_count") – Metadata field name for word count
char_field (str, default = "char_count") – Metadata field name for character count
Examples
Basic usage:
>>> transform = CalculateWordCountTransform()
>>> new_doc = transform.transform(document)
>>> # metadata['word_count'] = 150
>>> # metadata['char_count'] = 890
Custom field names:
>>> transform = CalculateWordCountTransform(
...     word_field="words",
...     char_field="characters"
... )
>>> new_doc = transform.transform(document)
Notes
Character Count Behavior: The char_count metric represents the length of normalized text extracted from the AST, not the original document’s character count. During text extraction, text fragments from separate AST nodes are joined with spaces, which may introduce synthetic spacing not present in the original document. For example, if the AST contains two adjacent Text nodes Text("hello") and Text("world"), the extracted text will be "hello world" (11 characters including the inserted space), even though the original text nodes only contain 10 characters total.
This normalized approach provides consistent metrics across different AST structures, though it may not exactly match the original document’s byte count. Word count is calculated by splitting the normalized text on whitespace, which is generally more robust to these variations.
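The arithmetic in the note can be reproduced with plain strings standing in for Text nodes. The variable names are illustrative only.

```python
# Two adjacent text fragments, as if taken from separate Text nodes.
fragments = ["hello", "world"]

normalized = " ".join(fragments)           # fragments are joined with a space
char_count = len(normalized)               # includes the inserted space
word_count = len(normalized.split())       # split on whitespace
raw_chars = sum(len(fragment) for fragment in fragments)
```

Here char_count exceeds raw_chars by exactly the one inserted space, while word_count is unaffected by how the text was split across nodes.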
- __init__(word_field: str = 'word_count', char_field: str = 'char_count')
Initialize with field names.
- Parameters:
word_field (str) – Field name for word count
char_field (str) – Field name for character count
- class all2md.transforms.AddAttachmentFootnotesTransform
Bases:
NodeTransformer
Add footnote definitions for attachment references.
When attachments are processed with alt_text_mode="footnote", they generate footnote-style references like ![image][^label] but no corresponding definitions. This transform scans the AST for such references and adds FootnoteDefinition nodes with source information.
- Parameters:
section_title (str or None, default "Attachments") – Title for the footnote section heading. If None, no heading is added.
add_definitions_for_images (bool, default True) – Add definitions for image footnote references
add_definitions_for_links (bool, default True) – Add definitions for link footnote references
Examples
Add footnote definitions after conversion:
>>> transform = AddAttachmentFootnotesTransform()
>>> doc_with_footnotes = transform.transform(document)
Custom section title:
>>> transform = AddAttachmentFootnotesTransform(section_title="Image Sources")
>>> doc_with_footnotes = transform.transform(document)
Notes
This transform works by:
1. Collecting all Image and Link nodes with empty URLs (indicates footnote mode)
2. Extracting footnote labels from alt text or title
3. Handling duplicate labels by appending numeric suffixes (-2, -3, etc.)
4. Creating FootnoteDefinition nodes with source information
5. Appending definitions to the end of the document
Duplicate labels are resolved using a counter mechanism similar to heading ID generation. When a label appears multiple times, subsequent occurrences get a numeric suffix to ensure unique footnote identifiers.
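The counter-based suffixing for duplicate labels can be sketched in a few lines. unique_labels is a hypothetical helper; only the -2, -3 suffix pattern comes from the description above.

```python
def unique_labels(labels):
    """Keep the first occurrence as-is; later duplicates get -2, -3, and so on."""
    counts: dict[str, int] = {}
    result = []
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
        occurrence = counts[label]
        result.append(label if occurrence == 1 else f"{label}-{occurrence}")
    return result


deduplicated = unique_labels(["chart", "logo", "chart", "chart"])
```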
- __init__(section_title: str | None = 'Attachments', add_definitions_for_images: bool = True, add_definitions_for_links: bool = True)
Initialize transform with options.
- Parameters:
section_title (str or None) – Heading for footnotes section
add_definitions_for_images (bool) – Whether to process image footnotes
add_definitions_for_links (bool) – Whether to process link footnotes
- class all2md.transforms.GenerateTocTransform
Bases:
NodeTransformer
Generate a table of contents from document headings.
This transform extracts headings from the document and generates a nested list representing the table of contents. The TOC can be placed at the top or bottom of the document.
- Parameters:
title (str, default = "Table of Contents") – Title for the TOC section
max_depth (int, default = 3) – Maximum heading level to include (1-6)
position ({"top", "bottom"}, default = "top") – Position to insert the TOC
add_links (bool, default = True) – Whether to create links to headings (requires heading IDs)
separator (str, default = "-") – Separator for generating heading IDs when not present
set_ids_if_missing (bool, default = False) – If True, inject generated IDs into heading metadata when missing. This ensures renderers create anchors matching the TOC links. If False (default), IDs are only used for TOC links.
Examples
Basic usage:
>>> transform = GenerateTocTransform()
>>> doc_with_toc = transform.transform(document)
Custom depth and position:
>>> transform = GenerateTocTransform(
...     title="Contents",
...     max_depth=2,
...     position="bottom"
... )
>>> doc_with_toc = transform.transform(document)
Inject IDs into headings:
>>> transform = GenerateTocTransform(set_ids_if_missing=True)
>>> doc_with_toc = transform.transform(document)
>>> # Headings now have 'id' in metadata for renderer anchors
Notes
This transform works best when combined with AddHeadingIdsTransform, which generates unique IDs for headings that can be used for navigation. If headings don’t have IDs, the transform will generate slugified IDs on-the-fly for link targets.
ID Injection: With set_ids_if_missing=True, generated IDs are injected into heading metadata so renderers can create matching anchors. This is recommended when not using AddHeadingIdsTransform. Alternatively, run AddHeadingIdsTransform before GenerateTocTransform to ensure all headings have IDs upfront.
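The roles of max_depth and add_links can be illustrated by rendering a nested list from heading tuples. This is a text-level sketch under assumptions: the real transform builds AST list nodes, while build_toc is a hypothetical helper that emits Markdown directly from (level, text, id) tuples.

```python
def build_toc(headings, max_depth: int = 3, add_links: bool = True) -> str:
    """Render headings up to max_depth as an indented Markdown list."""
    lines = []
    for level, text, heading_id in headings:
        if level > max_depth:
            continue  # deeper headings are left out of the TOC
        label = f"[{text}](#{heading_id})" if add_links else text
        lines.append("  " * (level - 1) + f"- {label}")
    return "\n".join(lines)


headings = [
    (1, "Intro", "intro"),
    (2, "Setup", "setup"),
    (4, "Deep Dive", "deep-dive"),
    (1, "Usage", "usage"),
]
toc = build_toc(headings)
plain_toc = build_toc(headings, max_depth=1, add_links=False)
```

The link targets only resolve if the rendered headings carry the same IDs, which is why the notes recommend set_ids_if_missing=True or running AddHeadingIdsTransform first.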
- __init__(title: str = 'Table of Contents', max_depth: int = 3, position: str = 'top', add_links: bool = True, separator: str = '-', set_ids_if_missing: bool = False)
Initialize with TOC generation options.
- Parameters:
title (str) – TOC section title
max_depth (int) – Maximum heading level (1-6)
position (str) – Position for TOC (“top” or “bottom”)
add_links (bool) – Whether to generate links
separator (str) – Separator for ID generation
set_ids_if_missing (bool) – Inject generated IDs into heading metadata
- Raises:
ValueError – If max_depth is not between 1 and 6, or position is invalid
For transforms module documentation organized by functionality, see Transforms.