Configuration¶

ThunderDots is configured through three main layers:

client-level parameters passed to ThunderDots(...);
collection_params, which controls collection traversal;
resource_params, which controls document fetching and fragmentation.

The defaults below reflect the current implementation in thunderdots/client.py and thunderdots/config.py.

Client parameters¶

from thunderdots import ThunderDots

td = ThunderDots(
    endpoint_dts="https://dots.chartes.psl.eu/api/dts",
    collection_params={"collection_id": "ENCPOS_1972"},
    resource_params={"fragment_mode": "auto"},
)

Parameter	Type	Default	Role
`endpoint_dts`	`str`	required	DTS API root URL. Trailing `/` is removed internally.
`fetch_collection_metadata`	`bool`	`True`	Kept in the client configuration for collection metadata workflows.
`fetch_resource_metadata`	`bool`	`True`	Kept in the client configuration for resource metadata workflows.
`collection_params`	`dict \\| None`	`None`	Collection traversal options. `None` means default `CollectionParams`.
`resource_params`	`dict \\| None`	`None`	Resource fetching and fragmentation options. `None` means default `ResourceParams`.
`validate`	`bool`	`False`	Add JSON Schema validation reports to `results()["validation"]`.
`validation_profile`	`str`	`"dts"`	Stored in the configuration. Current automatic validation uses `output` and `resource_result` profiles.
`verbose`	`bool`	`True`	Enable Rich progress output.
`concurrency`	`int`	`20`	Number of concurrent workers for collection walking and resource fetching.
`timeout`	`float`	`30.0`	Legacy/global timeout value stored in the configuration.
`request_timeout`	`float`	`20.0`	HTTP request timeout passed to the `HttpxFetcher`.
`retries`	`int`	`2`	Retry attempts for temporary HTTP failures. Values are clamped by the fetcher between `0` and `5`.
`backoff_ms`	`int`	`200`	Base retry backoff in milliseconds.
`output_path`	`str \\| None`	`None`	Full JSON output path. Parent directories are created automatically.
`cache_csv_path`	`str \\| None`	`None`	Flat CSV cache/index path for fetched resources.
`use_cache`	`bool`	`True`	If `output_path` exists, reload it instead of running network calls.

Note

endpoint_dts is the only required constructor argument. If it is empty, ThunderDots raises ValueError("endpoint_dts is required").

Collection parameters¶

collection_params is converted internally to a CollectionParams dataclass.

collection_params = {
    "collection_id": "ENCPOS_1900",
    "excluded_ids": ["COLLECTION_TO_SKIP"],
    "metadata_dublincore": ["title"],
    "metadata_extensions": [],
    "fetch_linked_parents": True,
}

Parameter	Type	Default	Role
`collection_id`	`str \\| None`	`None`	Starting collection. `None` or an empty value starts at the DTS root collection.
`excluded_ids`	`list[str]`	`[]`	Collections or resources to ignore during traversal.
`metadata_dublincore`	`list[str] \\| None`	`None`	Dublin Core collection fields to keep. `None` keeps all fields; `[]` keeps none.
`metadata_extensions`	`list[str] \\| None`	`None`	Extension collection fields to keep. `None` keeps all fields; `[]` keeps none.
`fetch_linked_parents`	`bool`	`True`	Fetch linked parent collections for the current collection.

Metadata filtering semantics¶

ThunderDots intentionally distinguishes None from an empty list:

# Keep all Dublin Core metadata and no extension metadata.
collection_params = {
    "collection_id": "ENCPOS_1972",
    "metadata_dublincore": None,
    "metadata_extensions": [],
}

None means keep all metadata from that namespace.
[] means keep no metadata from that namespace.
['title', 'creator'] means keep only those fields.

Resource parameters¶

resource_params is converted internally to a ResourceParams dataclass.

resource_params = {
    "fragment_mode": "navigation",
    "metadata_dublincore": ["title", "creator", "date"],
    "metadata_extensions": ["dct:coverage"],
    "add_head_to_content": False,
    "include_breadcrumb": True,
    "fetch_linked_parents": True,
}

Parameter	Type	Default	Role
`metadata_dublincore`	`list[str] \\| None`	`None`	Dublin Core resource fields. `None` keeps all fields; `[]` keeps none.
`metadata_extensions`	`list[str] \\| None`	`None`	Extension resource fields. `None` keeps all fields; `[]` keeps none.
`add_head_to_content`	`bool`	`True`	Add headings to extracted text.
`include_breadcrumb`	`bool`	`True`	Add a `breadcrumb` field to fragments when available.
`exclude_heads_contains`	`list[str]`	`[]`	Exclude fragments whose heading contains one of these strings. Matching is case-insensitive and accent-insensitive.
`fetch_document`	`bool`	`True`	Fetch `/document`. If `False`, resources are returned without text fragments.
`fetch_navigation`	`bool`	`True`	Fetch `/navigation` when needed by `navigation` or `auto` mode.
`fetch_linked_parents`	`bool`	`True`	Fetch linked parent collections for each resource.
`fragment_mode`	`str`	`"auto"`	Fragmentation strategy: `auto`, `navigation`, `document`, or `tei_xpath`.
`fragment_xpath`	`str \\| None`	`None`	TEI XPath used when `fragment_mode="tei_xpath"`. Required for `tei_xpath`.
`title_xpath`	`str`	`"./tei:head"`	Local heading XPath used in `tei_xpath` mode.
`remove_fragment_heads`	`bool`	`True`	Remove local `<head>` nodes from fragment content in `tei_xpath` mode.
`generated_id_prefix`	`str`	`"__DOCUMENT__"`	Prefix for generated fragment IDs when no `xml:id` is available.

Fragment parameters¶

fragment_params is converted internally to a FragmentsParams dataclass.

fragment_params = {
    "metadata_dublincore": ["title", "creator", "date"],
}

Parameter	Type	Default	Role
`metadata_dublincore`	`list[str]`	`None`	Dublin Core fragment fields. `None` keeps all fields; `[]` keeps none.

Fragmentation modes¶

`auto`¶

auto is the default mode.

If fetch_navigation=True and the resource declares citationTrees.maxCiteDepth > 0, ThunderDots uses /navigation plus /document.
Otherwise, ThunderDots falls back to document mode.

resource_params = {
    "fragment_mode": "auto",
}

`document`¶

document mode fetches /document and returns one global fragment per resource.

resource_params = {
    "fragment_mode": "document",
    "fetch_document": True,
    "fetch_navigation": False,
    "add_head_to_content": False,
}

Use this mode when you want one full-text record per DTS resource and plan to apply your own chunking later.

`navigation`¶

navigation mode fetches /navigation and /document, then aligns DTS navigation identifiers with TEI xml:id values.

resource_params = {
    "fragment_mode": "navigation",
    "fetch_document": True,
    "fetch_navigation": True,
    "add_head_to_content": False,
    "include_breadcrumb": True,
}

Use this mode when the endpoint exposes a reliable citation tree and you want fragments to match citable DTS identifiers.

`tei_xpath`¶

tei_xpath mode ignores DTS navigation and fragments the TEI/XML document using your XPath expression.

resource_params = {
    "fragment_mode": "tei_xpath",
    "fragment_xpath": ".//tei:text/tei:body/tei:div",
    "title_xpath": "./tei:head",
    "remove_fragment_heads": True,
    "add_head_to_content": False,
    "fetch_document": True,
    "fetch_navigation": False,
}

Use this mode when you want full control over the documentary unit: one fragment per <div>, <p>, <ab>, or any project-specific TEI node.

Ready-to-copy configurations¶

Full document per resource¶

td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "document",
        "fetch_document": True,
        "fetch_navigation": False,
        "add_head_to_content": False,
        "include_breadcrumb": False,
    },
)

td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "navigation",
        "fetch_document": True,
        "fetch_navigation": True,
        "metadata_dublincore": ["title", "creator", "date", "coverage"],
        "metadata_extensions": ["dct:coverage", "dct:extend"],
        "add_head_to_content": False,
        "include_breadcrumb": True,
        "exclude_heads_contains": [
            "index",
            "appendices",
            "annexes",
            "sources",
            "bibliographie",
            "iconographie",
        ],
    },
)

TEI division fragments¶

td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "tei_xpath",
        "fragment_xpath": ".//tei:text/tei:body/tei:div",
        "title_xpath": "./tei:head",
        "remove_fragment_heads": True,
        "add_head_to_content": False,
        "fetch_document": True,
        "fetch_navigation": False,
        "include_breadcrumb": True,
        "generated_id_prefix": "__DOCUMENT__",
    },
)

Deprecated compatibility parameter¶

keep_metadata is still accepted in both collection_params and resource_params, but it emits a DeprecationWarning.

resource_params = {
    "keep_metadata": ["dublincore.creator", "dct:coverage", "extensions.download"],
}

Prefer the explicit form:

resource_params = {
    "metadata_dublincore": ["creator"],
    "metadata_extensions": ["dct:coverage", "download"],
}