Configuration
ThunderDots is configured through three main layers:
- client-level parameters passed to ThunderDots(...);
- collection_params, which controls collection traversal;
- resource_params, which controls document fetching and fragmentation.
The defaults below reflect the current implementation in thunderdots/client.py and thunderdots/config.py.
Client parameters
from thunderdots import ThunderDots
td = ThunderDots(
    endpoint_dts="https://dots.chartes.psl.eu/api/dts",
    collection_params={"collection_id": "ENCPOS_1972"},
    resource_params={"fragment_mode": "auto"},
)
| Parameter | Type | Default | Role |
|---|---|---|---|
| endpoint_dts | str | required | DTS API root URL. Trailing / is removed internally. |
| fetch_collection_metadata | bool | True | Kept in the client configuration for collection metadata workflows. |
| fetch_resource_metadata | bool | True | Kept in the client configuration for resource metadata workflows. |
| collection_params | dict \| None | None | Collection traversal options. None means default CollectionParams. |
| resource_params | dict \| None | None | Resource fetching and fragmentation options. None means default ResourceParams. |
| validate | bool | False | Add JSON Schema validation reports to results()["validation"]. |
| validation_profile | str | "dts" | Stored in the configuration. Current automatic validation uses output and resource_result profiles. |
| verbose | bool | True | Enable Rich progress output. |
| concurrency | int | 20 | Number of concurrent workers for collection walking and resource fetching. |
| timeout | float | 30.0 | Legacy/global timeout value stored in the configuration. |
| request_timeout | float | 20.0 | HTTP request timeout passed to the HttpxFetcher. |
| retries | int | 2 | Retry attempts for temporary HTTP failures. Values are clamped by the fetcher between 0 and 5. |
| backoff_ms | int | 200 | Base retry backoff in milliseconds. |
| output_path | str \| None | None | Full JSON output path. Parent directories are created automatically. |
| cache_csv_path | str \| None | None | Flat CSV cache/index path for fetched resources. |
| use_cache | bool | True | If output_path exists, reload it instead of running network calls. |
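The retry settings in the table can be illustrated with a small standalone sketch. It does not reproduce the HttpxFetcher code: the function names are hypothetical, and the doubling schedule is an assumption, since the table only says that backoff_ms is the base value and that retries is clamped between 0 and 5.

```python
def clamp_retries(retries: int, low: int = 0, high: int = 5) -> int:
    """Clamp the retry count into the 0..5 range stated in the table."""
    return max(low, min(high, retries))


def backoff_delays_ms(retries: int, backoff_ms: int = 200) -> list[int]:
    """Return one delay (in milliseconds) per retry attempt.

    Assumption: the delay doubles on each attempt (exponential backoff).
    The documentation only states that backoff_ms is the base value.
    """
    attempts = clamp_retries(retries)
    return [backoff_ms * (2 ** attempt) for attempt in range(attempts)]


print(clamp_retries(12))     # 5
print(clamp_retries(-3))     # 0
print(backoff_delays_ms(2))  # [200, 400]
```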
Note
endpoint_dts is the only required constructor argument. If it is empty, ThunderDots raises ValueError("endpoint_dts is required").
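As a rough illustration of the two endpoint rules described above (the required check and the trailing-slash removal), here is a minimal sketch. The helper name normalize_endpoint is hypothetical and is not part of the ThunderDots API; stripping every trailing slash with rstrip is also an assumption.

```python
def normalize_endpoint(endpoint_dts: str) -> str:
    """Validate and normalize a DTS root URL (illustrative helper only)."""
    if not endpoint_dts:
        # Same message as the one quoted in the note above.
        raise ValueError("endpoint_dts is required")
    # Assumption: all trailing slashes are dropped, not just one.
    return endpoint_dts.rstrip("/")


print(normalize_endpoint("https://dots.chartes.psl.eu/api/dts/"))
```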
Collection parameters
collection_params is converted internally to a CollectionParams dataclass.
collection_params = {
    "collection_id": "ENCPOS_1900",
    "excluded_ids": ["COLLECTION_TO_SKIP"],
    "metadata_dublincore": ["title"],
    "metadata_extensions": [],
    "fetch_linked_parents": True,
}
| Parameter | Type | Default | Role |
|---|---|---|---|
| collection_id | str \| None | None | Starting collection. None or an empty value starts at the DTS root collection. |
| excluded_ids | list[str] | [] | Collections or resources to ignore during traversal. |
| metadata_dublincore | list[str] \| None | None | Dublin Core collection fields to keep. None keeps all fields; [] keeps none. |
| metadata_extensions | list[str] \| None | None | Extension collection fields to keep. None keeps all fields; [] keeps none. |
| fetch_linked_parents | bool | True | Fetch linked parent collections for the current collection. |
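To make the roles of collection_id and excluded_ids concrete, the sketch below walks a hypothetical in-memory collection tree. The TREE data, the walk function, and the "ROOT" identifier are all illustrative assumptions; the real client traverses a DTS endpoint over HTTP, and this only mirrors the rules in the table (an empty start falls back to the root, and excluded identifiers are skipped together with everything beneath them).

```python
from typing import Iterator, Optional

# Hypothetical collection tree: parent id -> list of child ids.
TREE = {
    "ROOT": ["ENCPOS", "COLLECTION_TO_SKIP"],
    "ENCPOS": ["ENCPOS_1900", "ENCPOS_1972"],
    "ENCPOS_1900": ["ENCPOS_1900_01"],
    "COLLECTION_TO_SKIP": ["HIDDEN_RESOURCE"],
}


def walk(
    tree: dict,
    collection_id: Optional[str] = None,
    excluded_ids: Optional[list] = None,
    root_id: str = "ROOT",
) -> Iterator[str]:
    """Yield identifiers depth-first, skipping excluded ones."""
    excluded = set(excluded_ids or [])
    # None or an empty value starts at the root collection.
    stack = [collection_id or root_id]
    while stack:
        current = stack.pop()
        if current in excluded:
            continue  # the excluded node and its children are never visited
        yield current
        # Reverse so children are visited in their declared order.
        stack.extend(reversed(tree.get(current, [])))


print(list(walk(TREE, excluded_ids=["COLLECTION_TO_SKIP"])))
print(list(walk(TREE, collection_id="ENCPOS_1900")))
```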
Metadata filtering semantics
ThunderDots intentionally distinguishes None from an empty list:
# Keep all Dublin Core metadata and no extension metadata.
collection_params = {
    "collection_id": "ENCPOS_1972",
    "metadata_dublincore": None,
    "metadata_extensions": [],
}
- None means keep all metadata from that namespace.
- [] means keep no metadata from that namespace.
- ['title', 'creator'] means keep only those fields.
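The three cases above can be sketched as a single filtering function. This is an illustration of the semantics only: filter_metadata is a hypothetical helper, not a function exposed by ThunderDots, and the sample record is made up.

```python
from typing import Optional


def filter_metadata(fields: dict, keep: Optional[list]) -> dict:
    """Apply the None / [] / explicit-list semantics to one namespace."""
    if keep is None:
        # None: keep every field of the namespace.
        return dict(fields)
    # [] keeps nothing; a non-empty list keeps only the named fields.
    return {name: value for name, value in fields.items() if name in keep}


record = {"title": "Sample title", "creator": "Sample author", "date": "1972"}

print(filter_metadata(record, None))                  # all three fields
print(filter_metadata(record, []))                    # {}
print(filter_metadata(record, ["title", "creator"]))  # title and creator only
```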
Resource parameters
resource_params is converted internally to a ResourceParams dataclass.
resource_params = {
    "fragment_mode": "navigation",
    "metadata_dublincore": ["title", "creator", "date"],
    "metadata_extensions": ["dct:coverage"],
    "add_head_to_content": False,
    "include_breadcrumb": True,
    "fetch_linked_parents": True,
}
| Parameter | Type | Default | Role |
|---|---|---|---|
| metadata_dublincore | list[str] \| None | None | Dublin Core resource fields. None keeps all fields; [] keeps none. |
| metadata_extensions | list[str] \| None | None | Extension resource fields. None keeps all fields; [] keeps none. |
| add_head_to_content | bool | True | Add headings to extracted text. |
| include_breadcrumb | bool | True | Add a breadcrumb field to fragments when available. |
| exclude_heads_contains | list[str] | [] | Exclude fragments whose heading contains one of these strings. Matching is case-insensitive and accent-insensitive. |
| fetch_document | bool | True | Fetch /document. If False, resources are returned without text fragments. |
| fetch_navigation | bool | True | Fetch /navigation when needed by navigation or auto mode. |
| fetch_linked_parents | bool | True | Fetch linked parent collections for each resource. |
| fragment_mode | str | "auto" | Fragmentation strategy: auto, navigation, document, or tei_xpath. |
| fragment_xpath | str \| None | None | TEI XPath used when fragment_mode="tei_xpath". Required for tei_xpath. |
| title_xpath | str | "./tei:head" | Local heading XPath used in tei_xpath mode. |
| remove_fragment_heads | bool | True | Remove local <head> nodes from fragment content in tei_xpath mode. |
| generated_id_prefix | str | "__DOCUMENT__" | Prefix for generated fragment IDs when no xml:id is available. |
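The case-insensitive and accent-insensitive matching described for exclude_heads_contains could look roughly like the sketch below. It is one common way to implement such matching with the standard library (Unicode decomposition, then dropping combining marks); the helper names are hypothetical and the library may normalize text differently.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def is_excluded(heading: str, exclude_heads_contains: list) -> bool:
    """Return True if the heading contains any of the excluded strings."""
    normalized_heading = normalize(heading)
    # Empty strings are ignored, otherwise they would match every heading.
    return any(
        normalize(needle) in normalized_heading
        for needle in exclude_heads_contains
        if needle
    )


print(is_excluded("BIBLIOGRAPHIE GÉNÉRALE", ["bibliographie"]))  # True
print(is_excluded("Généalogie", ["genealogie"]))                 # True
print(is_excluded("Chapitre premier", ["index", "annexes"]))     # False
```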
Fragment parameters
fragment_params is converted internally to a FragmentsParams dataclass.
fragment_params = {
    "metadata_dublincore": ["title", "creator", "date"],
}
| Parameter | Type | Default | Role |
|---|---|---|---|
| metadata_dublincore | list[str] \| None | None | Dublin Core fragment fields. None keeps all fields; [] keeps none. |
Fragmentation modes
auto
auto is the default mode.
- If fetch_navigation=True and the resource declares citationTrees.maxCiteDepth > 0, ThunderDots uses /navigation plus /document.
- Otherwise, ThunderDots falls back to document mode.
resource_params = {
    "fragment_mode": "auto",
}
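The decision rule for auto mode can be written as a short function. This is a sketch of the rule as stated above, not the library's code: resolve_fragment_mode is a hypothetical name, and the citation depth is passed in as a plain integer instead of being read from a DTS response.

```python
from typing import Optional


def resolve_fragment_mode(
    fragment_mode: str,
    fetch_navigation: bool,
    max_cite_depth: Optional[int],
) -> str:
    """Return the mode that is effectively used for one resource."""
    if fragment_mode != "auto":
        # Explicit modes are left untouched.
        return fragment_mode
    if fetch_navigation and (max_cite_depth or 0) > 0:
        return "navigation"
    # No usable citation tree, or navigation fetching disabled.
    return "document"


print(resolve_fragment_mode("auto", True, 2))      # navigation
print(resolve_fragment_mode("auto", True, 0))      # document
print(resolve_fragment_mode("auto", False, 2))     # document
print(resolve_fragment_mode("tei_xpath", True, 2)) # tei_xpath
```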
document
document mode fetches /document and returns one global fragment per resource.
resource_params = {
    "fragment_mode": "document",
    "fetch_document": True,
    "fetch_navigation": False,
    "add_head_to_content": False,
}
Use this mode when you want one full-text record per DTS resource and plan to apply your own chunking later.
navigation
navigation mode fetches /navigation and /document, then aligns DTS navigation identifiers with TEI xml:id values.
resource_params = {
    "fragment_mode": "navigation",
    "fetch_document": True,
    "fetch_navigation": True,
    "add_head_to_content": False,
    "include_breadcrumb": True,
}
Use this mode when the endpoint exposes a reliable citation tree and you want fragments to match citable DTS identifiers.
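The alignment step between navigation identifiers and xml:id values might be pictured as follows. This is a simplified sketch using the standard library: the navigation response is reduced to a plain list of identifiers, the sample TEI is invented, and align_navigation is a hypothetical helper, so the actual implementation may differ in every detail.

```python
import xml.etree.ElementTree as ET

# Clark-notation key under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:id="ch1"><p>First chapter.</p></div>
      <div xml:id="ch2"><p>Second chapter.</p></div>
    </body>
  </text>
</TEI>"""


def align_navigation(xml_text: str, navigation_ids: list) -> list:
    """Return one fragment per navigation identifier found in the TEI."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an xml:id.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    fragments = []
    for identifier in navigation_ids:
        node = by_id.get(identifier)
        if node is None:
            continue  # identifier without a matching xml:id is skipped
        # Collapse whitespace in the element's text content.
        content = " ".join("".join(node.itertext()).split())
        fragments.append({"id": identifier, "content": content})
    return fragments


print(align_navigation(SAMPLE_TEI, ["ch1", "ch2", "ch3"]))
```

In this sketch, an identifier with no counterpart in the document ("ch3") is dropped silently; a real pipeline might prefer to log or report it.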
tei_xpath
tei_xpath mode ignores DTS navigation and fragments the TEI/XML document using your XPath expression.
resource_params = {
    "fragment_mode": "tei_xpath",
    "fragment_xpath": ".//tei:text/tei:body/tei:div",
    "title_xpath": "./tei:head",
    "remove_fragment_heads": True,
    "add_head_to_content": False,
    "fetch_document": True,
    "fetch_navigation": False,
}
Use this mode when you want full control over the documentary unit: one fragment per <div>, <p>, <ab>, or any project-specific TEI node.
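To show what XPath-driven fragmentation means in practice, here is a minimal standalone sketch built on xml.etree.ElementTree. It is not the ThunderDots implementation: fragment_by_xpath is a hypothetical function, the sample TEI is invented, and the generated identifier format (prefix followed by the fragment position) is an assumption, since the documentation only specifies the prefix.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:id="chap1">
        <head>Introduction</head>
        <p>First chapter.</p>
      </div>
      <div>
        <head>Sources</head>
        <p>Second chapter.</p>
      </div>
    </body>
  </text>
</TEI>"""


def fragment_by_xpath(
    xml_text: str,
    fragment_xpath: str,
    title_xpath: str = "./tei:head",
    remove_fragment_heads: bool = True,
    generated_id_prefix: str = "__DOCUMENT__",
) -> list:
    """Split a TEI document into one fragment per node matched by the XPath."""
    root = ET.fromstring(xml_text)
    fragments = []
    for position, node in enumerate(root.findall(fragment_xpath, TEI_NS), start=1):
        head = node.find(title_xpath, TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else None
        # Only direct children can be removed with Element.remove().
        if head is not None and remove_fragment_heads and head in list(node):
            node.remove(head)
        content = " ".join("".join(node.itertext()).split())
        # Assumption: generated ids are the prefix plus the fragment position.
        fragment_id = node.get(XML_ID) or f"{generated_id_prefix}{position}"
        fragments.append({"id": fragment_id, "title": title, "content": content})
    return fragments


for fragment in fragment_by_xpath(SAMPLE_TEI, ".//tei:text/tei:body/tei:div"):
    print(fragment)
```

Note that ElementTree supports only a subset of XPath; expressions that work here may need lxml or another engine once predicates or axes become more elaborate.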
Ready-to-copy configurations
Full document per resource
td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "document",
        "fetch_document": True,
        "fetch_navigation": False,
        "add_head_to_content": False,
        "include_breadcrumb": False,
    },
)
DTS navigation fragments
td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "navigation",
        "fetch_document": True,
        "fetch_navigation": True,
        "metadata_dublincore": ["title", "creator", "date", "coverage"],
        "metadata_extensions": ["dct:coverage", "dct:extend"],
        "add_head_to_content": False,
        "include_breadcrumb": True,
        "exclude_heads_contains": [
            "index",
            "appendices",
            "annexes",
            "sources",
            "bibliographie",
            "iconographie",
        ],
    },
)
TEI division fragments
td = ThunderDots(
    endpoint_dts=ENDPOINT_DTS,
    collection_params={"collection_id": COLLECTION_ID},
    resource_params={
        "fragment_mode": "tei_xpath",
        "fragment_xpath": ".//tei:text/tei:body/tei:div",
        "title_xpath": "./tei:head",
        "remove_fragment_heads": True,
        "add_head_to_content": False,
        "fetch_document": True,
        "fetch_navigation": False,
        "include_breadcrumb": True,
        "generated_id_prefix": "__DOCUMENT__",
    },
)
Deprecated compatibility parameter
keep_metadata is still accepted in both collection_params and resource_params, but it emits a DeprecationWarning.
resource_params = {
    "keep_metadata": ["dublincore.creator", "dct:coverage", "extensions.download"],
}
Prefer the explicit form:
resource_params = {
    "metadata_dublincore": ["creator"],
    "metadata_extensions": ["dct:coverage", "download"],
}
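When migrating existing configurations, a small helper can translate a keep_metadata list into the explicit form. The mapping rules below are inferred from the single example above (a dublincore. prefix goes to metadata_dublincore, an extensions. prefix is stripped, and anything else is treated as an extension field), so they should be read as an assumption and not as the library's documented behavior. The function name split_keep_metadata is hypothetical.

```python
import warnings


def split_keep_metadata(keep_metadata: list) -> dict:
    """Convert a legacy keep_metadata list into the two explicit lists."""
    warnings.warn(
        "keep_metadata is deprecated; use metadata_dublincore and metadata_extensions",
        DeprecationWarning,
        stacklevel=2,
    )
    dublincore, extensions = [], []
    for entry in keep_metadata:
        if entry.startswith("dublincore."):
            dublincore.append(entry[len("dublincore."):])
        elif entry.startswith("extensions."):
            extensions.append(entry[len("extensions."):])
        else:
            # Assumption: unprefixed or namespaced names are extension fields.
            extensions.append(entry)
    return {"metadata_dublincore": dublincore, "metadata_extensions": extensions}


# Capture the warning so the demonstration output stays clean.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    migrated = split_keep_metadata(
        ["dublincore.creator", "dct:coverage", "extensions.download"]
    )

print(migrated)
print([str(w.message) for w in caught])
```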