dlgforge.pipeline.sampling
Question sampling, coverage memory, and seed-topic helpers.
build_question_inputs(base_inputs, turn_index, n_turns, public_history, used_topic_ids, recent_ledger_questions, doc_usage, doc_chunk_counts, doc_recent_questions, avoid_sources, forced_mode, used_seed_hashes, seed_topic_usage)
Build question inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_inputs` | `Dict[str, Any]` | Mapping payload for this operation. | required |
| `turn_index` | `int` | Numeric control value for processing behavior. | required |
| `n_turns` | `int` | Numeric control value for processing behavior. | required |
| `public_history` | `List[Dict[str, Any]]` | Conversation or message data used during processing. | required |
| `used_topic_ids` | `set` | set value used by this operation. | required |
| `recent_ledger_questions` | `List[str]` | List[str] value used by this operation. | required |
| `doc_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_chunk_counts` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_recent_questions` | `Dict[str, List[str]]` | Dict[str, List[str]] value used by this operation. | required |
| `avoid_sources` | `set[str]` | set[str] value used by this operation. | required |
| `forced_mode` | `str` | str value used by this operation. | required |
| `used_seed_hashes` | `set[str]` | set[str] value used by this operation. | required |
| `seed_topic_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Any]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_question_inputs
>>> build_question_inputs(...)
build_rng(base_inputs, turn_index)
Build rng.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_inputs` | `Dict[str, Any]` | Mapping payload for this operation. | required |
| `turn_index` | `int` | Numeric control value for processing behavior. | required |
Returns:
| Type | Description |
|---|---|
| `random.Random` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_rng
>>> build_rng(...)
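The reference does not say how the generator is seeded. One common way to get a reproducible per-turn `random.Random` is to hash a base seed together with the turn index. The sketch below illustrates that idea only; `sketch_build_rng` and the `"seed"` key are hypothetical and are not taken from dlgforge.

```python
import hashlib
import random
from typing import Any, Dict


def sketch_build_rng(base_inputs: Dict[str, Any], turn_index: int) -> random.Random:
    """Hypothetical stand-in: derive a per-turn RNG from a base seed."""
    # Assumption: the base seed lives under a "seed" key; the real key may differ.
    base_seed = str(base_inputs.get("seed", ""))
    digest = hashlib.sha256(f"{base_seed}:{turn_index}".encode("utf-8")).hexdigest()
    # An integer taken from the digest is stable across processes, unlike the
    # built-in hash(), which is salted per process for strings.
    return random.Random(int(digest[:16], 16))
```

With this scheme, the same `(seed, turn_index)` pair replays the same random choices, while different turns get independent streams.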
select_question_mode(turn_index, n_turns, has_assistant, rng, seed_query)
Select question mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `turn_index` | `int` | Numeric control value for processing behavior. | required |
| `n_turns` | `int` | Numeric control value for processing behavior. | required |
| `has_assistant` | `bool` | bool value used by this operation. | required |
| `rng` | `Random` | random.Random value used by this operation. | required |
| `seed_query` | `str` | str value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `str` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import select_question_mode
>>> select_question_mode(...)
sample_topic_snippets(mode, seed_query, last_assistant_message, used_topic_ids, doc_usage, doc_chunk_counts, doc_recent_questions, avoid_sources, rng)
Sample topic snippets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mode` | `str` | str value used by this operation. | required |
| `seed_query` | `str` | str value used by this operation. | required |
| `last_assistant_message` | `str` | str value used by this operation. | required |
| `used_topic_ids` | `set` | set value used by this operation. | required |
| `doc_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_chunk_counts` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_recent_questions` | `Dict[str, List[str]]` | Dict[str, List[str]] value used by this operation. | required |
| `avoid_sources` | `set[str]` | set[str] value used by this operation. | required |
| `rng` | `Random` | random.Random value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `List[Dict[str, str]]` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import sample_topic_snippets
>>> sample_topic_snippets(...)
format_source_descriptor(metadata)
Format source descriptor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `Dict[str, Any]` | Mapping payload for this operation. | required |
Returns:
| Type | Description |
|---|---|
| `str` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import format_source_descriptor
>>> format_source_descriptor(...)
build_doc_usage(ledger_entries)
Build doc usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ledger_entries` | `List[Dict[str, Any]]` | List[Dict[str, Any]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, int]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_doc_usage
>>> build_doc_usage(...)
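The signature suggests a per-document tally built from ledger entries, but the entry schema is not documented here. As a rough illustration, the sketch below counts entries per document under the assumption that each entry stores its document identifier in a `"source"` key; both that key and the name `sketch_build_doc_usage` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def sketch_build_doc_usage(ledger_entries: List[Dict[str, Any]]) -> Dict[str, int]:
    """Hypothetical stand-in: count ledger entries per source document."""
    usage: Counter = Counter()
    for entry in ledger_entries:
        # Assumption: the document identifier is stored under "source".
        source = entry.get("source")
        if source:
            usage[source] += 1
    # Return a plain dict to match the documented Dict[str, int] return type.
    return dict(usage)
```

Entries without a source are skipped rather than counted under an empty key, so the resulting map only contains real documents.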
build_doc_chunk_counts()
Build doc chunk counts.
Returns:
| Type | Description |
|---|---|
| `Dict[str, int]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_doc_chunk_counts
>>> build_doc_chunk_counts()
build_doc_question_hashes(ledger_entries)
Build doc question hashes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ledger_entries` | `List[Dict[str, Any]]` | List[Dict[str, Any]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, set[str]]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_doc_question_hashes
>>> build_doc_question_hashes(...)
build_doc_recent_questions(ledger_entries, max_per_doc=8)
Build doc recent questions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ledger_entries` | `List[Dict[str, Any]]` | List[Dict[str, Any]] value used by this operation. | required |
| `max_per_doc` | `int` | int value used by this operation. | `8` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, List[str]]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_doc_recent_questions
>>> build_doc_recent_questions(...)
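The `max_per_doc` parameter hints at a bounded per-document history. A bounded `collections.deque` is one simple way to keep only the most recent items. The sketch below assumes hypothetical `"source"` and `"question"` keys on each ledger entry and that entries arrive oldest first; neither assumption is confirmed by this reference.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, List


def sketch_recent_questions(
    ledger_entries: List[Dict[str, Any]], max_per_doc: int = 8
) -> Dict[str, List[str]]:
    """Hypothetical stand-in: keep the last `max_per_doc` questions per document."""
    # A deque with maxlen silently discards the oldest item once it is full.
    recent: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=max_per_doc))
    for entry in ledger_entries:
        # Assumption: entries expose "source" and "question" keys.
        source, question = entry.get("source"), entry.get("question")
        if source and question:
            recent[source].append(question)
    return {source: list(questions) for source, questions in recent.items()}
```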
select_doc_pool(doc_usage, doc_chunk_counts, all_sources, rng)
Select doc pool.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `doc_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_chunk_counts` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `all_sources` | `List[str]` | List[str] value used by this operation. | required |
| `rng` | `Random` | random.Random value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `set[str]` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import select_doc_pool
>>> select_doc_pool(...)
coverage_ratio(source, doc_usage, doc_chunk_counts)
Coverage ratio.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str` | Filesystem path used by this operation. | required |
| `doc_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `doc_chunk_counts` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `float` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import coverage_ratio
>>> coverage_ratio(...)
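The exact formula is not given in this reference. A natural reading of the three parameters is "how much of a document has been used so far", that is, usage divided by chunk count. The sketch below implements that reading with two assumptions of its own: unknown or empty documents yield `0.0`, and the result is capped at `1.0`. The name `sketch_coverage_ratio` is hypothetical.

```python
from typing import Dict


def sketch_coverage_ratio(
    source: str, doc_usage: Dict[str, int], doc_chunk_counts: Dict[str, int]
) -> float:
    """Hypothetical stand-in: fraction of a document's chunks already used."""
    total = doc_chunk_counts.get(source, 0)
    if total <= 0:
        # Assumption: a document with no known chunks counts as uncovered.
        return 0.0
    # Assumption: cap at 1.0 so repeated use of the same chunk cannot exceed full coverage.
    return min(1.0, doc_usage.get(source, 0) / total)
```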
filter_results_by_sources(results, preferred_sources, allow_fallback)
Filter results by sources.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `results` | `List[tuple]` | List[tuple] value used by this operation. | required |
| `preferred_sources` | `set[str]` | set[str] value used by this operation. | required |
| `allow_fallback` | `bool` | bool value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `List[tuple]` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import filter_results_by_sources
>>> filter_results_by_sources(...)
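The layout of each result tuple is not documented here. To illustrate how a "preferred sources with optional fallback" filter typically behaves, the sketch below assumes each result is a `(text, metadata)` pair whose metadata carries a `"source"` key. That layout and the name `sketch_filter_results` are assumptions for illustration only.

```python
from typing import Any, Dict, List, Set, Tuple

Result = Tuple[str, Dict[str, Any]]  # assumed shape: (text, metadata)


def sketch_filter_results(
    results: List[Result], preferred_sources: Set[str], allow_fallback: bool
) -> List[Result]:
    """Hypothetical stand-in: keep results from preferred sources."""
    kept = [item for item in results if item[1].get("source") in preferred_sources]
    if kept or not allow_fallback:
        return kept
    # Nothing matched and fallback is allowed: return the unfiltered results
    # (copied so the caller's list is not aliased).
    return list(results)
```

In this reading, `allow_fallback=False` can produce an empty list, which the caller would have to handle.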
clamp_float(raw, default)
Clamp float.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw` | `str` | str value used by this operation. | required |
| `default` | `float` | float value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `float` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import clamp_float
>>> clamp_float(...)
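The clamping range is not stated in this reference. The sketch below shows the usual "parse, fall back, clamp" pattern for a string setting, assuming a `[0.0, 1.0]` range (exposed as hypothetical `low` and `high` parameters that the documented signature does not have) and treating unparseable or NaN input as missing.

```python
import math


def sketch_clamp_float(raw: str, default: float, low: float = 0.0, high: float = 1.0) -> float:
    """Hypothetical stand-in: parse `raw` as a float and clamp it to [low, high]."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Unparseable input (including None) falls back to the default.
        return default
    if math.isnan(value):
        # float("nan") parses successfully but cannot be ordered, so treat it as invalid.
        return default
    return max(low, min(high, value))
```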
update_coverage_ledger(paths, qa_output, question_inputs, used_topic_ids, used_question_hashes)
Update coverage ledger.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `OutputPaths` | Filesystem path used by this operation. | required |
| `qa_output` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `question_inputs` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `used_topic_ids` | `set` | set value used by this operation. | required |
| `used_question_hashes` | `set` | set value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `None` | No value is returned. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import update_coverage_ledger
>>> update_coverage_ledger(...)
is_duplicate_question(qa_output, question_inputs, doc_question_hashes)
Check whether a question is a duplicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `qa_output` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `question_inputs` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `doc_question_hashes` | `Dict[str, set[str]]` | Dict[str, set[str]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[bool, str]` | Boolean indicator describing the evaluated condition. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import is_duplicate_question
>>> is_duplicate_question(...)
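Neither the hashing scheme nor the meaning of the returned string is documented here. A typical per-document duplicate check normalizes the question text, hashes it, and looks the hash up in that document's set. The sketch below uses a simplified, hypothetical signature (`question`, `source`) instead of the documented `qa_output` and `question_inputs` payloads, and assumes the second tuple element is the computed hash.

```python
import hashlib
import re
from typing import Dict, Set, Tuple


def question_hash(question: str) -> str:
    """Hash a question after trimming, lowercasing, and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", question.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def sketch_is_duplicate(
    question: str, source: str, doc_question_hashes: Dict[str, Set[str]]
) -> Tuple[bool, str]:
    """Hypothetical stand-in: report whether `question` was already asked of `source`."""
    digest = question_hash(question)
    # Assumption: the returned string is the hash, so the caller can store it later.
    return digest in doc_question_hashes.get(source, set()), digest
```

Scoping the lookup by document means the same wording can still be asked about a different source.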
update_doc_question_memory(qa_output, question_inputs, doc_question_hashes, doc_recent_questions)
Update doc question memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `qa_output` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `question_inputs` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `doc_question_hashes` | `Dict[str, set[str]]` | Dict[str, set[str]] value used by this operation. | required |
| `doc_recent_questions` | `Dict[str, List[str]]` | Dict[str, List[str]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `None` | No value is returned. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import update_doc_question_memory
>>> update_doc_question_memory(...)
lookup_source_path(question_inputs, topic_id)
Look up source path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `question_inputs` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `topic_id` | `str` | str value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `str` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import lookup_source_path
>>> lookup_source_path(...)
maybe_select_seed_question(base_inputs, turn_index, rng, used_seed_hashes, seed_topic_usage)
Conditionally select a seed question.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_inputs` | `Dict[str, Any]` | Mapping payload for this operation. | required |
| `turn_index` | `int` | Numeric control value for processing behavior. | required |
| `rng` | `Random` | random.Random value used by this operation. | required |
| `used_seed_hashes` | `set[str]` | set[str] value used by this operation. | required |
| `seed_topic_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[str, str]` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import maybe_select_seed_question
>>> maybe_select_seed_question(...)
load_seed_topics(path, project_root, config_dir, target_language='', seed_topics_variant='')
Load seed topics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Filesystem path used by this operation. | required |
| `project_root` | `Path` | Resolved project directory context. | required |
| `config_dir` | `Path` | Resolved project directory context. | required |
| `target_language` | `str` | str value used by this operation. | `''` |
| `seed_topics_variant` | `str` | str value used by this operation. | `''` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, List[str]]` | Loaded value parsed from upstream sources. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import load_seed_topics
>>> load_seed_topics(...)
select_seed_candidate(seed_topics, used_seed_hashes, seed_topic_usage, rng)
Select seed candidate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `seed_topics` | `Dict[str, List[str]]` | Dict[str, List[str]] value used by this operation. | required |
| `used_seed_hashes` | `set[str]` | set[str] value used by this operation. | required |
| `seed_topic_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
| `rng` | `Random` | random.Random value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `tuple[str, str]` | Value produced by this API. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import select_seed_candidate
>>> select_seed_candidate(...)
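The selection policy is not described in this reference. Given a usage counter and a set of already used hashes, one plausible policy is "prefer the least-used topic, then pick a question that has not been used yet". The sketch below implements that policy under stated assumptions: the result is a `(topic, question)` pair, `("", "")` means nothing is left, and `seed_hash` is a hypothetical hashing helper.

```python
import hashlib
import random
from typing import Dict, List, Set, Tuple


def seed_hash(question: str) -> str:
    """Hypothetical helper: stable hash of a trimmed, lowercased seed question."""
    return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()


def sketch_select_seed(
    seed_topics: Dict[str, List[str]],
    used_seed_hashes: Set[str],
    seed_topic_usage: Dict[str, int],
    rng: random.Random,
) -> Tuple[str, str]:
    """Hypothetical stand-in: pick an unused question from the least-used topic."""
    # Drop questions that were already used, then drop topics with nothing left.
    fresh = {
        topic: [q for q in questions if seed_hash(q) not in used_seed_hashes]
        for topic, questions in seed_topics.items()
    }
    fresh = {topic: questions for topic, questions in fresh.items() if questions}
    if not fresh:
        return "", ""  # assumption: empty strings signal "no candidate"
    least = min(seed_topic_usage.get(topic, 0) for topic in fresh)
    # Sort before choosing so the result depends only on the RNG state, not dict order.
    candidates = sorted(t for t in fresh if seed_topic_usage.get(t, 0) == least)
    topic = rng.choice(candidates)
    return topic, rng.choice(fresh[topic])
```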
build_used_seed_hashes(ledger_entries)
Build used seed hashes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ledger_entries` | `List[Dict[str, Any]]` | List[Dict[str, Any]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `set[str]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_used_seed_hashes
>>> build_used_seed_hashes(...)
build_seed_topic_usage(ledger_entries)
Build seed topic usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ledger_entries` | `List[Dict[str, Any]]` | List[Dict[str, Any]] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, int]` | Constructed value derived from the provided inputs. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import build_seed_topic_usage
>>> build_seed_topic_usage(...)
update_seed_memory(qa_output, question_inputs, used_seed_hashes, seed_topic_usage)
Update seed memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `qa_output` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `question_inputs` | `Dict[str, Any]` | Dict[str, Any] value used by this operation. | required |
| `used_seed_hashes` | `set[str]` | set[str] value used by this operation. | required |
| `seed_topic_usage` | `Dict[str, int]` | Dict[str, int] value used by this operation. | required |
Returns:
| Type | Description |
|---|---|
| `None` | No value is returned. |
Raises:
| Type | Description |
|---|---|
| `Exception` | Propagates unexpected runtime errors from downstream calls. |
Side Effects / I/O: - Primarily performs in-memory transformations.
Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.
Examples:
>>> from dlgforge.pipeline.sampling import update_seed_memory
>>> update_seed_memory(...)