dlgforge.pipeline.sampling

Question sampling, coverage memory, and seed-topic helpers.

build_question_inputs(base_inputs, turn_index, n_turns, public_history, used_topic_ids, recent_ledger_questions, doc_usage, doc_chunk_counts, doc_recent_questions, avoid_sources, forced_mode, used_seed_hashes, seed_topic_usage)

Build question inputs.

Parameters:

Name Type Description Default
base_inputs Dict[str, Any]

Base input mapping from which the per-turn question inputs are built.

required
turn_index int

Index of the current turn.

required
n_turns int

Total number of turns in the conversation.

required
public_history List[Dict[str, Any]]

Public conversation history (messages) available for this turn.

required
used_topic_ids set

Topic IDs that have already been used.

required
recent_ledger_questions List[str]

Recent questions recorded in the coverage ledger.

required
doc_usage Dict[str, int]

Usage counts per document (see build_doc_usage).

required
doc_chunk_counts Dict[str, int]

Chunk counts per document (see build_doc_chunk_counts).

required
doc_recent_questions Dict[str, List[str]]

Recent questions per document (see build_doc_recent_questions).

required
avoid_sources set[str]

Sources to avoid when sampling.

required
forced_mode str

Question mode to force for this turn.

required
used_seed_hashes set[str]

Hashes of seed questions that have already been used (see build_used_seed_hashes).

required
seed_topic_usage Dict[str, int]

Usage counts per seed topic (see build_seed_topic_usage).

required

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_question_inputs
>>> build_question_inputs(...)

build_rng(base_inputs, turn_index)

Build a random number generator (random.Random) for the given turn.

Parameters:

Name Type Description Default
base_inputs Dict[str, Any]

Mapping payload for this operation.

required
turn_index int

Numeric control value for processing behavior.

required

Returns:

Type Description
Random

random.Random: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_rng
>>> build_rng(...)
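
The call above leaves its arguments elided. As a rough illustration of how a per-turn generator could be made reproducible, the sketch below seeds random.Random from the inputs and the turn index. It is not dlgforge's implementation: the function name build_rng_sketch and the "seed" key are assumptions made for this example.

```python
import hashlib
import random
from typing import Any, Dict


def build_rng_sketch(base_inputs: Dict[str, Any], turn_index: int) -> random.Random:
    """Return a Random seeded deterministically from the inputs and the turn."""
    # "seed" is a hypothetical key; the real payload may use a different field.
    material = f"{base_inputs.get('seed', '')}:{turn_index}"
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    # Use the first 64 bits of the digest as an integer seed.
    return random.Random(int(digest[:16], 16))


first = build_rng_sketch({"seed": 7}, turn_index=2)
second = build_rng_sketch({"seed": 7}, turn_index=2)
# Identical inputs produce identical random streams.
same_draw = first.random() == second.random()
```

Deriving the seed from both values keeps each turn reproducible while still giving different turns different streams.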

select_question_mode(turn_index, n_turns, has_assistant, rng, seed_query)

Select question mode.

Parameters:

Name Type Description Default
turn_index int

Index of the current turn.

required
n_turns int

Total number of turns in the conversation.

required
has_assistant bool

Whether the conversation already contains an assistant message.

required
rng Random

Random number generator (random.Random) used for the selection.

required
seed_query str

Seed query associated with the turn.

required

Returns:

Name Type Description
str str

Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import select_question_mode
>>> select_question_mode(...)

sample_topic_snippets(mode, seed_query, last_assistant_message, used_topic_ids, doc_usage, doc_chunk_counts, doc_recent_questions, avoid_sources, rng)

Sample topic snippets.

Parameters:

Name Type Description Default
mode str

Question mode that guides the sampling (see select_question_mode).

required
seed_query str

Seed query associated with the turn.

required
last_assistant_message str

Text of the most recent assistant message.

required
used_topic_ids set

Topic IDs that have already been used.

required
doc_usage Dict[str, int]

Usage counts per document (see build_doc_usage).

required
doc_chunk_counts Dict[str, int]

Chunk counts per document (see build_doc_chunk_counts).

required
doc_recent_questions Dict[str, List[str]]

Recent questions per document (see build_doc_recent_questions).

required
avoid_sources set[str]

Sources to avoid when sampling.

required
rng Random

Random number generator (random.Random) used for sampling.

required

Returns:

Type Description
List[Dict[str, str]]

List[Dict[str, str]]: Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import sample_topic_snippets
>>> sample_topic_snippets(...)

format_source_descriptor(metadata)

Format source descriptor.

Parameters:

Name Type Description Default
metadata Dict[str, Any]

Mapping payload for this operation.

required

Returns:

Name Type Description
str str

Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import format_source_descriptor
>>> format_source_descriptor(...)

build_doc_usage(ledger_entries)

Build doc usage.

Parameters:

Name Type Description Default
ledger_entries List[Dict[str, Any]]

Entries read from the coverage ledger.

required

Returns:

Type Description
Dict[str, int]

Dict[str, int]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_doc_usage
>>> build_doc_usage(...)
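
Since the call above elides its argument, here is a minimal sketch of the idea of deriving per-document usage counts from ledger entries. It is an illustration under assumptions, not the library's code: the name build_doc_usage_sketch and the "source" key on each entry are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def build_doc_usage_sketch(ledger_entries: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how many ledger entries reference each source document."""
    usage: Counter = Counter()
    for entry in ledger_entries:
        source = entry.get("source")  # hypothetical key name
        if source:
            usage[source] += 1
    return dict(usage)


entries = [{"source": "a.md"}, {"source": "b.md"}, {"source": "a.md"}, {}]
usage = build_doc_usage_sketch(entries)  # entries without a source are skipped
```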

build_doc_chunk_counts()

Build doc chunk counts.

Returns:

Type Description
Dict[str, int]

Dict[str, int]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_doc_chunk_counts
>>> build_doc_chunk_counts()

build_doc_question_hashes(ledger_entries)

Build doc question hashes.

Parameters:

Name Type Description Default
ledger_entries List[Dict[str, Any]]

Entries read from the coverage ledger.

required

Returns:

Type Description
Dict[str, set[str]]

Dict[str, set[str]]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_doc_question_hashes
>>> build_doc_question_hashes(...)
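
To make the return shape (a set of question hashes per document) concrete, the following sketch groups normalized question hashes by source. The entry keys "source" and "question", the normalization, and the choice of SHA-256 are all assumptions for illustration; the library may hash differently.

```python
import hashlib
from typing import Any, Dict, List, Set


def question_hash(question: str) -> str:
    """Hash a question after lowercasing and collapsing whitespace."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_doc_question_hashes_sketch(
    ledger_entries: List[Dict[str, Any]],
) -> Dict[str, Set[str]]:
    """Group question hashes by the document they were asked about."""
    hashes: Dict[str, Set[str]] = {}
    for entry in ledger_entries:
        source = entry.get("source")      # hypothetical key name
        question = entry.get("question")  # hypothetical key name
        if source and question:
            hashes.setdefault(source, set()).add(question_hash(question))
    return hashes


entries = [
    {"source": "a.md", "question": "What is X?"},
    {"source": "a.md", "question": "what  is x?"},  # same question after normalization
    {"source": "b.md", "question": "Why Y?"},
]
doc_hashes = build_doc_question_hashes_sketch(entries)
```

Normalizing before hashing lets trivially different spellings of one question collapse to a single hash.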

build_doc_recent_questions(ledger_entries, max_per_doc=8)

Build doc recent questions.

Parameters:

Name Type Description Default
ledger_entries List[Dict[str, Any]]

Entries read from the coverage ledger.

required
max_per_doc int

Maximum number of recent questions to keep per document.

8

Returns:

Type Description
Dict[str, List[str]]

Dict[str, List[str]]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_doc_recent_questions
>>> build_doc_recent_questions(...)
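
The max_per_doc parameter suggests a bounded per-document history. A minimal sketch of that behavior, assuming ledger entries are ordered oldest to newest and carry hypothetical "source" and "question" keys:

```python
from typing import Any, Dict, List


def build_doc_recent_questions_sketch(
    ledger_entries: List[Dict[str, Any]], max_per_doc: int = 8
) -> Dict[str, List[str]]:
    """Keep only the most recent max_per_doc questions for each document."""
    recent: Dict[str, List[str]] = {}
    for entry in ledger_entries:
        source = entry.get("source")      # hypothetical key name
        question = entry.get("question")  # hypothetical key name
        if source and question:
            recent.setdefault(source, []).append(question)
    if max_per_doc <= 0:
        # Guard: a slice of [-0:] would return the whole list, not an empty one.
        return {source: [] for source in recent}
    return {source: questions[-max_per_doc:] for source, questions in recent.items()}


entries = [
    {"source": "a.md", "question": "q1"},
    {"source": "a.md", "question": "q2"},
    {"source": "a.md", "question": "q3"},
    {"source": "b.md", "question": "q4"},
]
recent = build_doc_recent_questions_sketch(entries, max_per_doc=2)
```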

select_doc_pool(doc_usage, doc_chunk_counts, all_sources, rng)

Select doc pool.

Parameters:

Name Type Description Default
doc_usage Dict[str, int]

Usage counts per document (see build_doc_usage).

required
doc_chunk_counts Dict[str, int]

Chunk counts per document (see build_doc_chunk_counts).

required
all_sources List[str]

All available source documents to choose from.

required
rng Random

Random number generator (random.Random) used for the selection.

required

Returns:

Type Description
set[str]

set[str]: Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import select_doc_pool
>>> select_doc_pool(...)
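
One plausible reading of this signature is that the pool favors documents with low coverage. The sketch below illustrates that idea only; the selection rule, the pool_size parameter, and the name select_doc_pool_sketch are assumptions, not the documented behavior.

```python
import random
from typing import Dict, List, Set


def select_doc_pool_sketch(
    doc_usage: Dict[str, int],
    doc_chunk_counts: Dict[str, int],
    all_sources: List[str],
    rng: random.Random,
    pool_size: int = 2,  # hypothetical parameter, not part of the real signature
) -> Set[str]:
    """Pick the pool_size least-covered sources, breaking ties randomly."""

    def ratio(source: str) -> float:
        chunks = doc_chunk_counts.get(source, 0)
        return doc_usage.get(source, 0) / chunks if chunks > 0 else 0.0

    candidates = list(all_sources)
    rng.shuffle(candidates)     # randomize order so ties are broken randomly
    candidates.sort(key=ratio)  # stable sort keeps the shuffled order among ties
    return set(candidates[:pool_size])


pool = select_doc_pool_sketch(
    doc_usage={"a.md": 9, "b.md": 1, "c.md": 2},
    doc_chunk_counts={"a.md": 10, "b.md": 10, "c.md": 10},
    all_sources=["a.md", "b.md", "c.md"],
    rng=random.Random(0),
)
```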

coverage_ratio(source, doc_usage, doc_chunk_counts)

Compute the coverage ratio for a source document.

Parameters:

Name Type Description Default
source str

Source document whose coverage ratio is computed.

required
doc_usage Dict[str, int]

Usage counts per document (see build_doc_usage).

required
doc_chunk_counts Dict[str, int]

Chunk counts per document (see build_doc_chunk_counts).

required

Returns:

Name Type Description
float float

Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import coverage_ratio
>>> coverage_ratio(...)
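
Given the two count mappings, a natural definition of coverage is usage divided by chunk count. The sketch below assumes that definition, treats a missing or zero chunk count as zero coverage, and caps the result at 1.0; each of those choices is an assumption rather than confirmed library behavior.

```python
from typing import Dict


def coverage_ratio_sketch(
    source: str, doc_usage: Dict[str, int], doc_chunk_counts: Dict[str, int]
) -> float:
    """Return usage / chunk count for a source, limited to the range [0.0, 1.0]."""
    chunks = doc_chunk_counts.get(source, 0)
    if chunks <= 0:
        # Assumption: an unknown or empty document counts as uncovered.
        return 0.0
    return min(1.0, doc_usage.get(source, 0) / chunks)


partial = coverage_ratio_sketch("a.md", {"a.md": 2}, {"a.md": 8})
unknown = coverage_ratio_sketch("missing.md", {"a.md": 2}, {"a.md": 8})
saturated = coverage_ratio_sketch("a.md", {"a.md": 20}, {"a.md": 8})
```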

filter_results_by_sources(results, preferred_sources, allow_fallback)

Filter results by sources.

Parameters:

Name Type Description Default
results List[tuple]

Results to filter, given as tuples.

required
preferred_sources set[str]

Sources whose results should be kept.

required
allow_fallback bool

Whether a fallback is allowed when no result matches the preferred sources.

required

Returns:

Type Description
List[tuple]

List[tuple]: Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import filter_results_by_sources
>>> filter_results_by_sources(...)
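
The parameter names suggest a filter with an optional fallback. The sketch below shows one way that could work, assuming each result is a (text, metadata) tuple whose metadata holds a "source" key, and that the fallback returns the unfiltered results. Both the tuple layout and the fallback rule are hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

Result = Tuple[str, Dict[str, Any]]  # hypothetical layout: (text, metadata)


def filter_results_by_sources_sketch(
    results: List[Result], preferred_sources: Set[str], allow_fallback: bool
) -> List[Result]:
    """Keep results from preferred sources, optionally falling back to all results."""
    matching = [
        result for result in results
        if result[1].get("source") in preferred_sources
    ]
    if matching or not allow_fallback:
        return matching
    # Nothing matched and fallback is allowed: return the original results.
    return list(results)


results = [("chunk one", {"source": "a.md"}), ("chunk two", {"source": "b.md"})]
kept = filter_results_by_sources_sketch(results, {"a.md"}, allow_fallback=False)
fallback = filter_results_by_sources_sketch(results, {"z.md"}, allow_fallback=True)
strict = filter_results_by_sources_sketch(results, {"z.md"}, allow_fallback=False)
```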

clamp_float(raw, default)

Convert a raw string to a clamped float, using default as the fallback.

Parameters:

Name Type Description Default
raw str

Raw string value to convert.

required
default float

Fallback value used when raw cannot be converted.

required

Returns:

Name Type Description
float float

Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import clamp_float
>>> clamp_float(...)
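
As an illustration of the parse-then-clamp pattern the signature implies, the sketch below converts the raw string, falls back to default on failure, and clamps to a range. The [0.0, 1.0] bounds and the low and high parameters are assumptions; the page does not state the real range.

```python
from typing import Optional


def clamp_float_sketch(
    raw: Optional[str], default: float, low: float = 0.0, high: float = 1.0
) -> float:
    """Parse raw as a float and clamp it to [low, high]; use default if parsing fails."""
    try:
        value = float(raw)  # raises TypeError for None, ValueError for bad text
    except (TypeError, ValueError):
        return default
    return max(low, min(high, value))


in_range = clamp_float_sketch("0.3", default=0.5)
too_high = clamp_float_sketch("7", default=0.5)
too_low = clamp_float_sketch("-2", default=0.5)
unparsable = clamp_float_sketch("abc", default=0.5)
```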

update_coverage_ledger(paths, qa_output, question_inputs, used_topic_ids, used_question_hashes)

Update coverage ledger.

Parameters:

Name Type Description Default
paths OutputPaths

Output paths object for the run.

required
qa_output Dict[str, Any]

Output of the QA step for the turn.

required
question_inputs Dict[str, Any]

Question inputs built for the turn (see build_question_inputs).

required
used_topic_ids set

Topic IDs that have already been used.

required
used_question_hashes set

Hashes of questions that have already been used.

required

Returns:

Name Type Description
None None

No value is returned.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

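Side Effects / I/O: - Returns nothing, so its effects are presumably the ledger update at the location given by paths and in-place updates to the tracking sets passed in.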

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import update_coverage_ledger
>>> update_coverage_ledger(...)

is_duplicate_question(qa_output, question_inputs, doc_question_hashes)

Check whether a question is a duplicate.

Parameters:

Name Type Description Default
qa_output Dict[str, Any]

Output of the QA step for the turn, containing the question to check.

required
question_inputs Dict[str, Any]

Question inputs built for the turn (see build_question_inputs).

required
doc_question_hashes Dict[str, set[str]]

Sets of question hashes per document (see build_doc_question_hashes).

required

Returns:

Type Description
tuple[bool, str]

tuple[bool, str]: The boolean indicates whether the question is a duplicate; the string carries an accompanying detail for that result.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import is_duplicate_question
>>> is_duplicate_question(...)
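
To show how a per-document hash set can drive a duplicate check that returns a (bool, str) pair, here is a small sketch. The "question" and "source" keys, the hashing scheme, and the choice to return the hash as the string element are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Dict, Set, Tuple


def question_hash(question: str) -> str:
    """Hash a question after lowercasing and collapsing whitespace."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate_question_sketch(
    qa_output: Dict[str, Any],
    question_inputs: Dict[str, Any],
    doc_question_hashes: Dict[str, Set[str]],
) -> Tuple[bool, str]:
    """Return (is_duplicate, question_hash) for the question in qa_output."""
    question = str(qa_output.get("question", ""))    # hypothetical key name
    source = str(question_inputs.get("source", ""))  # hypothetical key name
    digest = question_hash(question)
    return digest in doc_question_hashes.get(source, set()), digest


seen = {"a.md": {question_hash("What is X?")}}
duplicate, digest = is_duplicate_question_sketch(
    {"question": "what  is x?"}, {"source": "a.md"}, seen
)
fresh, _ = is_duplicate_question_sketch(
    {"question": "What is X?"}, {"source": "b.md"}, seen
)
```

Scoping the check to one document means the same question can still be asked about a different source.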

update_doc_question_memory(qa_output, question_inputs, doc_question_hashes, doc_recent_questions)

Update doc question memory.

Parameters:

Name Type Description Default
qa_output Dict[str, Any]

Output of the QA step for the turn.

required
question_inputs Dict[str, Any]

Question inputs built for the turn (see build_question_inputs).

required
doc_question_hashes Dict[str, set[str]]

Sets of question hashes per document (see build_doc_question_hashes).

required
doc_recent_questions Dict[str, List[str]]

Recent questions per document (see build_doc_recent_questions).

required

Returns:

Name Type Description
None None

No value is returned.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Returns nothing; updates doc_question_hashes and doc_recent_questions in place.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import update_doc_question_memory
>>> update_doc_question_memory(...)

lookup_source_path(question_inputs, topic_id)

Look up source path.

Parameters:

Name Type Description Default
question_inputs Dict[str, Any]

Question inputs built for the turn (see build_question_inputs).

required
topic_id str

ID of the topic whose source path is looked up.

required

Returns:

Name Type Description
str str

Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import lookup_source_path
>>> lookup_source_path(...)

maybe_select_seed_question(base_inputs, turn_index, rng, used_seed_hashes, seed_topic_usage)

Select a seed question when one applies to the turn.

Parameters:

Name Type Description Default
base_inputs Dict[str, Any]

Base input mapping for the run.

required
turn_index int

Index of the current turn.

required
rng Random

Random number generator (random.Random) used for the selection.

required
used_seed_hashes set[str]

Hashes of seed questions that have already been used (see build_used_seed_hashes).

required
seed_topic_usage Dict[str, int]

Usage counts per seed topic (see build_seed_topic_usage).

required

Returns:

Type Description
tuple[str, str]

tuple[str, str]: Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import maybe_select_seed_question
>>> maybe_select_seed_question(...)

load_seed_topics(path, project_root, config_dir, target_language='', seed_topics_variant='')

Load seed topics.

Parameters:

Name Type Description Default
path str

Path to the seed topics file.

required
project_root Path

Resolved project root directory.

required
config_dir Path

Resolved configuration directory.

required
target_language str

Target language for the seed topics.

''
seed_topics_variant str

Name of the seed topics variant to load.

''

Returns:

Type Description
Dict[str, List[str]]

Dict[str, List[str]]: Loaded value parsed from upstream sources.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Reads seed topics from the location given by path.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import load_seed_topics
>>> load_seed_topics(...)

select_seed_candidate(seed_topics, used_seed_hashes, seed_topic_usage, rng)

Select seed candidate.

Parameters:

Name Type Description Default
seed_topics Dict[str, List[str]]

Seed questions grouped by topic (see load_seed_topics).

required
used_seed_hashes set[str]

Hashes of seed questions that have already been used (see build_used_seed_hashes).

required
seed_topic_usage Dict[str, int]

Usage counts per seed topic (see build_seed_topic_usage).

required
rng Random

Random number generator (random.Random) used for the selection.

required

Returns:

Type Description
tuple[str, str]

tuple[str, str]: Value produced by this API.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import select_seed_candidate
>>> select_seed_candidate(...)
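
The inputs here (topics, used hashes, usage counts, and an rng) fit a "least-used topic, unused question" strategy. The sketch below illustrates that strategy only. The selection rule, the hashing, the (topic, question) ordering of the result, and the empty-pair return when nothing is available are assumptions.

```python
import hashlib
import random
from typing import Dict, List, Set, Tuple


def seed_hash(question: str) -> str:
    """Hash a seed question after trimming and lowercasing it."""
    return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()


def select_seed_candidate_sketch(
    seed_topics: Dict[str, List[str]],
    used_seed_hashes: Set[str],
    seed_topic_usage: Dict[str, int],
    rng: random.Random,
) -> Tuple[str, str]:
    """Pick an unused question from one of the least-used topics."""
    available: Dict[str, List[str]] = {}
    for topic, questions in seed_topics.items():
        unused = [q for q in questions if seed_hash(q) not in used_seed_hashes]
        if unused:
            available[topic] = unused
    if not available:
        return "", ""  # assumption: an empty pair signals "no candidate"
    lowest = min(seed_topic_usage.get(topic, 0) for topic in available)
    least_used = sorted(
        topic for topic in available if seed_topic_usage.get(topic, 0) == lowest
    )
    topic = rng.choice(least_used)
    return topic, rng.choice(available[topic])


topics = {"billing": ["How do refunds work?"], "setup": ["How do I install it?"]}
choice = select_seed_candidate_sketch(topics, set(), {"billing": 3}, random.Random(0))
exhausted = select_seed_candidate_sketch(
    topics,
    {seed_hash("How do refunds work?"), seed_hash("How do I install it?")},
    {},
    random.Random(0),
)
```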

build_used_seed_hashes(ledger_entries)

Build used seed hashes.

Parameters:

Name Type Description Default
ledger_entries List[Dict[str, Any]]

Entries read from the coverage ledger.

required

Returns:

Type Description
set[str]

set[str]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_used_seed_hashes
>>> build_used_seed_hashes(...)

build_seed_topic_usage(ledger_entries)

Build seed topic usage.

Parameters:

Name Type Description Default
ledger_entries List[Dict[str, Any]]

Entries read from the coverage ledger.

required

Returns:

Type Description
Dict[str, int]

Dict[str, int]: Constructed value derived from the provided inputs.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Primarily performs in-memory transformations.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import build_seed_topic_usage
>>> build_seed_topic_usage(...)

update_seed_memory(qa_output, question_inputs, used_seed_hashes, seed_topic_usage)

Update seed memory.

Parameters:

Name Type Description Default
qa_output Dict[str, Any]

Output of the QA step for the turn.

required
question_inputs Dict[str, Any]

Question inputs built for the turn (see build_question_inputs).

required
used_seed_hashes set[str]

Hashes of seed questions that have already been used; updated by this call.

required
seed_topic_usage Dict[str, int]

Usage counts per seed topic; updated by this call.

required

Returns:

Name Type Description
None None

No value is returned.

Raises:

Type Description
Exception

Propagates unexpected runtime errors from downstream calls.

Side Effects / I/O: - Returns nothing; updates used_seed_hashes and seed_topic_usage in place.

Preconditions / Invariants: - Callers should provide arguments matching annotated types and expected data contracts.

Examples:

>>> from dlgforge.pipeline.sampling import update_seed_memory
>>> update_seed_memory(...)
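
Because the function returns None, its effect has to come from mutating the collections passed in. The sketch below shows that pattern under assumptions: the "seed_question" and "seed_topic" keys are hypothetical, and qa_output is accepted only to mirror the signature.

```python
import hashlib
from typing import Any, Dict, Set


def seed_hash(question: str) -> str:
    """Hash a seed question after trimming and lowercasing it."""
    return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()


def update_seed_memory_sketch(
    qa_output: Dict[str, Any],  # unused in this sketch; kept to mirror the signature
    question_inputs: Dict[str, Any],
    used_seed_hashes: Set[str],
    seed_topic_usage: Dict[str, int],
) -> None:
    """Record the seed question and topic used for a turn, mutating both arguments."""
    seed_question = str(question_inputs.get("seed_question", ""))  # hypothetical key
    seed_topic = str(question_inputs.get("seed_topic", ""))        # hypothetical key
    if not seed_question:
        return  # no seed was used on this turn, so there is nothing to record
    used_seed_hashes.add(seed_hash(seed_question))
    if seed_topic:
        seed_topic_usage[seed_topic] = seed_topic_usage.get(seed_topic, 0) + 1


used: Set[str] = set()
topic_usage: Dict[str, int] = {"setup": 1}
update_seed_memory_sketch(
    {}, {"seed_question": "How do I install it?", "seed_topic": "setup"}, used, topic_usage
)
update_seed_memory_sketch({}, {}, used, topic_usage)  # no seed: leaves both unchanged
```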