Indexing

Search

class Lookup(id: str, data: dict[str, Any], score: float = 0.0)

Bases: NamedTuple

Search result entry.
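
Since Lookup is a NamedTuple, a result can be read by field name or unpacked positionally. The sketch below re-declares the class from the signature above purely for illustration; the sample IDs and payloads are made up.

```python
from typing import Any, NamedTuple


class Lookup(NamedTuple):
    """Illustrative re-declaration based on the documented signature."""

    id: str
    data: dict[str, Any]
    score: float = 0.0


# Hypothetical sample values, not taken from any real index.
hit = Lookup("Q42", {"label": "Douglas Adams"}, 0.87)

# Positional unpacking works as for any tuple.
entity_id, payload, score = hit

# Omitting the score falls back to the documented default of 0.0.
default_hit = Lookup("Q1", {})
```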

class SearchIndex

Bases: object

In-memory search index for entity lookup with N-gram tokenization.

SCORE_ROUND: ClassVar[int] = 2

Number of decimals for score rounding.

JACCARD_WEIGHT: ClassVar[float] = 0.8

Weight factor for Jaccard similarity in scoring.

build(path, data)

Build index from structured data at given path.

get(entity_id)

Retrieve entity data by ID.

search(query, threshold)

Search entities with similarity scoring.

Parameters:

Return type:

list[Lookup]

Returns:

List of matched entities sorted by relevance.
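
The reference does not spell out how the score is computed, so the following is only a sketch of one plausible scheme. It assumes the score blends a Jaccard similarity over N-gram sets (weighted by JACCARD_WEIGHT) with a Jaccard similarity over word sets (the remaining weight), rounds to SCORE_ROUND decimals, drops results below the threshold, and sorts by descending score. The function names `jaccard` and `rank`, the blend itself, and the sample data are all assumptions.

```python
SCORE_ROUND = 2       # value documented for SearchIndex.SCORE_ROUND
JACCARD_WEIGHT = 0.8  # value documented for SearchIndex.JACCARD_WEIGHT


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank(query_grams, query_words, candidates, threshold):
    """Hypothetical ranking: blend two Jaccard scores, filter, then sort."""
    hits = []
    for entity_id, (grams, words) in candidates.items():
        blended = (JACCARD_WEIGHT * jaccard(query_grams, grams)
                   + (1 - JACCARD_WEIGHT) * jaccard(query_words, words))
        score = round(blended, SCORE_ROUND)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest score first.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Made-up candidates: entity id -> (N-gram set, word set).
candidates = {
    "a": ({"ab", "bc"}, {"abc"}),
    "b": ({"ab", "xy"}, {"abx"}),
}
strict = rank({"ab", "bc"}, {"abc"}, candidates, threshold=0.5)
loose = rank({"ab", "bc"}, {"abc"}, candidates, threshold=0.2)
```

Raising the threshold trades recall for precision: with 0.5 only the exact match survives, while 0.2 also keeps the partial match.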

get(path)

Retrieve parser for file path.

Return type:

Callable[[Any], Iterator[tuple[str, dict[str, Any], list[str]]]]

normalize(text)

Normalize text: case folding, unicode normalization, punctuation removal.
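
The three steps named above could look roughly like this with the standard library. The exact normalization form, whether accents are stripped, and how punctuation is replaced are not documented here; NFKD decomposition, accent removal, and punctuation-to-space are assumptions of this sketch.

```python
import unicodedata


def normalize(text: str) -> str:
    # Assumed form: NFKD splits accented characters into base + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Assumption: drop the combining marks so "é" compares equal to "e".
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Case folding is more aggressive than lower() (e.g. "ß" becomes "ss").
    folded = stripped.casefold()
    # Replace punctuation (Unicode categories starting with "P") with spaces.
    cleaned = "".join(
        " " if unicodedata.category(c).startswith("P") else c for c in folded
    )
    # Collapse runs of whitespace left behind by the previous step.
    return " ".join(cleaned.split())
```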

words(text)

Tokenize text into normalized word set.
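
Returning a set means duplicates collapse and word order is discarded. A minimal sketch, using a simplified stand-in for normalization (case folding plus a regex split) rather than the module's own normalize:

```python
import re


def words(text: str) -> set[str]:
    # Stand-in normalization: case-fold, then split on runs of characters
    # that are not letters, digits, or underscores; skip empty pieces.
    return {w for w in re.split(r"\W+", text.casefold()) if w}
```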

ngrams(token, n=3)

Generate N-grams from token with edge padding.
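
Edge padding gives the first and last characters of a token their own N-grams, so prefixes and suffixes contribute to the match. The padding character, the padding width (n - 1 on each side), and the list return type are assumptions of this sketch, not documented behavior.

```python
def ngrams(token: str, n: int = 3, pad: str = "_") -> list[str]:
    # Assumed padding: n - 1 marker characters on each edge.
    padded = pad * (n - 1) + token + pad * (n - 1)
    # Slide a window of width n across the padded token.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Under these assumptions, `ngrams("cat")` yields `__c`, `_ca`, `cat`, `at_`, `t__`.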

ngramize(text)

Convert text to N-gram set for fuzzy matching.
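
To illustrate why N-gram sets help with fuzzy matching, the sketch below builds a set per text and compares a correctly spelled word with a transposed variant. It is self-contained and assumes whitespace tokenization, case folding, and `_` padding of width n - 1; the real method may differ on all three.

```python
def ngramize(text: str, n: int = 3, pad: str = "_") -> set[str]:
    grams: set[str] = set()
    # Assumed tokenization: case-fold and split on whitespace.
    for token in text.casefold().split():
        padded = pad * (n - 1) + token + pad * (n - 1)
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


a = ngramize("Berlin")
b = ngramize("Berlni")  # transposition typo
shared = a & b          # grams both spellings have in common
```

Even with the typo, the two spellings still share the leading grams (`__b`, `_be`, `ber`, `erl`), which an exact string comparison would miss.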