Indexing

Search

class Lookup(id: str, data: dict[str, Any], score: float = 0.0)

Bases: NamedTuple

Search result entry.

id: str

Entity identifier (tuple field 0).

data: dict[str, Any]

Entity data (tuple field 1).

score: float

Similarity score between the query and the entity, from 0.0 to 1.0 (tuple field 2, default 0.0).
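As a minimal sketch of what the signature above implies, the class can be re-declared locally as a NamedTuple to show default values, field access, and unpacking. The re-declaration and the sample entity are illustrative only; in practice the class would be imported from the library.

```python
from typing import Any, NamedTuple


class Lookup(NamedTuple):
    """Local re-declaration mirroring the documented signature (illustration only)."""

    id: str
    data: dict[str, Any]
    score: float = 0.0


# Sample values are made up for the example.
hit = Lookup("e1", {"name": "Ada Lovelace"})

# Fields are reachable by name, by position, or by tuple unpacking.
entity_id, data, score = hit

# NamedTuples are immutable; _replace returns a modified copy.
rescored = hit._replace(score=0.5)
```

Because a NamedTuple is an ordinary tuple, results can be sorted, compared, and unpacked without any extra methods.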

class SearchIndex

Bases: object

In-memory search index for entity lookup with N-gram tokenization.

SCORE_ROUND: ClassVar[int] = 2

Number of decimal places to which scores are rounded.

JACCARD_WEIGHT: ClassVar[float] = 0.8

Weight applied to the Jaccard similarity component of the score.
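The two constants suggest that a score blends Jaccard similarity with some other component and is then rounded. The documentation does not state what that other component is or how the terms are combined, so the sketch below is an assumption: `blended_score` and its `other` argument are hypothetical names, and only the Jaccard formula itself is standard.

```python
SCORE_ROUND = 2       # mirrors SearchIndex.SCORE_ROUND
JACCARD_WEIGHT = 0.8  # mirrors SearchIndex.JACCARD_WEIGHT


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blended_score(a: set[str], b: set[str], other: float) -> float:
    """Hypothetical blend; the real second component is not documented."""
    raw = JACCARD_WEIGHT * jaccard(a, b) + (1 - JACCARD_WEIGHT) * other
    return round(raw, SCORE_ROUND)


similarity = jaccard({"a", "b", "c"}, {"b", "c", "d"})  # 2 shared of 4 total
score = blended_score({"a", "b", "c"}, {"b", "c", "d"}, other=1.0)
```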

build(path, data)

Build the index from structured data at the given path.

get(entity_id)

Retrieve entity data by ID.

Return type:

dict[str, Any] | None

search(query, threshold)

Search entities with similarity scoring.

Parameters:
  • query (str) – Search text.

  • threshold (float) – Minimum similarity score (0.0-1.0).

Return type:

list[Lookup]

Returns:

List of matched entities sorted by relevance.
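The threshold-and-sort behaviour described above can be sketched independently of the index. The helper `filter_and_rank` is a hypothetical name, and two details are assumptions not stated in the documentation: that the threshold is inclusive, and that ties are broken by ID.

```python
def filter_and_rank(scored: dict[str, float], threshold: float) -> list[tuple[str, float]]:
    """Keep entries scoring at or above the threshold, best match first."""
    hits = [(entity_id, s) for entity_id, s in scored.items() if s >= threshold]
    # Sort by descending score; break ties by ID for a stable, repeatable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


# Scores here are made up for the example.
ranked = filter_and_rank({"a": 0.91, "b": 0.42, "c": 0.75}, threshold=0.5)
```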

Parsing

get(path)

Retrieve the parser for a file path.

Return type:

Callable[[Any], Iterator[tuple[str, dict[str, Any], list[str]]]]
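To make the return type concrete, here is a sketch of a function that satisfies it. The meaning of the three tuple elements is not documented; the example assumes they are an entity ID, the entity data, and a list of searchable strings. `parse_records` and the `"id"`/`"name"` keys are hypothetical.

```python
from collections.abc import Callable, Iterator
from typing import Any

# Aliases for the documented return type of get(path).
Record = tuple[str, dict[str, Any], list[str]]
Parser = Callable[[Any], Iterator[Record]]


def parse_records(data: Any) -> Iterator[Record]:
    """Hypothetical parser for a list of dicts that each carry "id" and "name"."""
    for item in data:
        # Assumed layout: (entity ID, entity data, searchable strings).
        yield str(item["id"]), item, [item["name"]]


parser: Parser = parse_records
rows = list(parser([{"id": 1, "name": "Ada"}]))
```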

Tokenize

normalize(text)

Normalize text by applying case folding, Unicode normalization, and punctuation removal.

Return type:

str
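The three steps named above can be sketched with the standard library. The documentation does not specify the normalization form, whether diacritics are stripped, or how punctuation is handled, so the choices below (NFKD, dropping combining marks, replacing punctuation with spaces) are assumptions rather than the library's actual behaviour.

```python
import unicodedata


def normalize(text: str) -> str:
    """Sketch: case-fold, decompose, strip accents and punctuation."""
    # Case folding is more aggressive than lower(); for example "ß" becomes "ss".
    text = unicodedata.normalize("NFKD", text.casefold())
    # Assumption: drop combining marks so "é" matches "e".
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: punctuation becomes a space so adjacent words stay separate.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace.
    return " ".join(text.split())


cleaned = normalize("Café, MÜNCHEN!")
```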

words(text)

Tokenize text into a set of normalized words.

Return type:

set[str]
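A minimal sketch of word tokenization into a set follows. It uses a simple case fold and a `\w+` regular expression in place of the library's own normalization, which is an assumption made to keep the example self-contained.

```python
import re


def words(text: str) -> set[str]:
    """Sketch: split case-folded text into a set of unique word tokens."""
    # \w+ matches runs of letters, digits, and underscores, skipping punctuation.
    return set(re.findall(r"\w+", text.casefold()))


tokens = words("The cat, the CAT!")
```

Returning a set discards both word order and repetition, which suits set-based similarity measures such as Jaccard.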

ngrams(token, n=3)

Generate N-grams from a token, padding its edges.

Return type:

set[str]
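Edge padding lets the first and last characters of a token appear in as many N-grams as interior characters do. The sketch below assumes an underscore pad character and a pad width of n - 1 on each side; the documentation specifies neither, and the `pad` parameter is a hypothetical addition.

```python
def ngrams(token: str, n: int = 3, pad: str = "_") -> set[str]:
    """Sketch: character N-grams of a token padded on both edges."""
    if not token:
        return set()
    # Assumption: pad each side with n - 1 copies of the pad character.
    padded = pad * (n - 1) + token + pad * (n - 1)
    # Slide a window of width n across the padded token.
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


grams = ngrams("cat")
```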

ngramize(text)

Convert text to a set of N-grams for fuzzy matching.

Return type:

set[str]
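One plausible reading is that the text is split into words and the N-grams of every word are merged into a single set. The sketch below follows that reading with the same assumed padding scheme (underscore, n - 1 per side) and a simplified word splitter; the `n` and `pad` parameters are hypothetical.

```python
import re


def ngramize(text: str, n: int = 3, pad: str = "_") -> set[str]:
    """Sketch: union of padded character N-grams over all words in the text."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.casefold()):
        padded = pad * (n - 1) + word + pad * (n - 1)
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


# Misspellings still share many N-grams, which is what enables fuzzy matching.
a = ngramize("John Smith")
b = ngramize("Jon Smyth")
shared = a & b
```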