Documents and Embedding Config

Configuration example

cache_folder: /path/to/cache/folder ## specify a cache folder for embedding models, Hugging Face and Sentence Transformers

embeddings:
  # ** Attention ** - `embeddings_path` should be unique per configuration file.
  embeddings_path: /path/to/embedding/folder ## specify a folder where embeddings will be saved.
  
  embedding_model: # Optional embedding model specification; defaults to hkunlp/instructor-large (type "instruct"). Swap to a smaller model if out of CUDA memory
    # Supported types: "sentence_transformer", "huggingface", "instruct", "openai"
    type: sentence_transformer
    model_name: 'Qwen/Qwen3-Embedding-0.6B'

  
  splade_config: # Optional batch size of sparse embeddings. Reduce if getting out-of-memory errors on CUDA.
    n_batch: 5
  
  chunk_sizes: # Specify one or more chunk sizes to split documents into (querying multi-chunk results will be slower)
    - 1024

  document_settings:

  # Can specify multiple documents collections and filter by label

  - doc_path: /path/to/documents ## specify the docs folder
    exclude_paths: # Optional paths to exclude
      - /path/to/documents/subfolder1
      - /path/to/documents/subfolder2
    scan_extensions: # specifies files extensions to scan recursively in `doc_path`. 
      - pdf
      - md
    additional_parser_settings: # Optional section, don't have to include
      md: 
        skip_first: True  # Skip first section which often contains metadata
        merge_sections: False # Merge # headings if possible; can be turned on or off depending on document structure
        remove_images: True # Remove image links
    
    # Optional setting
    # For azuredoc support - pip install "pyllmsearch[azureparser]"
    pdf_table_parser: gmft # or azuredoc

    # Optional setting
    pdf_image_parser:
        image_parser: gemini-1.5-pro # or gemini-1.5-flash
        system_instruction: |
            You are a research assistant. You analyze the image to extract detailed information. Response must be a Markdown string in the following format:
            - First line is a heading with image caption, starting with '# '
            - Second line is empty
            - From the third line on - detailed data points and related metadata, extracted from the image, in Markdown format. Don't use Markdown tables.

    
    passage_prefix: "passage: " # Embedding models often require a specific prefix on the source text to work properly
    label: "document-collection-1" # Add a label to the current collection
  
  - doc_path: /another/path/to/documents ## specify the docs folder
    scan_extensions: # specifies files extensions to scan recursively in `doc_path`. 
      - md
    
    passage_prefix: "passage: " # Embedding models often require a specific prefix on the source text to work properly
    label: "document-collection-2" # Add a label to the current collection

semantic_search:
  search_type: similarity # Currently, only similarity is supported
  replace_output_path: # Can specify list of search/replace settings
    - substring_search: "/storage/llm/docs/" ## Specifies substring to replace  in the output path of the document
      substring_replace: "obsidian://open?vault=knowledge-base&file=" ## Replaces with this string

  append_suffix: # Specifies additional template to append to an output path, useful for deep linking
    append_template: "#page={page}" # For example, appends the page number taken from the document parser's metadata

  # Will ensure that context provided to LLM is less than max_char_size. Useful for locally hosted models and limited hardware. 
  # Reduce if out of CUDA memory.
  max_char_size: 16384 # Reduce if necessary for locally hosted LLMs

  # Maximum number of text chunks to retrieve for dense and sparse embeddings
  # Total number of chunks is max_k * 2
  max_k: 25
  
  query_prefix: "query: " # Often queries have to be prefixed for embedding models, such as e5

  score_cutoff: -3.0 # Optional reranker score cutoff. Documents below this score will be excluded from the returned document list

  hyde:
    enabled: False
  
  multiquery: 
    enabled: False
  
  reranker:
    enabled: True
    model: "bge" # "bge" for BAAI/bge-reranker-base, or "marco" for cross-encoder/ms-marco-MiniLM-L-6-v2
  
  # Optionally enable conversation history settings (default False)
  conversation_history_settings:
    enabled: True
    max_history_length: 3
    rewrite_query: True

  

persist_response_db_path: "/path/to/responses.db" # Optional SQLite database filename. Saves responses offline to SQLite for future analysis.
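
The warning that `embeddings_path` should be unique per configuration file can be checked mechanically once several configurations have been loaded into dictionaries. The sketch below is illustrative only: `find_shared_embeddings_paths` is a hypothetical helper, not part of llmsearch, and it assumes each configuration is a plain dict with the same layout as the example above.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_embeddings_paths(configs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Hypothetical check: map each embeddings_path used by several configs to their names."""
    users: Dict[str, List[str]] = defaultdict(list)
    for name, config in configs.items():
        # Group configuration names by the embeddings folder they write to.
        users[str(config["embeddings"]["embeddings_path"])].append(name)
    # Only paths claimed by more than one configuration are a problem.
    return {path: names for path, names in users.items() if len(names) > 1}


configs = {
    "notes.yaml": {"embeddings": {"embeddings_path": "/data/emb/notes"}},
    "papers.yaml": {"embeddings": {"embeddings_path": "/data/emb/papers"}},
    "drafts.yaml": {"embeddings": {"embeddings_path": "/data/emb/notes"}},
}
print(find_shared_embeddings_paths(configs))
```

An empty result means no two configurations point at the same embeddings folder.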

Document Config Reference

class llmsearch.config.Config(*, cache_folder: Path, embeddings: EmbeddingsConfig, semantic_search: SemanticSearchConfig, llm: LLMConfig | None = None, persist_response_db_path: str | None = None)

cache_folder: Path

Configures path to cache LLM and embedding models.

check_embeddings_exist() → bool

Checks if embeddings exist in the specified folder.

embeddings: EmbeddingsConfig

Configures document paths and embedding settings.

llm: LLMConfig | None

Don’t use directly.

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'cache_folder': FieldInfo(annotation=Path, required=True), 'embeddings': FieldInfo(annotation=EmbeddingsConfig, required=True), 'llm': FieldInfo(annotation=Union[LLMConfig, NoneType], required=False), 'persist_response_db_path': FieldInfo(annotation=Union[str, NoneType], required=False), 'semantic_search': FieldInfo(annotation=SemanticSearchConfig, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

persist_response_db_path: str | None

Optional path for SQLite database for results storage.

semantic_search: SemanticSearchConfig

Configures semantic search settings.

class llmsearch.config.ConversationHistoryQAPair(*, question: str, answer: str)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'answer': FieldInfo(annotation=str, required=True), 'question': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.ConversrationHistorySettings(*, enabled: bool = False, max_history_length: int, rewrite_query: bool, history: List[ConversationHistoryQAPair] = None, template_instruction: str = 'When answering questions, take into consideration the history of the chat converastion, which is listed below under Chat History. The chat history is in reverse chronological order, so the most recent exhange is at the top.', template_contextualize: str = "\n    Given a chat history and the latest user question which might reference to context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, return only reformulated question. Do NOT mention it is 'reformulated question', return only body of the question and nothing else.\n\n    {chat_history}\n\n    User question: {user_question}\n    ", template_header: str = '\nChat History:\n=============\n', template_qa_pairs: str = 'User: {question}\nAssistant: {answer}\n\n')

history: List[ConversationHistoryQAPair]

Keeps the history of conversation pairs, up to max_history_length.

max_history_length: int

Maximum number of conversation history pairs to remember (single pair = query + response).

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'enabled': FieldInfo(annotation=bool, required=False, default=False), 'history': FieldInfo(annotation=List[ConversationHistoryQAPair], required=False, default_factory=list), 'max_history_length': FieldInfo(annotation=int, required=True), 'rewrite_query': FieldInfo(annotation=bool, required=True), 'template_contextualize': FieldInfo(annotation=str, required=False, default="\n    Given a chat history and the latest user question which might reference to context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, return only reformulated question. Do NOT mention it is 'reformulated question', return only body of the question and nothing else.\n\n    {chat_history}\n\n    User question: {user_question}\n    "), 'template_header': FieldInfo(annotation=str, required=False, default='\nChat History:\n=============\n'), 'template_instruction': FieldInfo(annotation=str, required=False, default='When answering questions, take into consideration the history of the chat converastion, which is listed below under Chat History. The chat history is in reverse chronological order, so the most recent exhange is at the top.'), 'template_qa_pairs': FieldInfo(annotation=str, required=False, default='User: {question}\nAssistant: {answer}\n\n')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

rewrite_query: bool

Rewrite query for better context understanding
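
How `max_history_length` and the default templates fit together can be pictured with a small bounded buffer. This is a minimal sketch, not the library's implementation: `BoundedHistory` is a hypothetical class, and only the two template strings are copied from the defaults in the signature above. It assumes the oldest pair is dropped once the limit is reached and that pairs are rendered most recent first, as the default `template_instruction` describes.

```python
from collections import deque
from typing import Deque, Tuple

# Defaults copied from the ConversrationHistorySettings signature.
TEMPLATE_HEADER = "\nChat History:\n=============\n"
TEMPLATE_QA_PAIRS = "User: {question}\nAssistant: {answer}\n\n"


class BoundedHistory:
    """Hypothetical stand-in for the history kept by the settings object."""

    def __init__(self, max_history_length: int) -> None:
        # deque(maxlen=...) silently discards the oldest pair once full.
        self.pairs: Deque[Tuple[str, str]] = deque(maxlen=max_history_length)

    def add(self, question: str, answer: str) -> None:
        self.pairs.append((question, answer))

    def render(self) -> str:
        # Most recent exchange first (reverse chronological order).
        body = "".join(
            TEMPLATE_QA_PAIRS.format(question=q, answer=a)
            for q, a in reversed(self.pairs)
        )
        return TEMPLATE_HEADER + body


history = BoundedHistory(max_history_length=2)
history.add("q1", "a1")
history.add("q2", "a2")
history.add("q3", "a3")  # pushes out the (q1, a1) pair
print(history.render())
```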

class llmsearch.config.Document(*, page_content: str, metadata: dict = None)

Interface for interacting with a document.

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'metadata': FieldInfo(annotation=dict, required=False, default_factory=dict), 'page_content': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.DocumentPathSettings(*, doc_path: Annotated[Path, PathType(path_type=dir)] | str, exclude_paths: List[Annotated[Path, PathType(path_type=dir)] | str] = None, scan_extensions: List[str], pdf_table_parser: PDFTableParser | None = None, pdf_image_parser: PDFImageParseSettings | None = None, additional_parser_settings: Dict[str, Any] = None, passage_prefix: str = '', label: str = '')

additional_parser_settings: Dict[str, Any]

Optional parser settings (parser dependent)

doc_path: Annotated[Path, PathType(path_type=dir)] | str

Defines document folder for a given document set.

exclude_paths: List[Annotated[Path, PathType(path_type=dir)] | str]

List of folders to exclude from scanning.

label: str

Optional label for the document set, will be included in the metadata.

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'additional_parser_settings': FieldInfo(annotation=Dict[str, Any], required=False, default_factory=dict), 'doc_path': FieldInfo(annotation=Union[Annotated[Path, PathType], str], required=True), 'exclude_paths': FieldInfo(annotation=List[Union[Annotated[Path, PathType], str]], required=False, default_factory=list), 'label': FieldInfo(annotation=str, required=False, default=''), 'passage_prefix': FieldInfo(annotation=str, required=False, default=''), 'pdf_image_parser': FieldInfo(annotation=Union[PDFImageParseSettings, NoneType], required=False), 'pdf_table_parser': FieldInfo(annotation=Union[PDFTableParser, NoneType], required=False), 'scan_extensions': FieldInfo(annotation=List[str], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

pdf_image_parser: PDFImageParseSettings | None

If enabled, parses images in PDF files using the specified parser.

pdf_table_parser: PDFTableParser | None

If enabled, parses tables in PDF files using the specified parser.

scan_extensions: List[str]

List of extensions to scan.
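
The way `doc_path`, `scan_extensions` and `exclude_paths` might interact can be sketched with `pathlib`. The function below is a hypothetical illustration, not the library's scanner. It assumes that extensions are given without a leading dot (as in the configuration example) and that a file is skipped when it sits anywhere inside an excluded folder.

```python
from pathlib import Path
from typing import Iterable, List


def scan_documents(
    doc_path: Path,
    scan_extensions: Iterable[str],
    exclude_paths: Iterable[Path] = (),
) -> List[Path]:
    """Hypothetical scanner: recursively list files under doc_path by extension."""
    excluded = [Path(p).resolve() for p in exclude_paths]
    found: List[Path] = []
    for ext in scan_extensions:
        for path in Path(doc_path).rglob(f"*.{ext}"):
            if not path.is_file():
                continue
            # Skip anything located inside an excluded folder.
            if any(ex in path.resolve().parents for ex in excluded):
                continue
            found.append(path)
    return sorted(found)
```

Calling `scan_documents(Path("/path/to/documents"), ["pdf", "md"], [Path("/path/to/documents/subfolder1")])` would mirror the first document collection in the example configuration.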

class llmsearch.config.EmbedddingsSpladeConfig(*, n_batch: int = 3)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'n_batch': FieldInfo(annotation=int, required=False, default=3)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.EmbeddingModel(*, type: EmbeddingModelType, model_name: str, additional_kwargs: dict = None)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'additional_kwargs': FieldInfo(annotation=dict, required=False, default_factory=dict), 'model_name': FieldInfo(annotation=str, required=True), 'type': FieldInfo(annotation=EmbeddingModelType, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.EmbeddingModelType(value)

class llmsearch.config.EmbeddingsConfig(*, embedding_model: ~llmsearch.config.EmbeddingModel = EmbeddingModel(type=<EmbeddingModelType.instruct: 'instruct'>, model_name='hkunlp/instructor-large', additional_kwargs={}), embeddings_path: ~pathlib.Annotated[~pathlib.Path, ~pydantic.types.PathType(path_type=dir)] | str, document_settings: ~typing.List[~llmsearch.config.DocumentPathSettings], chunk_sizes: ~typing.List[int] = [1024], splade_config: ~llmsearch.config.EmbedddingsSpladeConfig = EmbedddingsSpladeConfig(n_batch=5))

chunk_sizes: List[int]

List of chunk sizes for text chunking; supports multiple sizes.
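
To see why configuring several chunk sizes increases the amount of indexed text, consider the simplified sketch below. It is an assumption-laden illustration: `chunk_text` is a hypothetical helper that cuts fixed-size character windows, whereas the actual chunking in the library depends on the parser and may work quite differently.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_sizes: List[int]) -> Dict[int, List[str]]:
    """Hypothetical splitter: cut text into fixed-size character windows per size."""
    return {
        size: [text[i : i + size] for i in range(0, len(text), size)]
        for size in chunk_sizes
    }


chunks = chunk_text("a" * 2500, [1024, 512])
for size, parts in chunks.items():
    print(size, len(parts))
```

Every configured size produces its own full set of chunks, so each additional size adds roughly another copy of the corpus to embed and search.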

document_settings: List[DocumentPathSettings]

Defines settings for one or more document sets.

embedding_model: EmbeddingModel

Specifies embedding model to use for dense embeddings.

embeddings_path: Annotated[Path, PathType(path_type=dir)] | str

Specifies output folder for embeddings.

property labels: List[str]

Returns list of labels in document settings

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'chunk_sizes': FieldInfo(annotation=List[int], required=False, default=[1024]), 'document_settings': FieldInfo(annotation=List[DocumentPathSettings], required=True), 'embedding_model': FieldInfo(annotation=EmbeddingModel, required=False, default=EmbeddingModel(type=<EmbeddingModelType.instruct: 'instruct'>, model_name='hkunlp/instructor-large', additional_kwargs={})), 'embeddings_path': FieldInfo(annotation=Union[Annotated[Path, PathType], str], required=True), 'splade_config': FieldInfo(annotation=EmbedddingsSpladeConfig, required=False, default=EmbedddingsSpladeConfig(n_batch=5))}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

splade_config: EmbedddingsSpladeConfig

Specifies settings for sparse embeddings (SPLADE).

class llmsearch.config.HydeSettings(*, enabled: bool = False, hyde_prompt: str = 'Write a short passage to answer the question: {question}')

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'enabled': FieldInfo(annotation=bool, required=False, default=False), 'hyde_prompt': FieldInfo(annotation=str, required=False, default='Write a short passage to answer the question: {question}')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.
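
The `hyde_prompt` default hints at how HyDE (Hypothetical Document Embeddings) is typically used: an LLM first writes a short passage that answers the question, and that passage is then used for retrieval. The sketch below is a hedged illustration; `hyde_query` and the `llm` callable are hypothetical, and whether the library embeds the passage alone or combines it with the question is not stated in this reference.

```python
from typing import Callable

# Default copied from the HydeSettings signature.
HYDE_PROMPT = "Write a short passage to answer the question: {question}"


def hyde_query(question: str, llm: Callable[[str], str]) -> str:
    """Hypothetical helper: return an LLM-written passage to embed in place of the question."""
    return llm(HYDE_PROMPT.format(question=question))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return f"[passage for: {prompt}]"


print(hyde_query("What is SPLADE?", fake_llm))
```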

class llmsearch.config.MultiQuerySettings(*, enabled: bool = False, multiquery_prompt: str = "You are a helpful assistant that generates multiple questions based on the source question.\n    Generate {n_versions} additional related questions related to: ```{question}```.\n    \n    Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic.\n    Make sure they are complete questions, and that they are related to the original question.\n\n    Generated questions should be separated by newlines, but shouldn't be enumerated.\n    ", n_versions: int = 5)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'enabled': FieldInfo(annotation=bool, required=False, default=False), 'multiquery_prompt': FieldInfo(annotation=str, required=False, default="You are a helpful assistant that generates multiple questions based on the source question.\n    Generate {n_versions} additional related questions related to: ```{question}```.\n    \n    Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic.\n    Make sure they are complete questions, and that they are related to the original question.\n\n    Generated questions should be separated by newlines, but shouldn't be enumerated.\n    "), 'n_versions': FieldInfo(annotation=int, required=False, default=5)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.
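
The default `multiquery_prompt` asks the model for newline-separated questions that are not enumerated. A minimal sketch of turning such a response into a list follows. `parse_generated_questions` is a hypothetical helper, and capping the result at `n_versions` is an assumption rather than documented behaviour.

```python
from typing import List


def parse_generated_questions(llm_output: str, n_versions: int) -> List[str]:
    """Hypothetical parser: split newline-separated questions and drop blank lines."""
    lines = [line.strip() for line in llm_output.splitlines()]
    questions = [line for line in lines if line]
    # Assumption: keep at most n_versions questions.
    return questions[:n_versions]


response = "What is a reranker?\n\nHow are chunks scored?\n  Which models are supported?\n"
print(parse_generated_questions(response, n_versions=5))
```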

class llmsearch.config.ObsidianAdvancedURI(*, append_heading_template: str)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'append_heading_template': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.PDFImageParseSettings(*, image_parser: PDFImageParser, system_instruction: str = "You are an research assistant. You analyze the image to extract detailed information. Response must be a Markdown string in the follwing format:\n- First line is a heading with image caption, starting with '# '\n- Second line is empty\n- From the third line on - detailed data points and related metadata, extracted from the image, in Markdown format. Don't use Markdown tables. \n", user_instruction: str = 'From the image, extract detailed quantitative and qualitative data points.')

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'image_parser': FieldInfo(annotation=PDFImageParser, required=True), 'system_instruction': FieldInfo(annotation=str, required=False, default="You are an research assistant. You analyze the image to extract detailed information. Response must be a Markdown string in the follwing format:\n- First line is a heading with image caption, starting with '# '\n- Second line is empty\n- From the third line on - detailed data points and related metadata, extracted from the image, in Markdown format. Don't use Markdown tables. \n"), 'user_instruction': FieldInfo(annotation=str, required=False, default='From the image, extract detailed quantitative and qualitative data points.')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.PDFImageParser(value)

class llmsearch.config.PDFTableParser(value)

class llmsearch.config.ReplaceOutputPath(*, substring_search: str, substring_replace: str)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'substring_replace': FieldInfo(annotation=str, required=True), 'substring_search': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.RerankerModel(value)

class llmsearch.config.RerankerSettings(*, enabled: bool = True, model: RerankerModel = RerankerModel.BGE_RERANKER)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'enabled': FieldInfo(annotation=bool, required=False, default=True), 'model': FieldInfo(annotation=RerankerModel, required=False, default=<RerankerModel.BGE_RERANKER: 'bge'>)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.ResponseModel(*, id: UUID = None, question: str, response: str, average_score: float, semantic_search: List[SemanticSearchOutput] = None, hyde_response: str = '')

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'average_score': FieldInfo(annotation=float, required=True), 'hyde_response': FieldInfo(annotation=str, required=False, default=''), 'id': FieldInfo(annotation=UUID, required=False, default_factory=create_uuid), 'question': FieldInfo(annotation=str, required=True), 'response': FieldInfo(annotation=str, required=True), 'semantic_search': FieldInfo(annotation=List[SemanticSearchOutput], required=False, default_factory=list)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class llmsearch.config.SemanticSearchConfig(*, search_type: ~typing.Literal['mmr', 'similarity'], replace_output_path: ~typing.List[~llmsearch.config.ReplaceOutputPath] = None, obsidian_advanced_uri: ~llmsearch.config.ObsidianAdvancedURI | None = None, append_suffix: ~llmsearch.config.SuffixAppend | None = None, reranker: ~llmsearch.config.RerankerSettings = RerankerSettings(enabled=True, model=<RerankerModel.BGE_RERANKER: 'bge'>), max_k: int = 15, score_cutoff: float | None = None, max_char_size: int = 16384, query_prefix: str = '', hyde: ~llmsearch.config.HydeSettings = HydeSettings(enabled=False, hyde_prompt='Write a short passage to answer the question: {question}'), multiquery: ~llmsearch.config.MultiQuerySettings = MultiQuerySettings(enabled=False, multiquery_prompt="You are a helpful assistant that generates multiple questions based on the source question.\n    Generate {n_versions} additional related questions related to: ```{question}```.\n    \n    Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic.\n    Make sure they are complete questions, and that they are related to the original question.\n\n    Generated questions should be separated by newlines, but shouldn't be enumerated.\n    ", n_versions=5), conversation_history_settings: ~llmsearch.config.ConversrationHistorySettings = ConversrationHistorySettings(enabled=False, max_history_length=2, rewrite_query=True, history=[], template_instruction='When answering questions, take into consideration the history of the chat converastion, which is listed below under Chat History. The chat history is in reverse chronological order, so the most recent exhange is at the top.', template_contextualize="\n    Given a chat history and the latest user question which might reference to context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, return only reformulated question. Do NOT mention it is 'reformulated question', return only body of the question and nothing else.\n\n    {chat_history}\n\n    User question: {user_question}\n    ", template_header='\nChat History:\n=============\n', template_qa_pairs='User: {question}\nAssistant: {answer}\n\n'))

append_suffix: SuffixAppend | None

Appends a suffix to the document URL. Useful for deep linking, for example to open the document in an external application such as Obsidian.
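
The combined effect of `replace_output_path` and `append_suffix` can be pictured as plain substring replacement followed by template formatting. The sketch below is illustrative only: `build_link` is a hypothetical helper, not part of llmsearch, and it assumes the suffix template is filled from the chunk's metadata. The rule and template values are taken from the configuration example.

```python
from typing import Dict, List, Optional


def build_link(
    path: str,
    replacements: List[Dict[str, str]],
    append_template: Optional[str] = None,
    metadata: Optional[Dict[str, object]] = None,
) -> str:
    """Hypothetical helper: rewrite a document path into a clickable link."""
    # Apply each search/replace pair in the order it is listed.
    for rule in replacements:
        path = path.replace(rule["substring_search"], rule["substring_replace"])
    # Optionally append a suffix rendered from parser metadata, e.g. "#page=3".
    if append_template is not None:
        path += append_template.format(**(metadata or {}))
    return path


rules = [
    {
        "substring_search": "/storage/llm/docs/",
        "substring_replace": "obsidian://open?vault=knowledge-base&file=",
    }
]
link = build_link("/storage/llm/docs/notes/report.pdf", rules, "#page={page}", {"page": 3})
print(link)  # obsidian://open?vault=knowledge-base&file=notes/report.pdf#page=3
```

In this sketch a template placeholder that is missing from the metadata raises `KeyError`, so a real implementation would need to decide how to handle documents without page information.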

conversation_history_settings: ConversrationHistorySettings

Configures conversation history settings.

hyde: HydeSettings

Optional configuration for HyDE.

max_char_size: int

Maximum character size for query + documents to fit into context window of LLM.
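
One simple way to honour such a character budget is to keep ranked chunks until the next one would no longer fit. This is a sketch under stated assumptions: `fit_to_budget` is a hypothetical helper, and the greedy, stop-at-first-overflow strategy is chosen for illustration rather than taken from the library.

```python
from typing import List


def fit_to_budget(query: str, chunks: List[str], max_char_size: int) -> List[str]:
    """Hypothetical helper: keep ranked chunks while query + chunks fit the budget."""
    remaining = max_char_size - len(query)
    kept: List[str] = []
    for chunk in chunks:
        if len(chunk) > remaining:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        remaining -= len(chunk)
    return kept


ranked = ["a" * 10, "b" * 10, "c" * 10]
print(len(fit_to_budget("what", ranked, max_char_size=25)))
```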

max_k: int

Maximum number of documents to retrieve for dense OR sparse embeddings (if using both, the number of documents will be max_k * 2).
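
The doubling follows directly from taking up to `max_k` results from each retriever. The sketch below illustrates the arithmetic; `merge_candidates` is a hypothetical helper, and the order-preserving de-duplication is an assumption added for illustration.

```python
from typing import List


def merge_candidates(dense: List[str], sparse: List[str], max_k: int) -> List[str]:
    """Hypothetical merge: take up to max_k ids per retriever, dropping duplicates."""
    merged = dense[:max_k] + sparse[:max_k]
    # dict.fromkeys keeps the first occurrence of each id and preserves order.
    return list(dict.fromkeys(merged))


dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "d", "e"]
print(merge_candidates(dense_hits, sparse_hits, max_k=2))
```

The result never holds more than `max_k * 2` entries, and fewer when both retrievers return the same chunk.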

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {'arbitrary_types_allowed': True, 'extra': 'forbid'}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'append_suffix': FieldInfo(annotation=Union[SuffixAppend, NoneType], required=False), 'conversation_history_settings': FieldInfo(annotation=ConversrationHistorySettings, required=False, default=ConversrationHistorySettings(enabled=False, max_history_length=2, rewrite_query=True, history=[], template_instruction='When answering questions, take into consideration the history of the chat converastion, which is listed below under Chat History. The chat history is in reverse chronological order, so the most recent exhange is at the top.', template_contextualize="\n    Given a chat history and the latest user question which might reference to context in the chat history, formulate a standalone question which can be understood without the chat history. Do NOT answer the question, return only reformulated question. Do NOT mention it is 'reformulated question', return only body of the question and nothing else.\n\n    {chat_history}\n\n    User question: {user_question}\n    ", template_header='\nChat History:\n=============\n', template_qa_pairs='User: {question}\nAssistant: {answer}\n\n')), 'hyde': FieldInfo(annotation=HydeSettings, required=False, default=HydeSettings(enabled=False, hyde_prompt='Write a short passage to answer the question: {question}')), 'max_char_size': FieldInfo(annotation=int, required=False, default=16384), 'max_k': FieldInfo(annotation=int, required=False, default=15), 'multiquery': FieldInfo(annotation=MultiQuerySettings, required=False, default=MultiQuerySettings(enabled=False, multiquery_prompt="You are a helpful assistant that generates multiple questions based on the source question.\n    Generate {n_versions} additional related questions related to: ```{question}```.\n    \n    Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic.\n    Make sure they are complete questions, and that they are related to the original question.\n\n    Generated questions should be separated by newlines, but shouldn't be enumerated.\n    ", n_versions=5)), 'obsidian_advanced_uri': FieldInfo(annotation=Union[ObsidianAdvancedURI, NoneType], required=False), 'query_prefix': FieldInfo(annotation=str, required=False, default=''), 'replace_output_path': FieldInfo(annotation=List[ReplaceOutputPath], required=False, default_factory=list), 'reranker': FieldInfo(annotation=RerankerSettings, required=False, default=RerankerSettings(enabled=True, model=<RerankerModel.BGE_RERANKER: 'bge'>)), 'score_cutoff': FieldInfo(annotation=Union[float, NoneType], required=False), 'search_type': FieldInfo(annotation=Literal['mmr', 'similarity'], required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

multiquery: MultiQuerySettings

Optional configuration for multi-query

query_prefix: str

Prefix query with string BEFORE retrieval using embedding model.

reranker: RerankerSettings

Configures re-ranker settings.

score_cutoff: float | None

Documents with a score below the specified value are excluded from the relevant documents.
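
The cutoff amounts to a simple filter over scored documents. The following is a minimal sketch, assuming scores where higher means more relevant; `apply_score_cutoff` is a hypothetical helper and not the library's code.

```python
from typing import List, Optional, Tuple


def apply_score_cutoff(
    scored: List[Tuple[str, float]], score_cutoff: Optional[float]
) -> List[Tuple[str, float]]:
    """Hypothetical filter: drop documents scoring below the cutoff, if one is set."""
    if score_cutoff is None:
        return scored  # no cutoff configured, keep everything
    return [(doc, score) for doc, score in scored if score >= score_cutoff]


scored_docs = [("doc-a", 2.4), ("doc-b", -1.0), ("doc-c", -5.2)]
print(apply_score_cutoff(scored_docs, score_cutoff=-3.0))
```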

search_type: Literal['mmr', 'similarity']

Configure search type, currently only similarity can be used.

class llmsearch.config.SemanticSearchOutput(*, chunk_link: str, chunk_text: str, metadata: dict)

model_computed_fields = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields = {'chunk_link': FieldInfo(annotation=str, required=True), 'chunk_text': FieldInfo(annotation=str, required=True), 'metadata': FieldInfo(annotation=dict, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

llmsearch.config.load_yaml_file(config) → dict

Loads YAML file or string and returns a dictionary