Search Clients

In order to search the PAW, you will likely want to use a dedicated search client. See Using ElasticWrap for more info.

CommitClient

class elasticwrap.search_clients.commits.CommitClient(client: Elasticsearch)

Bases: SearchInterface[Document[PercevalSourceDocument[CommitSourceData]]]

A SearchInterface implementation focused on git commits.

The namespace for this client is “_git” and the log name is “Elasticwrap_Commits”.

stored_date_format

The date format used for most date fields in commit documents

Type: str

updated_on_format

The date format used specifically for the updated_on fields.

Type: str

See also

SearchInterface: The interface implemented by this client

class DateEnum(value)

Bases: Enum

An Enum used for type checking purposes when filtering by date.

This Enum holds values for all the possible date fields.

AUTHOR_DATE = 'data.AuthorDate'

COMMIT_DATE = 'data.CommitDate'

META_UPDATED_ON = 'metadata__updated_on'

anonymize_data(data: PercevalSourceDocument[CommitSourceData])

Implements the anonymize_data method. In the context of git commits, this means removing the following fields:

Author
Commit
Signed-off-by
message
commit
parents

And reformatting the following fields to only yyyy-mm:

AuthorDate
CommitDate
metadata__updated_on
updated_on (more specifically, this field is only kept if metadata__updated_on exists, and it is then constrained to the correct Unix timestamp)

Parameters: data (CommitSourceDocument) – The data to anonymize.
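As a rough illustration of the steps listed above, the following sketch anonymizes a plain dict in place. It is not the ElasticWrap implementation: the function name anonymize_commit, the git date layout, the ISO layout of metadata__updated_on, and the choice of "first day of the month, UTC" for updated_on are all assumptions made for the example.

```python
from datetime import datetime, timezone

# Assumptions for illustration only: both date layouts below are guesses,
# not values taken from the ElasticWrap source.
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"
REMOVED_FIELDS = ("Author", "Commit", "Signed-off-by", "message", "commit", "parents")


def anonymize_commit(doc: dict) -> None:
    """Hypothetical stand-in for CommitClient.anonymize_data; edits doc in place."""
    data = doc.get("data", {})

    # Drop the identifying fields.
    for field in REMOVED_FIELDS:
        data.pop(field, None)

    # Keep only year and month of the commit dates.
    for field in ("AuthorDate", "CommitDate"):
        if field in data:
            parsed = datetime.strptime(data[field], GIT_DATE_FORMAT)
            data[field] = parsed.strftime("%Y-%m")

    if "metadata__updated_on" in doc:
        parsed = datetime.fromisoformat(doc["metadata__updated_on"])
        doc["metadata__updated_on"] = parsed.strftime("%Y-%m")
        if "updated_on" in doc:
            # Align updated_on with the truncated month (first day, UTC).
            month_start = datetime(parsed.year, parsed.month, 1, tzinfo=timezone.utc)
            doc["updated_on"] = month_start.timestamp()
    else:
        # Without metadata__updated_on, updated_on is not kept.
        doc.pop("updated_on", None)


doc = {
    "data": {
        "Author": "Jane Doe <jane@example.com>",
        "AuthorDate": "Tue Aug 14 14:30:13 2018 -0700",
        "CommitDate": "Tue Aug 14 14:30:13 2018 -0700",
        "commit": "abc123",
        "files": [],
    },
    "metadata__updated_on": "2018-08-14T21:30:13+00:00",
    "updated_on": 1534282213.0,
}
anonymize_commit(doc)
```

After the call, only the date fields (reduced to year and month) and the files list remain under data.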

filter_after_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)

Only returns documents from after a certain date

Parameters

field (DateEnum) – The field to filter on.
dat (str | datetime) – The date/string representation for the date to compare to.
inclusive (bool, optional) – Whether to include commits that fall exactly on the date. Defaults to True.

See also

SearchInterface.filter_date, DateEnum

filter_before_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)

Only returns documents from before a certain date

Parameters

field (DateEnum) – The field to filter on.
dat (str | datetime) – The date/string representation for the date to compare to.
inclusive (bool, optional) – Whether to include commits that fall exactly on the date. Defaults to True.

See also

SearchInterface.filter_date, DateEnum
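The effect of the inclusive flag in filter_after_date and filter_before_date can be pictured as a choice between the gte/gt and lte/lt operators of an Elasticsearch range query. The helper below is a hypothetical sketch, not part of ElasticWrap; date_range_clause and its after parameter are invented for the example, and only ISO date strings are handled to avoid a dependency on dateutil.

```python
from datetime import datetime
from typing import Union

# Elasticsearch date pattern equivalent to Python's %Y-%m-%d.
ES_DATE_FORMAT = "yyyy-MM-dd"


def date_range_clause(
    field: str, dat: Union[str, datetime], after: bool, inclusive: bool = True
) -> dict:
    """Hypothetical helper: build a range clause for a single date bound."""
    if isinstance(dat, str):
        # Assumption: ISO strings only, to keep the sketch dependency-free.
        dat = datetime.fromisoformat(dat)
    if after:
        op = "gte" if inclusive else "gt"
    else:
        op = "lte" if inclusive else "lt"
    return {"range": {field: {op: dat.strftime("%Y-%m-%d"), "format": ES_DATE_FORMAT}}}


# Commits strictly after 1 January 2020, using the COMMIT_DATE field name.
clause = date_range_clause("data.CommitDate", "2020-01-01", after=True, inclusive=False)
```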

filter_on_hash(commits: Union[Set[str], List[str]])

Add a filter which only allows certain commits identified by their hash.

Parameters: commits (set[str] | list[str]) – The list of hashes to search on.

Commit Types

Types defined for the commit client

elasticwrap.search_clients.commits.CommitDocument: alias of Document[PercevalSourceDocument[CommitSourceData]]

elasticwrap.search_clients.commits.CommitSourceDocument: alias of PercevalSourceDocument[CommitSourceData]

class elasticwrap.search_clients.commits.CommitSourceData(_typename, _fields=None, /, **kwargs)

Bases: dict

The data of a commit as returned by Perceval.

Author: str

AuthorDate: str

Commit: str

CommitDate: str

commit: typing_extensions.Required[str]

files: typing_extensions.Required[List[CommitFileInfo]]

message: str

parents: List[str]

refs: List[str]

class elasticwrap.search_clients.commits.CommitFileInfo(_typename, _fields=None, /, **kwargs)

Bases: dict

Info for a commit as it relates to a file

action: str

added: str

file: typing_extensions.Required[str]

indexes: List[str]

modes: typing_extensions.Required[List[Literal['100644', '100755', '120000', '160000', '040000', '030000']]]

removed: str
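Note that added and removed are typed as strings rather than integers. A minimal sketch of summing them for one commit follows; count_changed_lines is a hypothetical helper, and the "-" value for a binary file is an assumption about what the data may contain.

```python
from typing import Dict, List


def count_changed_lines(files: List[Dict[str, object]]) -> Dict[str, int]:
    """Sum the added/removed counts of a commit's files, skipping non-numeric values."""
    totals = {"added": 0, "removed": 0}
    for info in files:
        for key in totals:
            value = info.get(key)
            # Both fields are optional strings; anything non-numeric is ignored.
            if isinstance(value, str) and value.isdigit():
                totals[key] += int(value)
    return totals


files = [
    {"file": "README.md", "action": "M", "added": "10", "removed": "2", "modes": ["100644"]},
    # Assumed shape for a binary file: no usable line counts.
    {"file": "logo.png", "action": "A", "added": "-", "removed": "-", "modes": ["100644"]},
    # Only file and modes are required, so the counts may be absent entirely.
    {"file": "LICENSE", "modes": ["100644"]},
]
totals = count_changed_lines(files)
```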

GenericSearchClient

class elasticwrap.search_clients.search.GenericSearchClient(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')

Bases: SearchInterface[Document[Mapping[str, Any]]]

A generic search client for arbitrary documents. Can be used to quickly use the generic search features without having knowledge about the document structure.

anonymize_data(_): Does nothing.

UnstructuredSearchClient

class elasticwrap.search_clients.search.UnstructuredSearchClient(*args, **kwargs)

Bases: GenericSearchClient

A generic search client which does not think in terms of “projects” and “namespaces”

Instead of providing a list of project names, you can directly provide a list of indices.

Intended when you simply want to use ElasticWrap without any of the PAW assumptions.

SearchInterface

class elasticwrap.search_clients.search.SearchInterface(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')

Bases: Generic[AnyDocument], NamespacedClient

A Client that is focused on searching. Should be implemented by more specific classes which know the structure of the data they are searching through and what the user is likely to want to search for.

In general, a user can construct their search using methods found in this interface + the implementing classes. Once they are done constructing the search, they can retrieve the data.

Implements a number of default methods which are useful for all search clients.

canonical_date_es_format

The date format to send to ES, which uses a different formatting syntax from Python. Set to yyyy-MM-dd.

Type: str

canonical_date_format

The format that dates should be presented in, set to %Y-%m-%d

Type: str

logger

Every search client should have a logger. Ideally, one with a unique name is used, but if none is provided, SearchInterface comes with a default logger.

Type: Logger

namespace

The namespace for a particular search client. This namespace defines on what indices the client will look by default (for example, “git” for the CommitClient).

Indices are intended to follow the naming scheme {projectname}_{namespace}.

Type: str
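The naming scheme above can be sketched as a small helper. index_names is a hypothetical function written for this example, and the project names are placeholders.

```python
from typing import List


def index_names(projects: List[str], namespace: str) -> List[str]:
    """Build index names following the {projectname}_{namespace} scheme."""
    return [f"{project}_{namespace}" for project in projects]


indices = index_names(["projecta", "projectb"], "git")
# indices == ["projecta_git", "projectb_git"]
```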

query

A dictionary that holds all the arguments that will be sent to elasticsearch. Usually, an end user should have no reason to see or interact with this dictionary.

Type: Dict[str, Any]

See also

CommitClient: Search Client focused on Git commits

Notes

The SearchInterface keeps track of what searches are being done with it and does not automatically clear out its arguments, meaning that subsequent searches will always yield the same results. If you want to reset the search, you can use the reset() method.

SearchInterface is defined as generic for a specific document structure. Implementations are encouraged to define a TypedDict representation of the structure of the documents they deal with and to specify this representation on inheritance.

add_accepted_fields(fields: Set[Union[str, Dict]]) → None

Used to filter out what fields to return when searching.

This does not filter out documents, it specifically filters out the fields in the returned documents. If you want to filter out documents, use add_filters.

By default all fields are returned, as soon as fields are added using this method, only the accepted fields will be returned.

Parameters: fields (set[str | Dict]) – The fields to return when searching. If the field you want to add is nested deeper, a dict must be provided.

See also

add_filters

add_filters(filters: List[Tuple[str, Union[Sequence[Union[str, int, float, datetime, bool]], AbstractSet[Union[str, int, float, datetime, bool]]]]], negate=False) → None

Adds a filter to the search. Any document that is returned by the search must adhere to this filter.

Parameters

filters (List[Tuple[str, ElasticPrimitiveList]]) – A list of filters to apply. A filter takes the format (term, match), which means that the value of term must be found in the match list.
negate (bool) – If True, will instead ensure that the values of term are NOT found in the match list. Defaults to False
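One plausible way to picture these (term, match) filters is as Elasticsearch terms clauses inside a bool query, with negate switching from filter to must_not. This is a sketch under that assumption; build_bool_query is a hypothetical helper and the query that ElasticWrap actually constructs may differ.

```python
from typing import Any, Dict, List, Sequence, Tuple


def build_bool_query(
    filters: List[Tuple[str, Sequence[Any]]], negate: bool = False
) -> Dict[str, Any]:
    """Turn (term, match) pairs into terms clauses of a bool query."""
    clauses = [{"terms": {term: list(match)}} for term, match in filters]
    # must_not excludes documents whose term value appears in the match list.
    key = "must_not" if negate else "filter"
    return {"bool": {key: clauses}}


# Keep only documents whose data.refs contains one of the listed values.
query = build_bool_query([("data.refs", ["refs/heads/main", "refs/heads/dev"])])

# Exclude them instead.
negated = build_bool_query([("data.refs", ["refs/heads/main"])], negate=True)
```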

add_range_filter(term: str, rang: Dict[str, str]) → None

Adds a filter to the search. Any document which does not fall into a certain range will be removed from the search.

Parameters

term (str) – The term to compare with
rang (Dict[str, str]) – A dictionary representing the range object, as defined in the ES docs [1].

[1] Elastic co. "Range Query"

abstract anonymize_data(data: SourceDocument) → None

Stub, should be specified per search client depending on the structure of the data. This method will make sure that the provided document (in the form of a dict) is stripped of all relevant PII. This happens in-place and is irreversible.

Parameters: data (SourceDocument) – The document to anonymize
Raises: NotImplementedError – Will raise this error by default if the search client has not implemented this method.

del_accepted_fields(fields: Set[Union[str, Dict]]) → None

Used to remove a set of fields from the set of accepted fields.

This is only used to remove fields that were added to the accepted list using add_accepted_fields. Elasticsearch does not allow you to remove some fields from the results and keep the rest.

Parameters: fields (set[str | Dict]) – The fields to no longer return when searching.

See also

add_accepted_fields

del_filters(filters: List[str]) → None

Remove all filters for a specific term.

Parameters: filters (list[str]) – The terms to delete the filters from.

See also

add_filters

del_range_filter(term: str) → None

Remove all range filters for a specific term.

Parameters: term (str) – The term to delete the filter from

See also

add_range_filter

del_script_filter(): Removes the filter script, if it exists

filter_date(term: str, dates: List[Tuple[LogicalComp, Union[str, datetime]]])

Removes documents for which the value of some term is outside some date range.

Parameters

term (str) – The term to filter on
dates (list[tuple[LogicalComp, str | datetime]]) – A list of tuples where the first value is the comparison and the second value is the value to compare to.

Raises

ParserError – If dateutil was unable to parse the provided date string.
OverflowError – If the provided date string would exceed C MAXINT (impressive).

Notes

If any of the provided dates in a tuple is a string, an attempt is made to parse it using the dateutil Python library.

get_all(anonymize: bool = False, **kwargs) → Iterator[List[AnyDocument]]

Gets all documents for the associated namespace in the PAW, ignoring all the currently set filters.

Parameters

anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false.
**kwargs – Additional arguments to send to the search method

Returns

A lazily paginated list of all the documents

Return type

Iterator[List[AnyDocument]]

See also

search

reset() → Dict[str, Any]

Resets the current search to simply getting all documents and sorting by whatever ES decides is correct.

Returns: The query dict for inspection. Automatically sets self.query as well.
Return type: Dict[str, Any]

search(proj: List[str] = [], anonymize: bool = False, **kwargs) → Iterator[List[AnyDocument]]

Start searching with the constructed query.

Parameters

proj (list[str], optional) – The list of projects to search through
anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false
**kwargs – Additional keyword arguments to pass along to the vanilla elasticsearch search.

Returns

A lazily paginated list of found documents. By default, the search runs in chunks of 1000 documents. This means that each call to next searches for and returns at most another 1000 documents.

Return type

Iterator[List[AnyDocument]]
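Because the return value is an iterator of pages rather than a flat list of documents, callers typically loop twice: once over pages and once over the documents in each page. The sketch below imitates that shape with an in-memory generator; paginate is a stand-in written for this example and performs no Elasticsearch calls.

```python
from typing import Dict, Iterator, List


def paginate(docs: List[Dict], page_size: int = 1000) -> Iterator[List[Dict]]:
    """Stand-in for search(): lazily yield pages of at most page_size documents."""
    for start in range(0, len(docs), page_size):
        yield docs[start:start + page_size]


pages = paginate([{"id": i} for i in range(2500)])

# Nothing is produced until the iterator is advanced.
first_page = next(pages)

# The remaining pages are consumed by ordinary iteration.
page_sizes = [len(first_page)] + [len(page) for page in pages]
# page_sizes == [1000, 1000, 500]
```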

Notes

Just like with the vanilla elasticsearch search, a query dict can be passed as a keyword argument. This dict can have different search constraints from those constructed by the search client. Before the search, both sets of search constraints are merged together. If both the provided dict and the constructed dict try to add conflicting constraints, the constraints from the provided dict are used.

This also means that the search construction can be circumvented entirely and the interface search can be directly used with a query object. This could be useful if you want to make use of the automatic pagination provided by this interface, but would prefer to construct the query yourself.

It is possible that the query you pass along manually conflicts in a more fundamental way with the constructed query (for example, combining match_all and a bool filter causes issues). Therefore it is recommended to only combine the two sources if you know what you are doing.
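The precedence rule described in these notes can be illustrated with a recursive dictionary merge in which the provided dict wins on conflicts. This is only a sketch of the idea: merge_queries is a hypothetical helper, the keys used are examples, and the merging strategy inside ElasticWrap may differ.

```python
from typing import Any, Dict


def merge_queries(constructed: Dict[str, Any], provided: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two query dicts; values from provided override conflicting ones."""
    merged = dict(constructed)
    for key, value in provided.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a dict: merge one level deeper.
            merged[key] = merge_queries(merged[key], value)
        else:
            # Conflict or new key: the provided value is used.
            merged[key] = value
    return merged


constructed = {
    "query": {"bool": {"filter": [{"terms": {"data.refs": ["refs/heads/main"]}}]}},
    "size": 1000,
}
provided = {"size": 50, "sort": ["_doc"]}
merged = merge_queries(constructed, provided)
```

Here merged keeps the constructed bool filter, takes size from the provided dict, and gains the sort key. The inputs are left unmodified.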

See also

elasticwrap.helpers.search_pit, elasticsearch.search

set_script_filter(script: str)

Sets a script for filtering.

Notes

This should really only be used for comparing two fields in the same document, i.e. doc.val1 != doc.val2. Every other case can be handled better in other ways.

For more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-query.html
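For orientation, an Elasticsearch script query comparing two fields has roughly the shape built below. script_filter is a hypothetical helper, val1 and val2 are placeholder field names, and how set_script_filter embeds the script into its own query is not shown here.

```python
from typing import Any, Dict


def script_filter(source: str) -> Dict[str, Any]:
    """Wrap a Painless script in a script query inside a bool filter."""
    return {
        "bool": {
            "filter": [
                {"script": {"script": {"source": source, "lang": "painless"}}}
            ]
        }
    }


# Keep documents where the two (placeholder) fields differ.
query = script_filter("doc['val1'].value != doc['val2'].value")
```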