Search Clients

To search the PAW, you will likely want to use a dedicated search client. See Using ElasticWrap for more information.

CommitClient

class elasticwrap.search_clients.commits.CommitClient(client: Elasticsearch)

Bases: SearchInterface[Document[PercevalSourceDocument[CommitSourceData]]]

A SearchInterface implementation focused on git commits.

The namespace for this client is “_git”; the log name is “Elasticwrap_Commits”.

stored_date_format

The date format used for most date fields in commit documents

Type

str

updated_on_format

The date format used specifically for the updated_on fields.

Type

str

See also

SearchInterface

The interface implemented by this client

class DateEnum(value)

Bases: Enum

An Enum used for type checking purposes when filtering by date.

This Enum holds values for all the possible date fields.

AUTHOR_DATE = 'data.AuthorDate'
COMMIT_DATE = 'data.CommitDate'
META_UPDATED_ON = 'metadata__updated_on'
anonymize_data(data: PercevalSourceDocument[CommitSourceData])

Implements the anonymize_data method. In the context of git commits, this means removing the following fields:

  • Author

  • Commit

  • Signed-off-by

  • message

  • commit

  • parents

And reformatting the following fields to only yyyy-mm:

  • AuthorDate

  • CommitDate

  • metadata__updated_on

  • updated_on (more specifically, this field is only kept if metadata__updated_on exists, and it is constrained to the correct Unix timestamp)

Parameters

data (CommitSourceDocument) – The data to anonymize.
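The anonymization rules listed above can be sketched roughly as follows. This is a minimal illustration, not the ElasticWrap implementation: the function name `anonymize_commit`, the two date formats, and the placement of `metadata__updated_on` at the top level of the document are all assumptions, and the handling of `updated_on` is left out.

```python
from datetime import datetime
from typing import Any, Dict

# Assumed git date layout, e.g. "Tue Aug 14 14:30:13 2012 -0300".
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"
# Fields removed from the commit data, as listed in the documentation.
REMOVED_FIELDS = ("Author", "Commit", "Signed-off-by", "message", "commit", "parents")


def anonymize_commit(doc: Dict[str, Any]) -> None:
    """Hypothetical helper: strip identifying fields and coarsen dates in place."""
    data = doc.get("data", {})
    for field in REMOVED_FIELDS:
        data.pop(field, None)
    # Reduce the commit dates to yyyy-mm.
    for field in ("AuthorDate", "CommitDate"):
        if field in data:
            parsed = datetime.strptime(data[field], GIT_DATE_FORMAT)
            data[field] = parsed.strftime("%Y-%m")
    # Assumes an ISO 8601 string for metadata__updated_on.
    if "metadata__updated_on" in doc:
        parsed = datetime.fromisoformat(doc["metadata__updated_on"])
        doc["metadata__updated_on"] = parsed.strftime("%Y-%m")


doc = {
    "data": {
        "Author": "Jane Doe <jane@example.com>",
        "AuthorDate": "Tue Aug 14 14:30:13 2012 -0300",
        "Commit": "Jane Doe <jane@example.com>",
        "CommitDate": "Tue Aug 14 14:35:02 2012 -0300",
        "commit": "abc123",
        "message": "Fix typo",
        "parents": [],
        "files": [],
    },
    "metadata__updated_on": "2020-01-15T10:20:30+00:00",
}
anonymize_commit(doc)
```

As in the documented method, the sketch modifies the document in place, so the original values cannot be recovered afterwards.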

filter_after_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)

Only returns documents from after a certain date

Parameters
  • field (DateEnum) – The field to filter on.

  • dat (str | datetime) – The date/string representation for the date to compare to.

  • inclusive (bool, optional) – Whether to include commits that fall exactly on the date. Defaults to True.

See also

SearchInterface.filter_date, DateEnum

filter_before_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)

Only returns documents from before a certain date

Parameters
  • field (DateEnum) – The field to filter on.

  • dat (str | datetime) – The date/string representation for the date to compare to.

  • inclusive (bool, optional) – Whether to include commits that fall exactly on the date. Defaults to True.

See also

SearchInterface.filter_date, DateEnum
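To make the effect of `inclusive` concrete, the two date filters can be thought of as producing an Elasticsearch range clause along these lines. The helper `date_range_clause` is hypothetical, and the sketch assumes that string inputs are already in yyyy-MM-dd form; how ElasticWrap actually builds the clause is not shown in this reference.

```python
from datetime import datetime
from typing import Any, Dict, Union

CANONICAL_DATE_FORMAT = "%Y-%m-%d"        # Python-side format
CANONICAL_DATE_ES_FORMAT = "yyyy-MM-dd"   # Elasticsearch-side format


def date_range_clause(
    field: str,
    dat: Union[str, datetime],
    after: bool,
    inclusive: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: build a range clause for one date bound."""
    if isinstance(dat, datetime):
        dat = dat.strftime(CANONICAL_DATE_FORMAT)
    if after:
        op = "gte" if inclusive else "gt"
    else:
        op = "lte" if inclusive else "lt"
    return {"range": {field: {op: dat, "format": CANONICAL_DATE_ES_FORMAT}}}


after_clause = date_range_clause("data.AuthorDate", datetime(2020, 1, 1), after=True)
before_clause = date_range_clause("data.CommitDate", "2021-06-30", after=False, inclusive=False)
```

With `inclusive=True` the bound itself matches (`gte` / `lte`); with `inclusive=False` it does not (`gt` / `lt`).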

filter_on_hash(commits: Union[Set[str], List[str]])

Add a filter which only allows certain commits identified by their hash.

Parameters

commits (set[str] | list[str]) – The list of hashes to search on.

Commit Types

Types defined for the commit client

elasticwrap.search_clients.commits.CommitDocument

alias of Document[PercevalSourceDocument[CommitSourceData]]

elasticwrap.search_clients.commits.CommitSourceDocument

alias of PercevalSourceDocument[CommitSourceData]

class elasticwrap.search_clients.commits.CommitSourceData(_typename, _fields=None, /, **kwargs)

Bases: dict

The data of a commit as returned by Perceval.

Author: str
AuthorDate: str
Commit: str
CommitDate: str
commit: typing_extensions.Required[str]
files: typing_extensions.Required[List[CommitFileInfo]]
message: str
parents: List[str]
refs: List[str]
class elasticwrap.search_clients.commits.CommitFileInfo(_typename, _fields=None, /, **kwargs)

Bases: dict

Info for a commit as it relates to a file

action: str
added: str
file: typing_extensions.Required[str]
indexes: List[str]
modes: typing_extensions.Required[List[Literal['100644', '100755', '120000', '160000', '040000', '030000']]]
removed: str

GenericSearchClient

class elasticwrap.search_clients.search.GenericSearchClient(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')

Bases: SearchInterface[Document[Mapping[str, Any]]]

A generic search client for arbitrary documents. Can be used to quickly access the generic search features without knowledge of the document structure.

anonymize_data(_)

Does nothing.

UnstructuredSearchClient

class elasticwrap.search_clients.search.UnstructuredSearchClient(*args, **kwargs)

Bases: GenericSearchClient

A generic search client which does not think in terms of “projects” and “namespaces”

Instead of providing a list of project names, you can directly provide a list of indices.

Intended for cases where you simply want to use ElasticWrap without any of the PAW assumptions.

SearchInterface

class elasticwrap.search_clients.search.SearchInterface(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')

Bases: Generic[AnyDocument], NamespacedClient

A Client that is focused on searching. Should be implemented by more specific classes which know the structure of the data they are searching through and what the user is likely to want to search for.

In general, a user can construct their search using methods found in this interface + the implementing classes. Once they are done constructing the search, they can retrieve the data.

Implements a number of default methods which are useful for all search clients.

canonical_date_es_format

The date format sent to ES, which uses a different format syntax from Python. Set to yyyy-MM-dd.

Type

str

canonical_date_format

The format that dates should be presented in, set to %Y-%m-%d

Type

str

logger

Every search client should have a logger. Ideally, one with a unique name is used, but if none is provided, SearchInterface comes with a default logger.

Type

Logger

namespace

The namespace for a particular search client. This namespace defines on what indices the client will look by default (for example, “git” for the CommitClient).

Indices are intended to follow the naming scheme {projectname}_{namespace}.
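As an illustration of this naming scheme, a list of projects and a namespace map to index names as sketched below. The helper `index_names` and the project names are made up for the example; they are not part of ElasticWrap.

```python
from typing import List


def index_names(projects: List[str], namespace: str) -> List[str]:
    """Hypothetical helper: apply the {projectname}_{namespace} scheme."""
    return [f"{project}_{namespace}" for project in projects]


indices = index_names(["alpha", "beta"], "git")
```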

Type

str

query

A dictionary that holds all the arguments that will be sent to elasticsearch. Usually, an end user should have no reason to see or interact with this dictionary.

Type

Dict[str, Any]

See also

CommitClient

Search Client focused on Git commits

Notes

The SearchInterface keeps track of the searches done with it and does not automatically clear its arguments. This means that subsequent searches reuse the same constraints and will always yield the same results. If you want to reset the search, use the reset() method.

SearchInterface is defined as generic for a specific document structure. Implementations are encouraged to define a TypedDict representation of the structure of the documents they deal with and to specify this representation on inheritance.
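The persistent-state behaviour described in the first note can be pictured with a small stand-in class. `StatefulSearch` is purely hypothetical and only mimics the idea that constraints accumulate until they are reset; it is not the SearchInterface implementation, and the `match_all` default after a reset is an assumption.

```python
from typing import Any, Dict, List


class StatefulSearch:
    """Hypothetical stand-in that keeps its filters between searches."""

    def __init__(self) -> None:
        self.filters: List[Dict[str, Any]] = []

    def add_filter(self, term: str, values: List[Any]) -> None:
        self.filters.append({"terms": {term: values}})

    def reset(self) -> None:
        # Drop every accumulated constraint.
        self.filters = []

    def build(self) -> Dict[str, Any]:
        # With no filters, fall back to matching every document.
        if not self.filters:
            return {"query": {"match_all": {}}}
        return {"query": {"bool": {"filter": list(self.filters)}}}


search = StatefulSearch()
search.add_filter("data.commit", ["abc123"])
first = search.build()
second = search.build()   # same constraints, same query
search.reset()
after_reset = search.build()
```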

add_accepted_fields(fields: Set[Union[str, Dict]]) -> None

Used to filter out what fields to return when searching.

This does not filter out documents, it specifically filters out the fields in the returned documents. If you want to filter out documents, use add_filters.

By default, all fields are returned. As soon as fields are added using this method, only the accepted fields are returned.

Parameters

fields (set[str | Dict]) – The fields to return when searching. If the field you want to add is nested deeper, a dict must be provided.

See also

add_filters

add_filters(filters: List[Tuple[str, Union[Sequence[Union[str, int, float, datetime, bool]], AbstractSet[Union[str, int, float, datetime, bool]]]]], negate=False) -> None

Adds a filter to the search. Any document that is returned by the search must adhere to this filter.

Parameters
  • filters (List[Tuple[str, ElasticPrimitiveList]]) – A list of filters to apply. Each filter takes the form (term, match), meaning that the value of term must be found in the match list.

  • negate (bool) – If True, will instead ensure that the values of term are NOT found in the match list. Defaults to False
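In Elasticsearch terms, a (term, match) filter corresponds naturally to a `terms` clause, and negation to a `must_not` occurrence. The sketch below shows that mapping under the assumption that this is roughly what the client builds; the helper `terms_filters` is hypothetical and the field name is only an example.

```python
from typing import Any, Dict, Iterable, List, Tuple


def terms_filters(
    filters: List[Tuple[str, Iterable[Any]]],
    negate: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: turn (term, match) pairs into a bool query."""
    clauses = [{"terms": {term: list(match)}} for term, match in filters]
    occurrence = "must_not" if negate else "filter"
    return {"bool": {occurrence: clauses}}


keep = terms_filters([("data.Author", ["alice", "bob"])])
drop = terms_filters([("data.Author", ["alice", "bob"])], negate=True)
```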

add_range_filter(term: str, rang: Dict[str, str]) -> None

Adds a filter to the search. Any document which does not fall into a certain range will be removed from the search.

Parameters
  • term (str) – The term to compare with

  • rang (Dict[str, str]) – A dictionary representing the range object, as defined in the ES docs [1].

References

[1] Elastic co. "Range Query"

abstract anonymize_data(data: SourceDocument) -> None

Stub, should be specified per search client depending on the structure of the data. This method will make sure that the provided document (in the form of a dict) is stripped of all relevant PII. This happens in-place and is irreversible.

Parameters

data (SourceDocument) – The document to anonymize

Raises

NotImplementedError – Will raise this error by default if the search client has not implemented this method.

del_accepted_fields(fields: Set[Union[str, Dict]]) -> None

Used to remove a set of fields from the set of accepted fields.

This is only used to remove fields that were added to the accepted list using add_accepted_fields. Elasticsearch does not allow you to remove some fields from the results and keep the rest.

Parameters

fields (set[str | Dict]) – The fields to no longer return when searching.

del_filters(filters: List[str]) -> None

Remove all filters for a specific term.

Parameters

filters (list[str]) – The terms to delete the filters from.

See also

add_filters

del_range_filter(term: str) -> None

Remove all range filters for a specific term.

Parameters

term (str) – The term to delete the filter from

See also

add_range_filter

del_script_filter()

Removes the filter script, if it exists

filter_date(term: str, dates: List[Tuple[LogicalComp, Union[str, datetime]]])

Removes documents for which the value of some term is outside some date range.

Parameters
  • term (str) – The term to filter on

  • dates (list[tuple[LogicalComp, str | datetime]]) – A list of tuples where the first value is the comparison and the second value is the value to compare to.

Raises
  • ParserError – If dateutil was unable to parse the provided date string.

  • OverflowError – If the provided date string would exceed C MAXINT (impressive).

Notes

If any of the provided dates in a tuple is a string, an attempt is made to parse it using the dateutil Python library.

get_all(anonymize: bool = False, **kwargs) -> Iterator[List[AnyDocument]]

Gets all documents for the associated namespace in the PAW, ignoring all the currently set filters.

Parameters
  • anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false.

  • **kwargs – Additional arguments to send to the search method

Returns

A lazily paginated list of all the documents

Return type

Iterator[List[AnyDocument]]

See also

search

reset() -> Dict[str, Any]

Resets the current search to simply getting all documents and sorting by whatever ES decides is correct.

Returns

The query dict for inspection. Automatically sets self.query as well.

Return type

Dict[str, Any]

search(proj: List[str] = [], anonymize: bool = False, **kwargs) -> Iterator[List[AnyDocument]]

Start searching with the constructed query.

Parameters
  • proj (list[str], optional) – The list of projects to search through

  • anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false

  • **kwargs – Additional keyword arguments to pass along to the vanilla elasticsearch search.

Returns

A lazily paginated list of found documents. By default, the search runs in chunks of 1000 documents: each call to next searches for and returns at most another 1000 documents.

Return type

Iterator[List[AnyDocument]]
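The shape of the return value, an iterator that yields lists of at most 1000 documents, can be illustrated with an in-memory stand-in. The generator `paginate` below is hypothetical and slices a local list; the real client fetches each chunk from Elasticsearch only when it is requested.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def paginate(hits: Sequence[T], chunk_size: int = 1000) -> Iterator[List[T]]:
    """Hypothetical stand-in: lazily yield chunks of at most chunk_size items."""
    for start in range(0, len(hits), chunk_size):
        yield list(hits[start:start + chunk_size])


pages = paginate(list(range(2500)))
first_page = next(pages)                 # only the first chunk is produced here
remaining_sizes = [len(page) for page in pages]
```

Consuming the result of search works the same way: iterate over the pages, then over the documents inside each page.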

Notes

Just like with the vanilla elasticsearch search, a query dict can be passed as a keyword argument. This dict can have different search constraints from those constructed by the search client. Before the search, both sets of search constraints are merged together. If both the provided dict and the constructed dict try to add conflicting constraints, the constraints from the provided dict are used.

This also means that the search construction can be circumvented entirely and the interface search can be directly used with a query object. This could be useful if you want to make use of the automatic pagination provided by this interface, but would prefer to construct the query yourself.

It is possible that the query you pass along manually conflicts in a more fundamental way with the constructed query (for example, combining match_all and a bool filter causes issues). Therefore it is recommended to only combine the two sources if you know what you are doing.
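The precedence rule described in these notes, where the provided dict wins on conflict, can be sketched with a simple top-level merge. This is an assumption made for illustration: the helper `merge_queries` is hypothetical, and the real merge may well combine nested constraints rather than only top-level keys.

```python
from typing import Any, Dict


def merge_queries(constructed: Dict[str, Any], provided: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: top-level merge where the provided dict takes precedence."""
    merged = dict(constructed)
    merged.update(provided)
    return merged


constructed = {
    "query": {"bool": {"filter": [{"terms": {"data.commit": ["abc123"]}}]}},
    "size": 1000,
}
provided = {"size": 100, "sort": ["_doc"]}
merged = merge_queries(constructed, provided)
```

Here the constructed filter is kept, while the conflicting `size` is taken from the provided dict.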

See also

elasticwrap.helpers.search_pit, elasticsearch.search

set_script_filter(script: str)

Sets a script for filtering.

Notes

This should really only be used for comparing two fields in the same document, i.e. doc.val1 != doc.val2. Every other case can be handled better in other ways.

For more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-query.html
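For orientation, an Elasticsearch script query that compares two fields of the same document has roughly the following shape. The helper `script_filter_clause` and the field names `val1` and `val2` are placeholders; how ElasticWrap embeds the script string in its own query is not shown in this reference.

```python
from typing import Any, Dict


def script_filter_clause(source: str) -> Dict[str, Any]:
    """Hypothetical helper: wrap a Painless script in a script query clause."""
    return {"script": {"script": {"source": source, "lang": "painless"}}}


# Keep only documents whose two (placeholder) fields differ.
clause = script_filter_clause("doc['val1'].value != doc['val2'].value")
```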