Search Clients
To search the PAW, you will likely want to use a dedicated search client. See Using ElasticWrap for more information.
CommitClient
- class elasticwrap.search_clients.commits.CommitClient(client: Elasticsearch)
Bases: SearchInterface[Document[PercevalSourceDocument[CommitSourceData]]]
A SearchInterface implementation focused on git commits.
The namespace for this client is “_git”; the log name is “Elasticwrap_Commits”.
- stored_date_format
The date format used for most date fields in commit documents
- Type
str
- updated_on_format
The date format used specifically for the updated_on fields.
- Type
str
See also
SearchInterface
The interface implemented by this client
- class DateEnum(value)
Bases:
Enum
An Enum used for type checking purposes when filtering by date.
This Enum holds values for all the possible date fields.
- AUTHOR_DATE = 'data.AuthorDate'
- COMMIT_DATE = 'data.CommitDate'
- META_UPDATED_ON = 'metadata__updated_on'
- anonymize_data(data: PercevalSourceDocument[CommitSourceData])
Implements the anonymize_data method. In the context of git commits, this means removing the following fields:
Author
Commit
Signed-off-by
message
commit
parents
And reformatting the following fields to only yyyy-mm:
AuthorDate
CommitDate
metadata__updated_on
updated_on (more specifically, this field is only kept if metadata__updated_on exists, and it will be constrained to the correct Unix timestamp)
- Parameters
data (CommitSourceDocument) – The data to anonymize.
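The rules above can be illustrated with a minimal sketch. This is not the CommitClient implementation: it assumes the listed commit fields live under a nested "data" key, that dates are ISO 8601 strings (the real stored_date_format may differ), and that "the correct Unix timestamp" means the start of the retained month in UTC. The function names are hypothetical.

```python
from datetime import datetime, timezone

# Fields removed from the nested "data" object, as listed above.
REMOVED_FIELDS = ("Author", "Commit", "Signed-off-by", "message", "commit", "parents")
# Date fields inside "data" that are truncated to year and month.
DATA_DATE_FIELDS = ("AuthorDate", "CommitDate")


def to_year_month(value: str) -> str:
    """Truncate an ISO 8601 date string to yyyy-mm (assumed input format)."""
    return datetime.fromisoformat(value).strftime("%Y-%m")


def anonymize_commit(doc: dict) -> None:
    """Strip identifying fields from a commit document, in place."""
    data = doc.get("data", {})
    for field in REMOVED_FIELDS:
        data.pop(field, None)
    for field in DATA_DATE_FIELDS:
        if field in data:
            data[field] = to_year_month(data[field])
    if "metadata__updated_on" in doc:
        month = to_year_month(doc["metadata__updated_on"])
        doc["metadata__updated_on"] = month
        # Assumption: updated_on becomes the timestamp of the month start (UTC).
        start = datetime.strptime(month, "%Y-%m").replace(tzinfo=timezone.utc)
        doc["updated_on"] = start.timestamp()
    else:
        # Without metadata__updated_on, updated_on is dropped entirely.
        doc.pop("updated_on", None)


example = {
    "data": {
        "Author": "Jane Doe <jane@example.org>",
        "AuthorDate": "2021-03-14T09:26:53+00:00",
        "commit": "abc123",
        "files": [],
    },
    "metadata__updated_on": "2021-03-15T10:00:00+00:00",
    "updated_on": 1615802400.0,
}
anonymize_commit(example)
```

After the call, only the month-level dates and the non-identifying fields (here, files) remain in the example document.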
- filter_after_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)
Only returns documents from after a certain date
- Parameters
field (DateEnum) – The field to filter on.
dat (str | datetime) – The date/string representation for the date to compare to.
inclusive (bool, optional) – Whether to include commits that fall exactly on the date, or not. Defaults to true
See also
SearchInterface.filter_date, DateEnum
- filter_before_date(field: DateEnum, dat: Union[str, datetime], inclusive=True)
Only returns documents from before a certain date
- Parameters
field (DateEnum) – The field to filter on.
dat (str | datetime) – The date/string representation for the date to compare to.
inclusive (bool, optional) – Whether to include commits that fall exactly on the date, or not. Defaults to true
See also
SearchInterface.filter_date, DateEnum
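As a rough illustration of the inclusive flag on filter_after_date and filter_before_date, the sketch below maps it onto the comparison operators of an Elasticsearch range query (gt, gte, lt, lte). Whether CommitClient builds its filters exactly this way is an assumption, and the helper names are hypothetical.

```python
def after_range(date: str, inclusive: bool = True) -> dict:
    """Range body for "after": gte includes the date itself, gt excludes it."""
    return {"gte" if inclusive else "gt": date}


def before_range(date: str, inclusive: bool = True) -> dict:
    """Range body for "before": lte includes the date itself, lt excludes it."""
    return {"lte" if inclusive else "lt": date}


# Commits authored on or after 2021-03-01, using the AUTHOR_DATE field name.
query = {"range": {"data.AuthorDate": after_range("2021-03-01")}}
```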
- filter_on_hash(commits: Union[Set[str], List[str]])
Add a filter which only allows certain commits identified by their hash.
- Parameters
commits (set[str] | list[str]) – The list of hashes to search on.
Commit Types
Types defined for the commit client
- elasticwrap.search_clients.commits.CommitDocument
alias of Document[PercevalSourceDocument[CommitSourceData]]
- elasticwrap.search_clients.commits.CommitSourceDocument
alias of PercevalSourceDocument[CommitSourceData]
- class elasticwrap.search_clients.commits.CommitSourceData(_typename, _fields=None, /, **kwargs)
Bases:
dict
The data of a commit as returned by Perceval.
- Author: str
- AuthorDate: str
- Commit: str
- CommitDate: str
- commit: typing_extensions.Required[str]
- files: typing_extensions.Required[List[CommitFileInfo]]
- message: str
- parents: List[str]
- refs: List[str]
- class elasticwrap.search_clients.commits.CommitFileInfo(_typename, _fields=None, /, **kwargs)
Bases:
dict
Info for a commit as it relates to a file
- action: str
- added: str
- file: typing_extensions.Required[str]
- indexes: List[str]
- modes: typing_extensions.Required[List[Literal['100644', '100755', '120000', '160000', '040000', '030000']]]
- removed: str
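For readers unfamiliar with typed dictionaries that mix required and optional keys, the sketch below mirrors the field lists above using only the standard library. The class names are hypothetical stand-ins, not the ElasticWrap types, and the base-class split is just one way to express what typing_extensions.Required expresses in the real definitions.

```python
from typing import List, Literal, TypedDict

FileMode = Literal["100644", "100755", "120000", "160000", "040000", "030000"]


class _FileInfoRequired(TypedDict):
    # Keys marked Required in the listing above.
    file: str
    modes: List[FileMode]


class FileInfoSketch(_FileInfoRequired, total=False):
    # All remaining keys are optional.
    action: str
    added: str
    indexes: List[str]
    removed: str


class _CommitRequired(TypedDict):
    commit: str
    files: List[FileInfoSketch]


class CommitDataSketch(_CommitRequired, total=False):
    Author: str
    AuthorDate: str
    Commit: str
    CommitDate: str
    message: str
    parents: List[str]
    refs: List[str]


# A document carrying only the required keys is still well-typed.
minimal: CommitDataSketch = {
    "commit": "abc123",
    "files": [{"file": "README.md", "modes": ["100644", "100644"]}],
}
```

At runtime these are plain dicts; the annotations only help a static type checker flag missing required keys or misspelled field names.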
GenericSearchClient
- class elasticwrap.search_clients.search.GenericSearchClient(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')
Bases: SearchInterface[Document[Mapping[str, Any]]]
A generic search client for arbitrary documents. Can be used to quickly use the generic search features without having knowledge about the document structure.
- anonymize_data(_)
Does nothing.
UnstructuredSearchClient
- class elasticwrap.search_clients.search.UnstructuredSearchClient(*args, **kwargs)
Bases:
GenericSearchClient
A generic search client which does not think in terms of “projects” and “namespaces”
Instead of providing a list of project names, you can directly provide a list of indices.
Intended when you simply want to use ElasticWrap without any of the PAW assumptions.
SearchInterface
- class elasticwrap.search_clients.search.SearchInterface(client: Elasticsearch, name: str, log_name: str = 'ElasticWrap_Search')
Bases: Generic[AnyDocument], NamespacedClient
A Client that is focused on searching. Should be implemented by more specific classes which know the structure of the data they are searching through and what the user is likely to want to search for.
In general, a user can construct their search using the methods found in this interface and in the implementing classes. Once they are done constructing the search, they can retrieve the data.
Implements a number of default methods which are useful for all search clients.
- canonical_date_es_format
The date format sent to ES, which uses a different format syntax than Python. Set to yyyy-MM-dd
- Type
str
- canonical_date_format
The format that dates should be presented in, set to %Y-%m-%d
- Type
str
- logger
Every search client should have a logger. Ideally, one with a unique name is used, but if none is provided, SearchInterface comes with a default logger.
- Type
Logger
- namespace
The namespace for a particular search client. This namespace defines on what indices the client will look by default (for example, “git” for the CommitClient).
Indices are intended to follow the naming scheme {projectname}_{namespace}.
- Type
str
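The naming scheme can be sketched as a one-line helper. The function and the project names are hypothetical; how the client resolves indices internally is not shown in this reference.

```python
from typing import List


def index_names(projects: List[str], namespace: str) -> List[str]:
    """Build index names following the {projectname}_{namespace} scheme."""
    return [f"{project}_{namespace}" for project in projects]


indices = index_names(["projecta", "projectb"], "git")
```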
- query
A dictionary that holds all the arguments that will be sent to elasticsearch. Usually, an end user should have no reason to see or interact with this dictionary.
- Type
Dict[str, Any]
See also
CommitClient
Search Client focused on Git commits
Notes
The SearchInterface keeps track of the searches constructed with it and does not automatically clear its arguments. This means that subsequent searches will always yield the same results. To reset the search, use the reset() method.
SearchInterface is defined as generic for a specific document structure. Implementations are encouraged to define a TypedDict representation of the structure of the documents they deal with and to specify this representation on inheritance.
- add_accepted_fields(fields: Set[Union[str, Dict]]) → None
Used to filter out what fields to return when searching.
This does not filter out documents, it specifically filters out the fields in the returned documents. If you want to filter out documents, use add_filters.
By default all fields are returned, as soon as fields are added using this method, only the accepted fields will be returned.
- Parameters
fields (set[str | Dict]) – The fields to return when searching. If the field you want to add is nested deeper, a dict must be provided.
- add_filters(filters: List[Tuple[str, Union[Sequence[Union[str, int, float, datetime, bool]], AbstractSet[Union[str, int, float, datetime, bool]]]]], negate=False) → None
Adds a filter to the search. Any document that is returned by the search must adhere to this filter.
- Parameters
filters (List[Tuple[str, ElasticPrimitiveList]]) – A list of filters to apply. Each filter takes the format (term, match), meaning that the value of term must be found in the match list.
negate (bool) – If True, will instead ensure that the values of term are NOT found in the match list. Defaults to False.
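To make the (term, match) format concrete, here is a sketch of how such filters could translate into an Elasticsearch bool query, with regular filters as terms clauses under filter and negated ones under must_not. This translation is an assumption about the client's internals, and the helper and the field names in the example are illustrative only.

```python
from typing import Any, Dict, List, Sequence, Tuple

Filter = Tuple[str, Sequence[Any]]


def build_bool_query(filters: List[Filter], negated: List[Filter]) -> Dict[str, Any]:
    """Turn (term, match) pairs into terms clauses of a bool query."""
    return {
        "bool": {
            # A document must match every clause listed under "filter" ...
            "filter": [{"terms": {term: list(match)}} for term, match in filters],
            # ... and must match none of the clauses under "must_not".
            "must_not": [{"terms": {term: list(match)}} for term, match in negated],
        }
    }


bool_query = build_bool_query(
    filters=[("data.commit", ["abc123", "def456"])],
    negated=[("data.refs", ["refs/heads/experimental"])],
)
```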
- add_range_filter(term: str, rang: Dict[str, str]) → None
Adds a filter to the search. Any document which does not fall into a certain range will be removed from the search.
- Parameters
term (str) – The term to compare with
rang (Dict[str, str]) – A dictionary representing the range object, as defined in the ES docs [1].
[1] Elastic co. "Range Query"
- abstract anonymize_data(data: SourceDocument) → None
Stub, should be specified per search client depending on the structure of the data. This method will make sure that the provided document (in the form of a dict) is stripped of all relevant PII. This happens in-place and is irreversible.
- Parameters
data (SourceDocument) – The document to anonymize
- Raises
NotImplementedError – Will raise this error by default if the search client has not implemented this method.
- del_accepted_fields(fields: Set[Union[str, Dict]]) → None
Used to remove a set of fields from the set of accepted fields.
This is only used to remove fields that were added to the accepted list using add_accepted_fields. Elasticsearch does not allow you to remove some fields from the results and keep the rest.
- Parameters
fields (set[str | Dict]) – The fields to no longer return when searching.
- del_filters(filters: List[str]) → None
Remove all filters for a specific term.
- Parameters
filters (list[str]) – The terms to delete the filters from.
- del_range_filter(term: str) → None
Remove all range filters for a specific term.
- Parameters
term (str) – The term to delete the filter from
- del_script_filter()
Removes the filter script, if it exists
- filter_date(term: str, dates: List[Tuple[LogicalComp, Union[str, datetime]]])
Removes documents for which the value of some term is outside some date range.
- Parameters
term (str) – The term to filter on
dates (list[tuple[LogicalComp, str | datetime]]) – A list of tuples where the first value is the comparison and the second value is the value to compare to.
- Raises
ParserError – If dateutil was unable to parse the provided date string.
OverflowError – If the provided date string would exceed C MAXINT (impressive).
Notes
If a date in a tuple is provided as a string, it is parsed using the dateutil Python library.
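The sketch below shows one way a list of (comparison, date) tuples could be collapsed into a single range object. The Comp enum is a hypothetical stand-in for LogicalComp, whose members are not listed in this reference, and datetime.fromisoformat stands in for the dateutil parser so that the example needs only the standard library. The two format strings are the values documented for canonical_date_format and canonical_date_es_format.

```python
from datetime import datetime
from enum import Enum
from typing import Dict, List, Tuple, Union

CANONICAL_DATE_FORMAT = "%Y-%m-%d"      # Python-side format
CANONICAL_DATE_ES_FORMAT = "yyyy-MM-dd"  # Elasticsearch-side format


class Comp(Enum):
    """Hypothetical stand-in for LogicalComp; values are ES range operators."""
    GT = "gt"
    GTE = "gte"
    LT = "lt"
    LTE = "lte"


def normalize(date: Union[str, datetime]) -> str:
    """Bring a string or datetime into the canonical date format."""
    if isinstance(date, str):
        # Stand-in for dateutil parsing; accepts ISO 8601 strings only.
        date = datetime.fromisoformat(date)
    return date.strftime(CANONICAL_DATE_FORMAT)


def date_range(dates: List[Tuple[Comp, Union[str, datetime]]]) -> Dict[str, str]:
    """Collapse (comparison, date) tuples into one range body."""
    return {comp.value: normalize(date) for comp, date in dates}


rng = date_range([(Comp.GTE, "2021-01-01"), (Comp.LT, datetime(2021, 7, 1))])
range_query = {
    "range": {"data.AuthorDate": {**rng, "format": CANONICAL_DATE_ES_FORMAT}}
}
```

Passing the format key alongside the bounds tells Elasticsearch how to interpret the date strings, which is why the two canonical formats need to describe the same layout.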
- get_all(anonymize: bool = False, **kwargs) → Iterator[List[AnyDocument]]
Gets all documents for the associated namespace in the PAW, ignoring all the currently set filters.
- Parameters
anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false.
**kwargs – Additional arguments to send to the search method
- Returns
A lazily paginated list of all the documents
- Return type
Iterator[List[AnyDocument]]
- reset() → Dict[str, Any]
Resets the current search to simply getting all documents and sorting by whatever ES decides is correct.
- Returns
The query dict for inspection. Automatically sets self.query as well.
- Return type
Dict[str, Any]
- search(proj: List[str] = [], anonymize: bool = False, **kwargs) → Iterator[List[AnyDocument]]
Start searching with the constructed query.
- Parameters
proj (list[str], optional) – The list of projects to search through
anonymize (bool, optional) – Whether to anonymize the returned data, defaults to false
**kwargs – Additional keyword arguments to pass along to the vanilla elasticsearch search.
- Returns
A lazily paginated list of found documents. By default, results are fetched in chunks of 1000 documents: each call to next searches for and returns at most 1000 more documents.
- Return type
Iterator[List[AnyDocument]]
Notes
Just like with the vanilla elasticsearch search, a query dict can be passed as a keyword argument. This dict can have different search constraints from those constructed by the search client. Before the search, both sets of search constraints are merged together. If both the provided dict and the constructed dict try to add conflicting constraints, the constraints from the provided dict are used.
This also means that the search construction can be circumvented entirely and the interface search can be directly used with a query object. This could be useful if you want to make use of the automatic pagination provided by this interface, but would prefer to construct the query yourself.
It is possible that the query you pass along manually conflicts in a more fundamental way with the constructed query (for example, combining match_all and a bool filter causes issues). Therefore it is recommended to only combine the two sources if you know what you are doing.
See also
elasticwrap.helpers.search_pit, elasticsearch.search
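Two behaviours described above, lazy pagination and the merging of a provided query with the constructed one, can be sketched without a running cluster. Both helpers are hypothetical: the real search pages through Elasticsearch results, and whether its merge is recursive, as assumed here, is not stated in this reference. The clause strings are placeholders rather than real query clauses.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def paginate(docs: Iterable[Any], size: int = 1000) -> Iterator[List[Any]]:
    """Lazily yield lists of at most `size` documents."""
    iterator = iter(docs)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


def merge_queries(constructed: Dict[str, Any], provided: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge two dicts; on conflict the provided value wins."""
    merged = dict(constructed)
    for key, value in provided.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_queries(merged[key], value)
        else:
            merged[key] = value
    return merged


# Each iteration step produces one page; nothing is fetched ahead of time.
page_sizes = [len(page) for page in paginate(range(2500))]

merged = merge_queries(
    {"bool": {"filter": ["constructed filter"], "must_not": ["constructed exclusion"]}},
    {"bool": {"filter": ["provided filter"]}},
)
```

In this sketch the conflicting filter entry is taken from the provided dict, while the non-conflicting must_not entry from the constructed dict survives the merge.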
- set_script_filter(script: str)
Sets a script for filtering.
Notes
This should really only be used for comparing two fields in the same document, i.e. doc.val1 != doc.val2. Every other case can be handled better in other ways.
For more information, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-query.html
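For orientation, the dict below shows the general shape of an Elasticsearch script query placed in a filter context, comparing two fields of the same document. The field names val1 and val2 are placeholders, the Painless source uses the doc['field'].value access form, and how set_script_filter embeds the script string into the final query is an assumption.

```python
# Painless script comparing two fields of the same document.
script_source = "doc['val1'].value != doc['val2'].value"

script_query = {
    "bool": {
        "filter": {
            "script": {
                "script": {
                    "lang": "painless",
                    "source": script_source,
                }
            }
        }
    }
}
```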