Adding searchers for custom data ElasticSearch patterns
Defining a domain-specific search client for ElasticWrap is relatively simple: inherit from the generic SearchInterface and specify two elements:
- The document structure
- How to anonymize the data, by implementing anonymize_data
Defining the Data
For the purposes of type checking and to make it easier to know what data you are working with, it is recommended to define the data structure of a document as a TypedDict.
It is important to note here that these data structures are not checked at runtime. This was a decision made to ensure that users can freely modify the data locally without having to worry about being constrained by a specific structure. This does introduce an implicit assumption that every document in an index has the exact same structure. If a document with a wholly different structure is entered, it may cause runtime errors in the future (though simply leaving out some keys/adding a few others should not be an issue.)
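A minimal sketch of what "not checked at runtime" means in practice. The class IssueData and its fields are hypothetical names used only for illustration; they are not part of ElasticWrap.

```python
from typing import TypedDict


class IssueData(TypedDict):
    """Hypothetical document structure, for illustration only."""
    title: str
    comment_count: int


# A static type checker would report both the wrong value type and the
# extra key, but Python itself accepts the dictionary without complaint.
doc: IssueData = {  # type: ignore
    "title": "Crash on start",
    "comment_count": "three",
    "extra": True,
}

# At runtime, a TypedDict instance is an ordinary dict.
is_plain_dict = type(doc) is dict
```

The mismatch only surfaces later, when some code relies on comment_count being an integer, which is the kind of delayed runtime error described above.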
Common Data Structures
First, it is important to understand some common data structures found when using Perceval+ElasticSearch. These common data structures are already implemented in ElasticWrap and are the most likely candidates for extension.
ElasticSearch
At the top level, we have the generic structure for any document as returned by ElasticSearch, which looks as follows:
{
    _type: str,
    _score: float,
    _id: str,
    _index: str,
    _source: SourceDocument
}
This structure is defined by the existing Document TypedDict. A variant, PostDocument, defines the structure for sending data to ES (which is exactly the same, except without the _type and _score fields and with _id being optional).
Most of these fields are simply metadata; the most interesting one is _source, which contains the actual source data of the returned document. In ElasticWrap, this field is defined to be of the type SourceDocument | Dict[str, Any]. SourceDocument is simply an empty TypedDict, and if your data originates from anywhere other than Perceval, it is the class you want to extend (the Dict[str, Any] definition is for unknown/unstructured data).
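To make the two shapes concrete, here is a small sketch using plain dictionaries. The sample values and the helper to_post_shape are invented for illustration and are not part of ElasticWrap; the helper only mirrors the description above, in which a PostDocument is a Document without _type and _score.

```python
from typing import Any, Dict

# A made-up search hit following the structure above; all values are placeholders.
hit: Dict[str, Any] = {
    "_type": "_doc",
    "_score": 1.0,
    "_id": "abc123",
    "_index": "example-index",
    "_source": {"subject": "Hello", "sender": "someone@example.org"},
}


def to_post_shape(document: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: drop the fields that are absent when sending data."""
    return {
        key: value
        for key, value in document.items()
        if key not in ("_type", "_score")
    }


# The payload you usually care about lives under _source.
source = hit["_source"]
post = to_post_shape(hit)
```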
Perceval
Perceval retrieves data from various data sources, but it always provides metadata in the same way. The structure is as follows:
{
    backend_name: str,
    backend_version: str,
    perceval_version: str,
    timestamp: float,
    origin: str,
    uuid: str,
    updated_on: float,
    classified_fields_filtered: Optional[List[str]],
    category: str,
    search_fields: dict,
    tag: str,
    metadata__updated_on: str,
    metadata__timestamp: str,
    data: PercevalDataDocument
}
This structure is defined by the existing PercevalSourceDocument. We are mostly interested in the data field, which contains the actual data Perceval has retrieved. This field is defined to be a PercevalDataDocument, which is the class you will want to extend when dealing with Perceval data.
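As an illustration, the sketch below builds a made-up item with this shape and pulls out the parts that usually matter. Every value is a placeholder, the helpers payload_of and updated_at are hypothetical, and the conversion assumes that updated_on holds seconds since the Unix epoch.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# A made-up Perceval item following the structure above; values are placeholders.
item: Dict[str, Any] = {
    "backend_name": "ExampleBackend",
    "backend_version": "0.0.0",
    "perceval_version": "0.0.0",
    "timestamp": 1609459260.0,
    "origin": "https://example.org/repo",
    "uuid": "0000",
    "updated_on": 1609459200.0,
    "classified_fields_filtered": None,
    "category": "example",
    "search_fields": {},
    "tag": "https://example.org/repo",
    "metadata__updated_on": "2021-01-01T00:00:00+00:00",
    "metadata__timestamp": "2021-01-01T00:01:00+00:00",
    "data": {"title": "Example payload"},
}


def payload_of(perceval_item: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return the data Perceval actually retrieved."""
    return perceval_item["data"]


def updated_at(perceval_item: Dict[str, Any]) -> datetime:
    """Hypothetical helper: assumes updated_on is seconds since the Unix epoch."""
    return datetime.fromtimestamp(perceval_item["updated_on"], tz=timezone.utc)


payload = payload_of(item)
updated = updated_at(item)
```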
Actually Defining The Data
Once you have established whether the data you are getting comes from Perceval or not, it is time to implement your data structure. Simply create a new class which inherits from either PercevalDataDocument or SourceDocument and define the data as needed using the TypedDict format.
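A minimal sketch of such a class for non-Perceval data. To keep the snippet self-contained, SourceDocument is re-declared here as an empty stand-in rather than imported from ElasticWrap, and the fields of Foo are invented for illustration.

```python
from typing import List, TypedDict


class SourceDocument(TypedDict):
    """Stand-in for ElasticWrap's empty SourceDocument base (assumption)."""


class Foo(SourceDocument):
    """Hypothetical document structure; the fields are illustrative only."""
    title: str
    tags: List[str]


# Instances are written as ordinary dictionaries.
example: Foo = {"title": "First entry", "tags": ["demo"]}
```

For Perceval data the pattern would be the same, with PercevalDataDocument as the base class instead.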
Because Document, PostDocument and PercevalSourceDocument are defined as generic types, defining the final document structure is simple. Suppose we have created our new class Foo.
- If Foo is a SourceDocument, then define:
  FooSourceDocument = Foo (this is optional, but we write it this way in the manual to keep it general)
  FooDocument = Document[FooSourceDocument]
- If Foo is a PercevalDataDocument, then define:
  FooSourceDocument = PercevalSourceDocument[Foo]
  FooDocument = Document[FooSourceDocument]
Congratulations, you have now defined the data structure in a way that it can be easily reused by ElasticWrap and understood by most type checkers.
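To see what the generic composition amounts to, the sketch below spells out by hand roughly the shape a type checker would see for the non-Perceval case. It is an approximation built from the Document structure shown earlier, not ElasticWrap's own definition; FooDocumentExpanded and the single field on Foo are hypothetical.

```python
from typing import TypedDict, get_type_hints


class Foo(TypedDict):
    """Hypothetical source structure, for illustration only."""
    title: str


class FooDocumentExpanded(TypedDict):
    """Hand-written approximation of Document[Foo] (assumption)."""
    _type: str
    _score: float
    _id: str
    _index: str
    _source: Foo


# The _source field is now known to hold a Foo rather than arbitrary data.
hints = get_type_hints(FooDocumentExpanded)
source_type = hints["_source"]
```

In real code you would write Document[FooSourceDocument] instead of repeating the metadata fields, which is the point of the generic definitions.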
Creating the Search Client
Now that we have defined our FooDocument and FooSourceDocument, we can finally implement our search client. Simply create a new class as follows:
class FooClient(SearchInterface[FooDocument]):
    def __init__(self, client):
        super().__init__(client, "foo", "Elasticwrap_Foo")

    def anonymize_data(self, data: FooSourceDocument):
        ...
Congratulations! You have now created a basic search client. For the constructor, the client argument that is passed should be your main ElasticWrap object. In our example, "foo" is the namespace for this client and "Elasticwrap_Foo" is just the name for the logger. You can now add more methods as you deem necessary for this search client.
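The following sketch shows one possible way to fill in anonymize_data. It makes several assumptions: SearchInterface is replaced by a small stand-in with the same constructor arguments so that the snippet runs on its own, the author_email field is invented, and the method is assumed to return an anonymized copy. Replacing a value with its SHA-256 digest is only one option (strictly speaking it is pseudonymization); check what ElasticWrap expects from this method before relying on it.

```python
import hashlib
import logging
from typing import Any, Generic, TypedDict, TypeVar

T = TypeVar("T")


class SearchInterface(Generic[T]):
    """Stand-in for ElasticWrap's SearchInterface (assumption, not the real class)."""

    def __init__(self, client: Any, namespace: str, logger_name: str) -> None:
        self.client = client
        self.namespace = namespace
        self.logger = logging.getLogger(logger_name)


class FooSourceDocument(TypedDict):
    """Hypothetical source structure; author_email is an invented field."""
    title: str
    author_email: str


class FooDocument(TypedDict):
    """Hand-written approximation of Document[FooSourceDocument]."""
    _type: str
    _score: float
    _id: str
    _index: str
    _source: FooSourceDocument


class FooClient(SearchInterface[FooDocument]):
    def __init__(self, client: Any) -> None:
        super().__init__(client, "foo", "Elasticwrap_Foo")

    def anonymize_data(self, data: FooSourceDocument) -> FooSourceDocument:
        # Work on a copy so the caller's document is left untouched.
        anonymized = dict(data)
        digest = hashlib.sha256(data["author_email"].encode("utf-8")).hexdigest()
        anonymized["author_email"] = digest
        return anonymized  # type: ignore[return-value]


# Usage: None stands in for the main ElasticWrap object.
foo_client = FooClient(client=None)
original: FooSourceDocument = {"title": "Hello", "author_email": "someone@example.org"}
cleaned = foo_client.anonymize_data(original)
```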