Using ElasticWrap
ElasticWrap is a proprietary library created with the intention of making retrieving and submitting data into the PAW as simple as possible, even for people with no knowledge of ElasticSearch. A more technical overview of how it works and how to extend it can be found in ElasticWrap. This page will focus instead on how and why to use ElasticWrap
Why use ElasticWrap?
The main data store of the PAW is ElasticSearch, with which you may or may not have experience. ElasticSearch’s main interface is sending JSON GET/POST requests to get/post data. There do exist, however, various libraries for interacting with ElasticSearch using various languages, so why use ElasticWrap instead of these already existing libraries?
In general, we found that most libraries require a non-trivial understanding of ElasticSearch to actually use properly, acting more as a thing wrapper around sending raw JSON than an ORM helping you get the data you want. There do exist more high-level libraries (such as ElasticSearch DSL), but they are bogged down by having to work on all types of data. ElasticWrap, on the other hand, was purpose-built for dealing with the PAW, meaning it is possible for the library to know the exact structure of the documents as they are stored in the ElasticSearch data store and to optimize accordingly.
How to use ElasticWrap
What follows here is a natural-language description of basic usage of ElasticWrap. For a more comprehensive overview, see the source documentation of ElasticWrap
ElasticWrap uses liberal use of type hinting features to make it easier to write correct python code using it. For the best experience, it is recommended that you have mypy or PyRight installed.
The ElasticWrap
object
ElasticWrap is centered around a main ElasticWrap
object, which itself is a wrapper around the official ElasticSearch object. To start playing around with ElasticWrap
, you can install the library and open the interactive prompt:
>>> from elasticwrap import ElasticWrap
>>> client = ElasticWrap(hosts=["http://admin:admin@localhost:9200"])
Note here that the hosts
argument is optional. If you do not specify this argument, it defaults to http://localhost:9200
. For more advanced construction arguments, we refer to the official ES python library documentation.
Searching Data
ElasticWrap
functions with the concept of Search Clients. A search client is an object which is specifically focused on dealing with a specific structure of data. When working with Search Clients, you first construct the search using various methods provided by the client, after which you retrieve the results using the associated search
method.
As of writing, the ElasticWrap
object comes with 1 Search Client, focused on dealing with structured commit data as retrieved by Perceval. For our examples, we will be using this search client (found by default at ElasticWrap.commits
), but the ideas should carry over to other search clients as well.
commits = client.commits
commits.filter_before_date(commits.DateEnum.AUTHOR_DATE, "2018-01-01", inclusive=False)
before2018 = commits.search(["example"])
for page in before2018:
for doc in page:
...
commits.reset()
In this example, we add a filter that ensures that all resulting documents will have an AuthorDate
from before 2018-01-01. We then have a nested loop. The search
method will always paginate, meaning that you will need a nested structure like this. Important to remember is that the result of search
is a Python Generator which returns a new page every next
call. Finally, we call the reset
method. An ElasticWrap search client remembers the query, meaning that subsequent calls to search
should return the same data (assuming no new documents have been added to the data store). By calling the reset
method, we remove the query and can start over again.
Search Arguments
search
takes a number of arguments, some of which more advanced than others. For general use, the following arguments are important:
* proj
- The list of projects this search should be performed on. Every index inside ElasticSearch is assumed to follow the format {proj}_{namespace}
. In ElasticWrap every search client is assigned a namespace (such as git
for the commits client) and search
will automatically construct the indices from the list of projects provided and the associated namespace. If no list of projects is provided, search
will return data from all indices related to this namespace.
* anonymize
- If set to True
, all resulting data will be anonymized. The exact way this is done depends on the type of data. Defaults to False
* size
- The amount of documents per page to return. Defaults to 1000
* page_max
- The amount of pages to return at maximum. If set to 0
, will return as many pages as exist. Defaults to 0
.
Generic Searching
In our previous example, we used the existing CommitClient
and it’s own specific filter_before_date
method. However, it may be that you want to search data for which no Search Client exists yet, but you do want to use some of the features provided by ElasticWrap
(such as search
automatically dealing with pagination). In this case, you can use the GenericSearchClient
object.
from elasticwrap.search_clients.search import GenericSearchClient
search = GenericSearchClient(client, "example")
search.add_filters([("some_field", ["some_value"])])
results = search.search()
This example will return all the documents in the namespace example
in which some_field
is equal to some_value
. For more information, we again refer to the source documentation.
If you want to not deal with namespaces entirely, you can use the UnstructuredSearchClient
. It works in the same way as the GenericSearchClient
, but you provide it a list of indices to
search on, instead of a list of projects.
Saving Data
You can also use ElasticWrap to store arbitrary data in the PAW. As of writing, there is support for saving pandas DataFrame
-s, or arbitrary python dictionaries.
from elasticwrap.data.dictionary import DictionaryDataSource
from elasticwrap.data.dataframe import DataFrameDataSource
from pandas import DataFrame
dict_data {...}
frame_data = DataFrame(...)
client.save(DictionaryDataSource(dict_data), "example", "dict")
client.save(DataFrameDataSource(frame_data), "example", "dataframe", "false")
This example will store the dictionary data in the index example_dict
(so it is project example
and namespace dict
) and will store the dataframe data in the index example_dataframe
.
Of note here is the last argument in the second call of client.save
. By default, this is set to "wait_for"
, meaning that the method will return after ElasticSearch has confirmed the data is actually retrievable from the data store. If you set it to "false"
it will instead return instantly. You can also set it to "true"
to force ElasticSearch to refresh, but this is not recommended.
Converting Data
Just as you may want to store date in the PAW using your favorite format, you may also want to take data from the PAW and convert it to your favorite format. The same objects we used to save the data can also be used to convert it the other way around.
for page in client.commits.get_all():
converted = next(DataFrameDataSource.convert_documents(page))
...
This example will get all the commit data in the PAW and convert each page into a DataFrame object.