WIP: Deep result window pagination (getting over the ES max result window)
For the moment, pagination on every API is blocked so that requests stay under the Elasticsearch max result window (pageSize * page < 1000).
@raphael.flores @celia.michotey @cyril.pommier @jeremy.destin
In this branch, I propose a solution to implement deep pagination on BrAPI calls. We've had problems with BrAPI clients trying to harvest our big datasets, and we'd like to offer a reasonable solution without having to change the whole BrAPI.
The data discovery calls are still restricted on pagination in this branch (see the `@MaxResultWindowPagination` annotation on `DataDiscoveryCriteriaImpl.java`).
Implementation
To fetch data beyond the max result window, we can use either the scroll API or the search after API. The scroll API can only walk a result set sequentially, whereas the search after API can start from any page as long as the sort values of the previous page's last document are known. I therefore chose to implement deep pagination with the search after API.
The search after API requires the results to be sorted (here we sort on `_id` by default) and needs the sort values of the last document of the previous page.
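As a rough illustration of what such a request looks like, the sketch below builds the JSON body of a search with `size`, a sort on `_id` and a `search_after` array. `SearchAfterBody` and `build` are hypothetical names used only for this example (they are not classes of this branch), and the helper does no JSON escaping, so it should be read as a sketch rather than the way the repository builds its requests.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only: builds the JSON body of a search_after request.
class SearchAfterBody {

    // size: page size; lastSortValues: sort values of the previous page's last hit
    // (a single _id here). Values are assumed not to need JSON escaping.
    static String build(int size, List<String> lastSortValues) {
        String after = lastSortValues.stream()
                .map(v -> "\"" + v + "\"")
                .collect(Collectors.joining(", "));
        return "{"
                + "\"size\": " + size + ", "
                + "\"sort\": [{\"_id\": \"asc\"}], "
                + "\"search_after\": [" + after + "]"
                + "}";
    }

    public static void main(String[] args) {
        // Request the 100 hits that come after the document whose _id is "doc-0999".
        System.out.println(build(100, List.of("doc-0999")));
    }
}
```

Note that the body carries no `from`: the starting position is given entirely by the sort values.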
The `ESGenericFindRepository`, in charge of finding documents by criteria, now detects when a request goes over the ES max result window and will:
- Get the last sort values of the previous page
- Get the current page using ES `search_after`, filled with the last sort values of the previous page
When the requested page is under the ES max result window, the repository does a simple search with size/from pagination and does not use `search_after`.
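The routing described above can be sketched with an in-memory list standing in for the index. Everything here is an assumption made for illustration: the class and method names are hypothetical, the index is a plain sorted list of `_id` strings, and the window check uses the Elasticsearch rule `from + size <= max result window` with the limit of 1000 quoted in this description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for an index whose hits are sorted by _id.
class PaginationRoutingSketch {
    static final int MAX_RESULT_WINDOW = 1000; // limit quoted in this description

    private final List<String> sortedIds;

    PaginationRoutingSketch(List<String> sortedIds) {
        this.sortedIds = sortedIds;
    }

    // Emulates from/size pagination: skip `from` hits, return up to `size` hits.
    List<String> searchFromSize(int from, int size) {
        int start = Math.min(from, sortedIds.size());
        int end = Math.min(start + size, sortedIds.size());
        return new ArrayList<>(sortedIds.subList(start, end));
    }

    // Emulates search_after: return up to `size` hits sorted strictly after `lastId`.
    List<String> searchAfter(String lastId, int size) {
        List<String> hits = new ArrayList<>();
        for (String id : sortedIds) {
            if (hits.size() == size) break;
            if (id.compareTo(lastId) > 0) hits.add(id);
        }
        return hits;
    }

    // Under the window: plain from/size. Over it: search_after with the
    // previous page's last sort value, which the caller must supply.
    List<String> findPage(int from, int size, String previousPageLastId) {
        if (from + size <= MAX_RESULT_WINDOW) {
            return searchFromSize(from, size);
        }
        return searchAfter(previousPageLastId, size);
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(String.format("%05d", i));
        PaginationRoutingSketch repo = new PaginationRoutingSketch(ids);
        System.out.println(repo.findPage(900, 100, null).get(0));     // from/size path
        System.out.println(repo.findPage(1000, 100, "00999").get(0)); // search_after path
    }
}
```

In this sketch the caller passes the previous page's last `_id` explicitly; in the repository that value has to be looked up or computed.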
To be efficient, a cache (with a TTL of one hour) stores the last sort values of every page for every query.
The cache is structured as follows: QueryWithoutPagination(String) => Page{size,from} => LastSortValues(Object[]) (see `ESGenericFindRepository`).
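A minimal sketch of a cache with that shape is given below, using only the JDK. The names `SortValuesCache` and `PageKey`, the lazy expiry on read and the injectable clock are all assumptions made for this example; the cache in `ESGenericFindRepository` may well rely on a caching library and evict entries differently.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical cache: query (without pagination) => page {size, from} => last sort values.
class SortValuesCache {

    static final class PageKey {
        final int size;
        final int from;

        PageKey(int size, int from) {
            this.size = size;
            this.from = from;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof PageKey)) return false;
            PageKey other = (PageKey) o;
            return size == other.size && from == other.from;
        }

        @Override
        public int hashCode() {
            return Objects.hash(size, from);
        }
    }

    private static final class Entry {
        final Object[] sortValues;
        final long expiresAtMillis;

        Entry(Object[] sortValues, long expiresAtMillis) {
            this.sortValues = sortValues;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Map<PageKey, Entry>> byQuery = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so that expiry can be tested

    SortValuesCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(String query, int size, int from, Object[] lastSortValues) {
        byQuery.computeIfAbsent(query, q -> new ConcurrentHashMap<>())
                .put(new PageKey(size, from),
                        new Entry(lastSortValues.clone(), clock.getAsLong() + ttlMillis));
    }

    // Returns the cached sort values, or null when absent or older than the TTL.
    Object[] get(String query, int size, int from) {
        Map<PageKey, Entry> pages = byQuery.get(query);
        if (pages == null) return null;
        PageKey key = new PageKey(size, from);
        Entry entry = pages.get(key);
        if (entry == null) return null;
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            pages.remove(key); // expired: drop lazily on read
            return null;
        }
        return entry.sortValues.clone();
    }

    public static void main(String[] args) {
        SortValuesCache cache = new SortValuesCache(3_600_000L, System::currentTimeMillis);
        cache.put("{\"match_all\":{}}", 100, 1000, new Object[] {"01099"});
        System.out.println(cache.get("{\"match_all\":{}}", 100, 1000)[0]);
    }
}
```

Keying the outer map on the query without its pagination part is what lets successive pages of the same query share entries.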
If the previous page for the current query is not in the cache, the repository paginates from the last page under the max result window up to the current page and stores the last sort values of each page in the cache for future use. In the typical scenario, a BrAPI client fetches a page for a given query/criteria and the repository stores the sort values of that page's last document. When the client then requests the next page, the repository already has in cache everything it needs to fetch it.
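The cache-miss walk can be sketched as follows, again against an in-memory sorted list instead of a real cluster. All names are hypothetical, the cache is reduced to a plain map for a single query and page size, and `from` is assumed to be a multiple of a page size that does not exceed the window. The `searchCalls` counter is there only to show how many searches a cold deep page costs compared with the next sequential page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the cache-miss walk, for one query and one page size.
class CacheFillSketch {
    static final int MAX_RESULT_WINDOW = 1000; // limit quoted in this description

    private final List<String> sortedIds; // in-memory stand-in for the index, sorted by _id
    private final Map<Integer, String> lastIdByFrom = new HashMap<>(); // from => last _id of that page
    int searchCalls = 0; // number of emulated searches

    CacheFillSketch(List<String> sortedIds) {
        this.sortedIds = sortedIds;
    }

    private List<String> searchFromSize(int from, int size) {
        searchCalls++;
        int start = Math.min(from, sortedIds.size());
        int end = Math.min(start + size, sortedIds.size());
        return new ArrayList<>(sortedIds.subList(start, end));
    }

    private List<String> searchAfter(String lastId, int size) {
        searchCalls++;
        List<String> hits = new ArrayList<>();
        for (String id : sortedIds) {
            if (hits.size() == size) break;
            if (id.compareTo(lastId) > 0) hits.add(id);
        }
        return hits;
    }

    // Stores the page's last _id and returns it, or null for an empty page.
    private String remember(int from, List<String> hits) {
        if (hits.isEmpty()) return null;
        String last = hits.get(hits.size() - 1);
        lastIdByFrom.put(from, last);
        return last;
    }

    // Last _id of the page starting at `from`, walking forward from the nearest
    // cached page or from the last page under the window when nothing is cached.
    private String lastIdOfPage(int from, int size) {
        int start = from;
        while (start > 0 && !lastIdByFrom.containsKey(start) && start + size > MAX_RESULT_WINDOW) {
            start -= size;
        }
        String lastId = lastIdByFrom.get(start);
        if (lastId == null) {
            lastId = remember(start, searchFromSize(start, size));
        }
        for (int f = start + size; f <= from && lastId != null; f += size) {
            lastId = remember(f, searchAfter(lastId, size));
        }
        return lastId;
    }

    List<String> findPage(int from, int size) {
        List<String> hits;
        if (from + size <= MAX_RESULT_WINDOW) {
            hits = searchFromSize(from, size);
        } else {
            String previousLast = lastIdOfPage(from - size, size);
            hits = previousLast == null ? new ArrayList<>() : searchAfter(previousLast, size);
        }
        remember(from, hits);
        return hits;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(String.format("%05d", i));
        CacheFillSketch repo = new CacheFillSketch(ids);
        repo.findPage(1500, 100); // cold: walks from the last page under the window
        System.out.println("searches after cold deep page: " + repo.searchCalls);
        repo.findPage(1600, 100); // warm: previous page is cached
        System.out.println("searches after next page: " + repo.searchCalls);
    }
}
```

In this sketch a cold request costs one search per page between the window and the target, while a client that walks pages in order pays a single search per page.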
Remaining TODOs
- Performance cost estimation
  - Verify memory/CPU consumption for both the web server and the ES cluster
  - Verify that the performance of document search under the max result window is the same as, or close to, the code in master
  - Verify that the performance of document search over the max result window is reasonable
- Add extensive repository unit tests with mocks (and maybe even functional tests accessing the real ES cluster?)
- Validate or not this approach