Guillaume Cornut requested to merge feat/deep_result_window_pagination into master May 13, 2019

For the moment, pagination on every API is blocked from going beyond the max result window of Elasticsearch (pageSize * page < 1000).

@raphael.flores @celia.michotey @cyril.pommier @jeremy.destin

In this branch, I propose a solution to implement deep pagination on BrAPI calls. We've had problems with BrAPI clients trying to harvest our big datasets, and we'd like to propose a reasonable solution without having to change all of the BrAPI.
For now, pagination on the data discovery calls is still restricted in this branch (see the @MaxResultWindowPagination annotation on DataDiscoveryCriteriaImpl.java).

Implementation

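To fetch data beyond the max result window, we can use either the scroll API or the search_after API. The scroll API only iterates forward through a single stateful cursor and cannot serve pages in random access, whereas search_after is stateless and can fetch any page once the sort values of the previous page's last document are known, so I chose to implement deep pagination with the search_after API.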

The search_after API requires the results to be sorted (here we sort on _id by default) and needs the sort values of the previous page's last document.
The ESGenericFindRepository, which is in charge of finding documents by criteria, now detects when a request goes beyond the ES max result window and will:

Get the last sort values of the previous page
Get the current page using ES search_after, filled with the previous page's last sort values

When the requested page is under the ES max result window, the repository does a simple search with size/from pagination and does not use search_after.
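
To illustrate the choice between the two pagination modes, here is a minimal sketch of how the pagination part of a search body could be built. It is not the actual ESGenericFindRepository code: the class PaginationRequestSketch, its method names, the zero-based page index and the window value of 1000 are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the actual ESGenericFindRepository code.
final class PaginationRequestSketch {
    // Assumed limit, taken from the "pageSize * page < 1000" rule in this description.
    static final int MAX_RESULT_WINDOW = 1000;

    /** True when the (assumed zero-based) page ends beyond the max result window. */
    static boolean needsSearchAfter(int page, int pageSize) {
        return (long) (page + 1) * pageSize > MAX_RESULT_WINDOW;
    }

    /** Builds the pagination part of a search body as a plain map (the query itself is omitted). */
    static Map<String, Object> buildBody(int page, int pageSize, Object[] previousPageLastSortValues) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("size", pageSize);
        // search_after needs a deterministic sort; this description sorts on _id by default.
        body.put("sort", Collections.singletonList(Collections.singletonMap("_id", "asc")));
        if (needsSearchAfter(page, pageSize)) {
            if (previousPageLastSortValues == null) {
                throw new IllegalArgumentException("search_after needs the previous page's last sort values");
            }
            // Deep page: resume right after the last document of the previous page.
            body.put("search_after", Arrays.asList(previousPageLastSortValues));
        } else {
            // Shallow page: plain size/from pagination.
            body.put("from", page * pageSize);
        }
        return body;
    }
}
```

With a page size of 100, pages 0 to 9 would get a from offset and no search_after, while page 10 and later would get a search_after array and no from.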

To be efficient, a cache (with a TTL of 1 hour) stores the last sort values of every page for every query.

The cache is structured as such: QueryWithoutPagination(String) => Page{size,from} => LastSortValues(Object[]) (see ESGenericFindRepository).
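
The nested structure can be pictured with the following minimal sketch. It only mirrors the shape described here (query string, then page, then last sort values) with a simple expiry check; the class SortValuesCacheSketch, the explicit nowMillis parameter and the lazy eviction on read are assumptions for the example, and the real cache in ESGenericFindRepository may be built differently.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache for illustration; the real one lives in ESGenericFindRepository.
final class SortValuesCacheSketch {
    /** Page key, identified by its size and from offset. */
    static final class Page {
        final int size;
        final int from;
        Page(int size, int from) { this.size = size; this.from = from; }
        @Override public boolean equals(Object o) {
            return o instanceof Page && ((Page) o).size == size && ((Page) o).from == from;
        }
        @Override public int hashCode() { return Objects.hash(size, from); }
    }

    /** Cached value together with its expiry time. */
    private static final class Entry {
        final Object[] lastSortValues;
        final long expiresAtMillis;
        Entry(Object[] lastSortValues, long expiresAtMillis) {
            this.lastSortValues = lastSortValues;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    // QueryWithoutPagination(String) => Page{size,from} => LastSortValues(Object[])
    private final Map<String, Map<Page, Entry>> byQuery = new ConcurrentHashMap<>();

    SortValuesCacheSketch(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Stores the last sort values of a page; the clock is passed in to keep the sketch testable. */
    void put(String queryWithoutPagination, Page page, Object[] lastSortValues, long nowMillis) {
        byQuery.computeIfAbsent(queryWithoutPagination, q -> new ConcurrentHashMap<>())
               .put(page, new Entry(lastSortValues, nowMillis + ttlMillis));
    }

    /** Returns the cached sort values, or null when absent or expired. */
    Object[] get(String queryWithoutPagination, Page page, long nowMillis) {
        Map<Page, Entry> pages = byQuery.get(queryWithoutPagination);
        if (pages == null) return null;
        Entry entry = pages.get(page);
        if (entry == null) return null;
        if (nowMillis >= entry.expiresAtMillis) {
            pages.remove(page, entry); // lazy eviction on read
            return null;
        }
        return entry.lastSortValues;
    }
}
```

Keying the outer map on the query without its pagination parameters is what lets successive pages of the same query share one inner map.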

If the previous page for the current query is not in the cache, the repository will paginate from the last page under the max result window up to the current page and store the last sort values of each page in the cache for future use. In the typical scenario, the BrAPI client fetches a page for a given query/criteria, the repository stores the sort values of this page's last document, and when the client requests the next page, the repository already has in cache everything it needs to fetch it.
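
The cache-miss walk described above can be sketched as follows. This is a simplified model under stated assumptions, not the repository code: the ES index is replaced by an in-memory list of ids in ascending order, the cache covers a single query and is keyed by a zero-based page index, the window is assumed to be 1000, and the class DeepPageWalkSketch and all its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration: an in-memory list stands in for the ES index.
final class DeepPageWalkSketch {
    static final int MAX_RESULT_WINDOW = 1000; // assumed limit

    private final List<String> sortedIds; // document ids in ascending sort order
    // Cache for one query: zero-based page index -> sort value of that page's last document.
    private final Map<Integer, String> lastSortValueByPage = new HashMap<>();
    int searchCount = 0; // number of emulated ES searches, for inspection

    DeepPageWalkSketch(List<String> sortedIds) { this.sortedIds = sortedIds; }

    /** Emulates size/from pagination. */
    private List<String> searchFromSize(int from, int size) {
        searchCount++;
        int start = Math.min(from, sortedIds.size());
        int end = Math.min(start + size, sortedIds.size());
        return new ArrayList<>(sortedIds.subList(start, end));
    }

    /** Emulates search_after: the first `size` ids strictly greater than `after`. */
    private List<String> searchAfter(String after, int size) {
        searchCount++;
        List<String> hits = new ArrayList<>();
        for (String id : sortedIds) {
            if (id.compareTo(after) > 0) {
                hits.add(id);
                if (hits.size() == size) break;
            }
        }
        return hits;
    }

    /** Caches the last sort value of a non-empty page and returns it (null when empty). */
    private String remember(int page, List<String> hits) {
        if (hits.isEmpty()) return null;
        String last = hits.get(hits.size() - 1);
        lastSortValueByPage.put(page, last);
        return last;
    }

    List<String> fetchPage(int page, int size) {
        if (size < 1 || size > MAX_RESULT_WINDOW || page < 0) {
            throw new IllegalArgumentException("invalid page or size");
        }
        if ((long) (page + 1) * size <= MAX_RESULT_WINDOW) {
            List<String> hits = searchFromSize(page * size, size);
            remember(page, hits);
            return hits;
        }
        String after = lastSortValueByPage.get(page - 1);
        if (after == null) {
            // Cache miss: restart from the last page under the window, then walk forward.
            int anchor = MAX_RESULT_WINDOW / size - 1;
            List<String> hits = searchFromSize(anchor * size, size);
            after = remember(anchor, hits);
            for (int p = anchor + 1; p < page && after != null; p++) {
                hits = searchAfter(after, size);
                after = remember(p, hits);
            }
            if (after == null) return new ArrayList<>(); // ran past the end of the results
        }
        List<String> hits = searchAfter(after, size);
        remember(page, hits);
        return hits;
    }
}
```

In this model, a cold request for page 12 with a page size of 100 costs four searches (one size/from search for page 9, then search_after for pages 10, 11 and 12), and the follow-up request for page 13 costs a single search because page 12 is already cached.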

Remaining TODOs

Performance cost estimation
- Verify memory/CPU consumption for both web server and ES cluster
- Verify that the performance of document search under the max result window is the same as or close to the code in master
- Verify that the performance of document search over the max result window is reasonable
Add extensive repository unit tests with mocks (and maybe even functional tests that access a real ES cluster?)
Validate or reject this approach

Edited Dec 19, 2024 by LAKMOURI NAJWA

WIP: Deep result window pagination (getting over the ES max result window)