# Rare project - Data discovery

## Setup

### Backend

The project uses Spring (5.x) for the backend,
with Spring Boot.

You need to install:

- a recent enough JDK8

The application expects to connect to an Elasticsearch instance running on `http://127.0.0.1:9300`,
in a cluster named `es-rare`.
To start such an instance, simply run:

    docker-compose up

This starts Elasticsearch and a Kibana instance (which lets you explore the data on http://localhost:5601).

Then at the root of the application, run `./gradlew build` to download the dependencies.
Then run `./gradlew bootRun` to start the app.

You can stop the Elasticsearch and Kibana instances by running:

    docker-compose stop

### Frontend

The project uses Angular (6.x) for the frontend,
with the Angular CLI.

You need to install:

- a recent enough NodeJS (8.11+)
- Yarn as a package manager (see [here to install](https://yarnpkg.com/en/docs/install))

Then in the `frontend` directory, run `yarn` to download the dependencies.
Then run `yarn start` to start the app, using the proxy conf to reroute calls to `/api` to the backend.

The application will be available on http://localhost:4200/rare

## Build

To build the app, just run:

    ./gradlew assemble

This will build a standalone jar at `backend/build/libs/rare.jar`, that you can run with:

    java -jar backend/build/libs/rare.jar

And the full app runs on http://localhost:8080/rare


## CI

The `.gitlab-ci.yml` file describes how GitLab runs the CI jobs.

It uses a base Docker image named `ninjasquad/docker-rare`
available on [DockerHub](https://hub.docker.com/r/ninjasquad/docker-rare/)
and [GitHub](https://github.com/Ninja-Squad/docker-rare).
The image is based on `openjdk:8` and adds a Chrome binary to let us run the frontend tests
(with a headless Chrome in `--no-sandbox` mode).

We install `node` and `yarn` in `/tmp` (this is not the case for local builds)
to avoid symbolic link issues on Docker.

You can approximate what runs on CI by executing:

    docker run --rm -v "$PWD":/home/rare -w /home/rare ninjasquad/docker-rare ./gradlew build

## Harvest

Harvesting (i.e. importing genetic resources stored in JSON files into Elasticsearch) consists of
creating the necessary index and aliases, and then placing the JSON files into a directory where the server can find them.

To create the index and its aliases, execute the script:

    ./scripts/createIndexAndAliases.sh

By default, the directory is `/tmp/rare/resources`. It is externalized into the Spring Boot property
`rare.resource-dir`, so it can easily be changed by modifying the value of this property (using an
environment variable, for example).
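
As a minimal sketch of the environment variable approach, the helper below derives the variable name from the property name. It assumes the relaxed binding rules documented for Spring Boot 2.x (dots become underscores, dashes are removed, letters are upper-cased). The function name `to_env_name` is only illustrative and is not part of the project.

```shell
# Derive the environment variable name matching a Spring Boot property,
# assuming the documented relaxed binding rules:
# replace dots with underscores, remove dashes, convert to uppercase.
to_env_name() {
  printf '%s\n' "$1" | sed -e 's/\./_/g' -e 's/-//g' | tr '[:lower:]' '[:upper:]'
}

# Print the variable name to use for the resource directory property.
to_env_name rare.resource-dir
```

With that name, the application could for instance be started with `RARE_RESOURCEDIR=/data/rare/resources java -jar backend/build/libs/rare.jar`, where `/data/rare/resources` is only an example location.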

You can run the script:

    ./scripts/harvestRare.sh
    
to trigger a harvest of the resources stored in the Git LFS directory `data/rare`.
You can of course do the same for WheatIS with `./scripts/harvestWheatis.sh`.
    
The files must have the extension `.json`, and must be stored in that directory (not in a sub-directory).
Once the files are ready and the server is started, the harvest is triggered by sending a POST request
to the endpoint `/api/harvests`, as described in the API documentation that you can generate using the
build task `asciidoctor`, which executes tests and generates documentation based on snippets generated
by these tests. The documentation is generated at `backend/build/asciidoc/html5/index.html` by running:

    ./gradlew asciidoctor
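
The POST request mentioned above could be sent with curl, roughly as sketched below. The base URL `http://localhost:8080/rare` is an assumption taken from the Build section, and the helper names `harvest_url` and `trigger_harvest` are hypothetical. The generated API documentation remains the reference for any required credentials, headers or body.

```shell
# Build the harvest endpoint URL from a base URL.
# The default base URL is an assumption (the one given in the Build section).
harvest_url() {
  base_url=${1:-http://localhost:8080/rare}
  # Strip a trailing slash, if any, before appending the endpoint path.
  printf '%s/api/harvests\n' "${base_url%/}"
}

# Send the POST request that triggers a harvest, and print the response
# headers along with the body. Nothing is sent until this function is called.
trigger_harvest() {
  curl -i -X POST "$(harvest_url "${1:-}")"
}
```

Calling `trigger_harvest` targets the RARe application; `trigger_harvest http://localhost:8080/wheatis` would target a WheatIS application served under the `/wheatis` context path.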

## Indices and aliases

The application uses two physical indices: 

 * one to store the harvest results. This one is created automatically if it doesn't exist yet when the application starts.
   It doesn't contain important data, and can be deleted and recreated if really needed.
 * one to store physical resources. This one must be created explicitly before using the application. If not,
 requests to the web services will return errors.

The application doesn't use the physical resources index directly. Instead, it uses two aliases, that must be created 
before using the application:

 * `rare-resource-index` is the alias used by the application to search for genetic resources
 * `rare-resource-harvest-index` is the alias used by the application to store genetic resources when the harvest is triggered.
 
In normal operations, these two aliases should refer to the same physical resource index. The script
`createIndexAndAliases.sh` creates a physical index (named `rare-resource-physical-index`) and creates these two aliases 
referring to this physical index.

Once the index and the aliases have been created, a harvest can be triggered. The first operation that a harvest
does is to create or update (put) the mapping for the genetic resource entity into the index aliased by `rare-resource-harvest-index`. 
Then it parses the JSON files and stores them into this same index. Since the `rare-resource-index` alias 
normally refers to the same physical index, searches will find the resources stored by the harvester.

### Why two aliases

Using two aliases is useful when deleting obsolete documents. This is actually done by removing everything
and then harvesting the new JSON files again, to re-populate the index from scratch.

Two scenarios are possible:

#### Deleting with some downtime

The harvest duration depends on the performance of Elasticsearch, on the performance of the harvester and,
of course, on the number of documents to harvest. If you don't mind a period of time
during which the documents are not available, you can simply:

 - delete the physical index;
 - re-create it with its aliases;
 - trigger a new harvest.
 
Keep in mind that, with the current known set of documents (17172), on a development machine where everything is running
concurrently, when both the Elasticsearch server and the application are hot, a harvest only takes 12 seconds.
So, even if you have 10 times that number of documents (170K documents), it should only take around 2 minutes of downtime.
If you have 100 times that number of documents (1.7M documents), it should take around 20 minutes, which is still not a 
very long time.

(Your mileage may vary: I assumed a linear complexity here).
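
The estimates above can be reproduced with a small calculation. This is only a sketch of the same linear extrapolation, based on the single measurement quoted above (17172 documents in 12 seconds); the function name is hypothetical.

```shell
# Estimate the harvest duration, in whole seconds, for a given number of
# documents, assuming the duration grows linearly with the document count
# and using the measurement quoted above (17172 documents in 12 seconds).
estimate_harvest_seconds() {
  documents=$1
  echo $(( documents * 12 / 17172 ))
}

estimate_harvest_seconds 171720    # 10 times the current set of documents
estimate_harvest_seconds 1717200   # 100 times the current set of documents
```

The two calls print 120 and 1200 seconds, i.e. the 2 and 20 minutes mentioned above.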

Here are curl commands illustrating the above scenario:
```
# delete the physical index and its aliases
curl -X DELETE "localhost:9200/rare-resource-physical-index"

# recreate the physical index and its aliases
curl -X PUT "localhost:9200/rare-resource-physical-index" -H 'Content-Type: application/json' -d'
{
    "aliases" : {
        "rare-resource-index" : {},
        "rare-resource-harvest-index" : {}
    },
    "settings": ...
}
'
```

**NOTE**: Every time a physical index is created, the settings must be specified, the same way as in the
`createIndexAndAliases.sh` script. The exact content of the settings is omitted here for brevity and readability.

#### Deleting with no downtime

If you don't want any downtime, you can instead use the following procedure:

 - create a new physical index (let's name it `rare-resource-new-physical-index`);
 - delete the `rare-resource-harvest-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`;
 - trigger a harvest. During the harvest, the `rare-resource-index` alias, used by the search,
   still refers to the old physical index, and it thus still works flawlessly;
 - once the harvest is finished, delete the `rare-resource-index` alias, and recreate it so that it refers to 
   `rare-resource-new-physical-index`. All the search operations will now use the new index, containing up-to-date
   documents;
 - delete the old physical index.
 
Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-resource-harvest-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-harvest-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-harvest-index" } }
    ]
}
'

# once the harvest is finished, delete the `rare-resource-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-index" } }
    ]
}
'

# delete the old physical index
curl -X DELETE "localhost:9200/rare-resource-physical-index"
```
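
Before deleting the old physical index, it can be worth checking that the search alias really refers to the new one. The sketch below relies on the Elasticsearch `GET /_alias/<alias>` endpoint, whose JSON response is keyed by the name of the index the alias points to. The helper names are hypothetical, and the `grep` test is a rough textual check rather than a real JSON parser.

```shell
# Succeed if the JSON read on standard input (the response of
# GET /_alias/<alias>) contains the given index name as an object key.
response_mentions_index() {
  grep -q "\"$1\"[[:space:]]*:"
}

# Succeed if the given alias currently refers to the given physical index.
# Nothing is requested from Elasticsearch until this function is called.
alias_points_to() {
  curl -s "localhost:9200/_alias/$1" | response_mentions_index "$2"
}
```

For example, `alias_points_to rare-resource-index rare-resource-new-physical-index && echo "safe to delete the old index"` only prints the message once the alias has been switched.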
 
### Mapping migration

Another situation where you might need to reindex all the documents is when the mapping has changed and a new version
of the application must be redeployed. 

#### Upgrading with some downtime

This is the easiest and safest procedure, which I would recommend:

 - create a new physical index (let's name it `rare-resource-new-physical-index`);
 - delete the `rare-resource-harvest-index` and the `rare-resource-index` aliases, and recreate them both so that they refer to 
   `rare-resource-new-physical-index`;
 - stop the existing application, deploy and start the new one;
 - trigger a harvest;
 - once everything is running fine, remove the old physical index.
 
In case anything goes wrong, the two aliases can always be recreated to refer to the old physical index, and the old
application can be restarted.

Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-resource-harvest-index` and the `rare-resource-index` aliases, and recreate them both so that they refer to `rare-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-harvest-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-harvest-index" } },
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-index" } }
    ]
}
'

# once everything is running fine, remove the old physical index.
curl -X DELETE "localhost:9200/rare-resource-physical-index"
```
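
If something goes wrong before the old physical index has been removed, the rollback mentioned above (recreating the two aliases so that they refer to the old physical index) could look like the following sketch. The helper names are hypothetical; the index and alias names are the ones used in this section.

```shell
# Print the body of an _aliases request that moves both aliases
# from one physical index ($1) back to another one ($2).
rollback_body() {
  from_index=$1
  to_index=$2
  printf '{"actions":['
  separator=''
  for alias_name in rare-resource-harvest-index rare-resource-index; do
    printf '%s{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}' \
      "$separator" "$from_index" "$alias_name" "$to_index" "$alias_name"
    separator=','
  done
  printf ']}\n'
}

# Point both aliases back to the old physical index.
# Nothing is sent to Elasticsearch until this function is called.
rollback_aliases() {
  curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' \
    -d "$(rollback_body rare-resource-new-physical-index rare-resource-physical-index)"
}
```

After calling `rollback_aliases`, the old application can be restarted against the old index.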

#### Upgrading with a very short downtime (or no downtime at all)

 - create a new physical index (let's name it `rare-resource-new-physical-index`);
 - delete the `rare-resource-harvest-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`;
 - start the new application, on another machine, or on a different port, so that the new application code can be
   used to trigger a harvest with the new schema, while the old application is still running and exposed to the users;
 - trigger the harvest on the **new** application;
 - once the harvest is finished, delete the `rare-resource-index` alias, and recreate it so that it refers to
   `rare-resource-new-physical-index`;
 - expose the new application to the users instead of the old one;
 - stop the old application.
 
How you execute these various steps depends on the production infrastructure, which is unknown to me. You could
use your own development server to start the new application and do the harvest, and then stop the production application,
deploy the new one and start it. Or you could have a reverse proxy in front of the application, and change its 
configuration to route to the new application once the harvest is done, for example.

Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-resource-harvest-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-harvest-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-harvest-index" } }
    ]
}
'

# once the harvest is finished, delete the `rare-resource-index` alias, and recreate it so that it refers to `rare-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-resource-physical-index", "alias" : "rare-resource-index" } },
        { "add" : { "index" : "rare-resource-new-physical-index", "alias" : "rare-resource-index" } }
    ]
}
'
```
    
## Spring Cloud config

On bootstrap, the application will try to connect to a remote Spring Cloud config server
to fetch its configuration.
The details of this remote server are filled in the `bootstrap.yml` file.
By default, it tries to connect to the remote server on http://localhost:8888
but it can of course be changed, or even configured via the `SPRING_CONFIG_URI` environment variable.

It will try to fetch the configuration for the application name `rare`, and the default profile.
If such a configuration is not found, it will then fall back to the local `application.yml` properties.
To avoid running the Spring Cloud config server every time when developing the application,
all the properties are still available in `application.yml` even if they are configured on the remote Spring Cloud server as well.
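
To check what the remote server would return, the configuration can be requested by hand. The sketch below assumes a standard Spring Cloud Config server, which serves configurations under `/{application}/{profile}`. The helper names are hypothetical, and the defaults (`rare`, `default`, `http://localhost:8888`) are the values mentioned above.

```shell
# Build the URL of the remote configuration, honoring SPRING_CONFIG_URI
# when it is set and falling back to the default server location otherwise.
config_url() {
  base_uri=${SPRING_CONFIG_URI:-http://localhost:8888}
  application=${1:-rare}
  profile=${2:-default}
  printf '%s/%s/%s\n' "${base_uri%/}" "$application" "$profile"
}

# Fetch the configuration as JSON.
# Nothing is requested until this function is called.
fetch_remote_config() {
  curl -s "$(config_url "${1:-}" "${2:-}")"
}
```

If `fetch_remote_config` fails or returns nothing useful, the application should behave as described above and use the local `application.yml` properties.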

If you want to use the Spring Cloud config app locally, 
see https://forgemia.inra.fr/urgi-is/data-discovery-config

The configuration is currently only read on startup,
meaning the application has to be restarted if the configuration is changed on the Spring Cloud server.
For a dynamic reload without restarting the application, 
see http://cloud.spring.io/spring-cloud-static/Finchley.SR1/single/spring-cloud.html#refresh-scope
to check what has to be changed.

## Building other apps

By default, the built application is RARe. But this project actually allows building other
applications (WheatIS, for the moment, but more could come).

To build a different app, specify an `app` property when building. For example, to assemble
the WheatIS app, run the following command

    ./gradlew assemble -Papp=wheatis
    
You can also run the backend WheatIS application using

    ./gradlew bootRun -Papp=wheatis
    
Adding this property has the following consequences:

 - the generated jar file (in `backend/build/libs`) is named `wheatis.jar` instead of `rare.jar`;
 - the Spring active profile in `bootstrap.yml` is `wheatis-app` instead of `rare-app`;
 - the frontend application built and embedded inside the jar file is the WheatIS frontend application instead of the RARe frontend application, i.e. the frontend command `yarn build:wheatis` is executed instead of the command `yarn build:rare`.
 
Since the active Spring profile is different, all the properties specific to this profile
are applied. In particular:
 
 - the context path of the application is `/wheatis` instead of `/rare`; 
 - the resource directory where the JSON files to harvest are looked up is different;
 - the Elasticsearch prefix used for the index aliases is different.

See the `application.yml` file for details.
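
Putting the consequences above together, building and running a given app could be wrapped as sketched below. The helper names are hypothetical, and port 8080 is an assumption taken from the Build section.

```shell
# Path of the jar generated for a given app (rare by default).
jar_path() {
  printf 'backend/build/libs/%s.jar\n' "${1:-rare}"
}

# URL under which the app is expected to be served, assuming port 8080
# and a context path equal to the app name.
app_url() {
  printf 'http://localhost:8080/%s\n' "${1:-rare}"
}

# Assemble the requested app and start it. The app property is only
# passed for apps other than the default one.
# Nothing is built or started until this function is called.
build_and_run() {
  app=${1:-rare}
  if [ "$app" = rare ]; then
    ./gradlew assemble
  else
    ./gradlew assemble -Papp="$app"
  fi && java -jar "$(jar_path "$app")"
}
```

For example, `build_and_run wheatis` would assemble `backend/build/libs/wheatis.jar` and start it, after which the application should be reachable at the URL printed by `app_url wheatis`.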