# Rare project - Data discovery

## Contribute

If you want to contribute data to the federation, have a look at the [WheatIS/Plant guide](./HOW-TO-JOIN-WHEATIS-AND-PLANT-FEDERATIONS.md) or the [RARe guide](./HOW-TO-JOIN-RARe-FEDERATION.md).

If you want to contribute to the code, or simply install the program on-premise, keep reading below.

## Setup

### Backend

The project uses Spring (5.x) for the backend,
with Spring Boot.

You need to install:

- a recent enough JDK 8

The application expects to connect to an Elasticsearch instance running on `http://127.0.0.1:9200`.
To start such an instance, simply run:

    docker-compose up

This starts Elasticsearch and a Kibana instance (which lets you explore the data on http://localhost:5601).
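
Elasticsearch may need some time before it accepts connections. The following is a minimal sketch, not part of the project, of a helper that polls the URL mentioned above until it answers (for example from another terminal or from a script). The function name `wait_for_es` and its default values are assumptions chosen for illustration.

```shell
# Poll an Elasticsearch URL until it answers, or give up after a number of attempts.
# Usage: wait_for_es [url] [max_attempts] [delay_seconds]
wait_for_es() {
  local url="${1:-http://127.0.0.1:9200}"
  local max_attempts="${2:-30}"
  local delay="${3:-2}"
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -f: fail on HTTP errors, -o /dev/null: discard the response body
    if curl -s -f -o /dev/null "$url"; then
      echo "Elasticsearch is up at $url"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done

  echo "Elasticsearch did not answer at $url" >&2
  return 1
}

# Example (not executed here):
#   wait_for_es && ./gradlew bootRun
```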

Then at the root of the application, run `./gradlew build` to download the dependencies.
Then run `./gradlew bootRun` to start the app.

You can stop the Elasticsearch and Kibana instances by running:

    docker-compose stop

### Frontend

The project uses Angular (7.x) for the frontend, with the Angular CLI.

You need to install:

- a recent enough NodeJS (8.11+)
- Yarn as a package manager (see [here to install](https://yarnpkg.com/en/docs/install))

Then in the `frontend` directory, run `yarn` to download the dependencies.
Then run `yarn start` to start the app, using the proxy conf to reroute calls to `/api` to the backend.

The application will be available on:
- http://localhost:4000/rare-dev for RARe (runs with: `yarn start:rare` or simply `yarn start`)
- http://localhost:4100/wheatis-dev for WheatIS (runs with: `yarn start:wheatis`)

## Build

To build the app, just run:

    ./gradlew assemble

or

    ./gradlew assemble -Papp=wheatis

This builds a standalone jar at `backend/build/libs/rare.jar` or `backend/build/libs/wheatis.jar`, which you can run with:

    java -jar backend/build/libs/rare.jar
    java -jar backend/build/libs/wheatis.jar

The full app then runs on:

- http://localhost:8080/rare-dev
- http://localhost:8180/wheatis-dev


## CI

The `.gitlab-ci.yml` file describes how GitLab runs the CI jobs.

It uses a base Docker image named `urgi/docker-browsers`
available on [DockerHub](https://hub.docker.com/r/urgi/docker-browsers/)
and [INRA-MIA Gitlab](https://forgemia.inra.fr/urgi-is/docker-rare).
The image is based on `openjdk:8` and adds everything needed to run the tests
(i.e. a Chrome binary with a headless Chrome in `--no-sandbox` mode).

We install `node` and `yarn` in `/tmp` (this is not the case for local builds)
to avoid symbolic link issues on Docker.

You can approximate what runs on CI by executing:

    docker run --rm -v "$PWD":/home/rare -w /home/rare urgi/docker-browsers ./gradlew build

You can also run a `gitlab-runner` as GitLab CI would (minus the environment variables and caching system):

    gitlab-runner exec docker test

## Documentation

API documentation describing most of the web services can be generated using the
build task `asciidoctor`, which executes tests and generates documentation based on snippets generated
by these tests. The documentation is generated at `backend/build/asciidoc/html5/index.html`:

    ./gradlew asciidoctor

## Harvest

Harvesting (i.e. importing documents stored in JSON files into Elasticsearch) first requires
creating the necessary index, aliases and Elasticsearch templates.

To create the index and its aliases for the local dev environment, execute the script below:

    ./scripts/createIndexAndAliases.sh

This script is a wrapper around `./scripts/createIndexAndAliases4CI.sh`, which accepts parameters to create the
indices, aliases and so on in another (possibly remote) Elasticsearch, to fit a specific environment:

    ./scripts/createIndexAndAliases4CI.sh -host localhost -app rare -env dev

You can run the scripts:

    ./scripts/harvestRare.sh
    ./scripts/harvestWheatis.sh
    
to trigger a harvest of the resources stored in the Git LFS directories `data/rare` and `data/wheatis` respectively.

## Indices and aliases

The application uses several physical indices, which (at least the resources index) can be rolled over automatically based on the policies defined in the
`./backend/src/test/resources/fr/inra/urgi/datadiscovery/dao/*_policy.json` files. This is based on the
[Index Lifecycle Management](https://www.elastic.co/guide/en/elasticsearch/reference/6.6/index-lifecycle-management.html)
provided by Elasticsearch.

 * one to store physical resources, containing the main content
 * one to store suggestions, used only for the search type-ahead feature

Both indices must be created explicitly before using the application. If not, requests to the web services will return errors.
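
To avoid discovering a missing index through web service errors, its presence can be checked beforehand. The sketch below relies on the Elasticsearch behaviour that `HEAD /<name>` answers 200 when an index or alias exists and 404 otherwise. The function name `index_exists` and the default URL are assumptions for illustration, not part of the project scripts.

```shell
# Return 0 if the given index (or alias) exists, non-zero otherwise.
# Usage: index_exists <name> [elasticsearch_url]
index_exists() {
  local name="$1"
  local es_url="${2:-http://127.0.0.1:9200}"
  local status

  # -I sends a HEAD request; -w prints only the HTTP status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' -I "$es_url/$name")
  [ "$status" = "200" ]
}

# Example (not executed here): create the index and aliases only when missing.
#   index_exists rare-dev-resource-index || ./scripts/createIndexAndAliases.sh
```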

Each index and alias below refers to the `rare` application in the `dev` environment. Equivalent ones must be created for the `wheatis`
app in the `dev` environment, and likewise for the `beta` and `prod` environments. For brevity, only `rare-dev` is explained here.
{: .alert .alert-info}

The application doesn't use the physical resources index directly. Instead, it uses two aliases, that must be created 
before using the application:

 * `rare-dev-resource-index` is the alias used by the application to search for documents
 * `rare-dev-resource-harvest-index` is the alias used by the application to store documents when the harvest is triggered.
 
In normal operations, these two aliases should refer to the same physical resource index. The script
`createIndexAndAliases.sh` creates a physical index (named `rare-dev-resource-physical-index`) and creates these two aliases 
referring to this physical index.
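
The naming scheme can be summarized with a small helper. This is only a sketch: the function name `resource_names` is hypothetical, and it assumes that other applications and environments follow the same `<app>-<env>-` prefix as `rare-dev`, which should be checked against the scripts and `application.yml`.

```shell
# Print, one per line, the search alias, the harvest alias and the physical
# index name for an application and an environment.
# Usage: resource_names <app> <env>
resource_names() {
  local prefix="$1-$2"
  echo "${prefix}-resource-index"           # alias used to search
  echo "${prefix}-resource-harvest-index"   # alias used to harvest
  echo "${prefix}-resource-physical-index"  # physical index created by the script
}

resource_names rare dev
```

To see which physical index an alias currently points to, Elasticsearch exposes `GET /_alias/<alias>`, for example `curl "localhost:9200/_alias/rare-dev-resource-index"`.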

Once the index and the aliases have been created, a harvest can be triggered. The first operation that a harvest
does is to create or update (put) the mapping for the document entity into the index aliased by `rare-dev-resource-harvest-index`. 
Then it parses the JSON files and stores them into this same index. Since the `rare-dev-resource-index` alias 
normally refers to the same physical index, searches will find the resources stored by the harvester.

### Why two aliases

Using two aliases is useful when deleting obsolete documents. This is actually done by removing everything
and then harvesting the new JSON files again, to re-populate the index from scratch.

Two scenarios are possible:

#### Deleting with some downtime

The harvest duration depends on the performance of Elasticsearch, on the performance of the harvester, and,
of course, on the number of documents to harvest. If you don't mind having a period of time
where the documents are not available, you can simply:

 - delete the physical index;
 - re-create it with its aliases;
 - trigger a new harvest.
 
Keep in mind that, with the current known set of documents (17172), on a development machine where everything is running
concurrently, when both the Elasticsearch server and the application are hot, a harvest only takes 12 seconds.
So, even if you have 10 times that number of documents (170K documents), it should only take around 2 minutes of downtime.
If you have 100 times that number of documents (1.7M documents), it should take around 20 minutes, which is still not a 
very long time.

(Your mileage may vary: I assumed a linear complexity here).
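
The extrapolation above can be reproduced with a few lines of shell. This is a sketch of the same linear assumption, using the measured reference point (17172 documents in 12 seconds); the function name `estimate_harvest_seconds` is introduced here for illustration only.

```shell
# Estimate the harvest duration in seconds, assuming it grows linearly
# from the reference measurement (17172 documents harvested in 12 seconds).
# Usage: estimate_harvest_seconds <number_of_documents>
estimate_harvest_seconds() {
  local docs="$1"
  local ref_docs=17172
  local ref_seconds=12
  # Integer arithmetic: multiply before dividing to limit rounding error.
  echo $(( docs * ref_seconds / ref_docs ))
}

estimate_harvest_seconds 171720    # 10 times the reference set
estimate_harvest_seconds 1717200   # 100 times the reference set
```

The two calls print 120 and 1200, i.e. the 2 minutes and 20 minutes quoted above.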

Here are curl commands illustrating the above scenario:
```
# delete the physical index and its aliases
curl -X DELETE "localhost:9200/rare-dev-resource-physical-index"

# recreate the physical index and its aliases
curl -X PUT "localhost:9200/rare-dev-resource-physical-index" -H 'Content-Type: application/json' -d'
{
    "aliases" : {
        "rare-dev-resource-index" : {},
        "rare-dev-resource-harvest-index" : {}
    },
    "settings": ...
}
'
```

**NOTE**: Every time a physical index is created, the settings must be specified, the same way as in the 
`createIndexAndAliases.sh` script. The exact content of the settings is omitted here for brevity and readability.
{: .alert .alert-info}

#### Deleting with no downtime

If you don't want any downtime, you can instead use the following procedure:

 - create a new physical index (let's name it `rare-dev-resource-new-physical-index`);
 - delete the `rare-dev-resource-harvest-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`;
 - trigger a harvest. During the harvest, the `rare-dev-resource-index` alias, used by the search,
   still refers to the old physical index, and it thus still works flawlessly;
 - once the harvest is finished, delete the `rare-dev-resource-index` alias, and recreate it so that it refers to 
   `rare-dev-resource-new-physical-index`. All the search operations will now use the new index, containing up-to-date
   documents;
 - delete the old physical index.
 
Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-dev-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-dev-resource-harvest-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-harvest-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-harvest-index" } }
    ]
}
'

# once the harvest is finished, delete the `rare-dev-resource-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-index" } }
    ]
}
'

# delete the old physical index
curl -X DELETE "localhost:9200/rare-dev-resource-physical-index"
```
 
### Mapping migration

Another situation where you might need to reindex all the documents is when the mapping has changed and a new version
of the application must be redeployed. 

#### Upgrading with some downtime

This is the easiest and safest procedure, which I would recommend:

 - create a new physical index (let's name it `rare-dev-resource-new-physical-index`);
 - delete the `rare-dev-resource-harvest-index` and the `rare-dev-resource-index` aliases, and recreate them both so that they refer to 
   `rare-dev-resource-new-physical-index`;
 - stop the existing application, deploy and start the new one;
 - trigger a harvest;
 - once everything is running fine, remove the old physical index.
 
In case anything goes wrong, the two aliases can always be recreated to refer to the old physical index, and the old
application can be restarted.
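
The rollback mentioned above is not spelled out elsewhere, so here is a hedged sketch of what it could look like. It only builds the JSON body for the `_aliases` API, pointing both aliases back to the old physical index; the function name `rollback_aliases_payload` is hypothetical, and the alias names are the `rare-dev` ones used in this section. It only works as long as the old physical index has not been deleted.

```shell
# Print the JSON body that moves both aliases from the new physical index
# back to the old one.
# Usage: rollback_aliases_payload <old_index> <new_index>
rollback_aliases_payload() {
  local old_index="$1"
  local new_index="$2"
  local harvest_alias="rare-dev-resource-harvest-index"
  local search_alias="rare-dev-resource-index"

  cat <<EOF
{
  "actions": [
    { "remove": { "index": "$new_index", "alias": "$harvest_alias" } },
    { "add": { "index": "$old_index", "alias": "$harvest_alias" } },
    { "remove": { "index": "$new_index", "alias": "$search_alias" } },
    { "add": { "index": "$old_index", "alias": "$search_alias" } }
  ]
}
EOF
}

# Example (not executed here): send the body to Elasticsearch, reading it from stdin.
#   rollback_aliases_payload rare-dev-resource-physical-index rare-dev-resource-new-physical-index \
#     | curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d @-
```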

Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-dev-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-dev-resource-harvest-index` and the `rare-dev-resource-index` aliases, and recreate them both so that they refer to `rare-dev-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-harvest-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-harvest-index" } },
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-index" } }
    ]
}
'

# once everything is running fine, remove the old physical index.
curl -X DELETE "localhost:9200/rare-dev-resource-physical-index"
```

#### Upgrading with a very short downtime (or no downtime at all)

 - create a new physical index (let's name it `rare-dev-resource-new-physical-index`);
 - delete the `rare-dev-resource-harvest-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`;
 - start the new application, on another machine, or on a different port, so that the new application code can be
   used to trigger a harvest with the new schema, while the old application is still running and exposed to the users;
 - trigger the harvest on the **new** application;
 - once the harvest is finished, delete the `rare-dev-resource-index` alias, and recreate it so that it refers to 
   `rare-dev-resource-new-physical-index`;
 - expose the new application to the users instead of the old one;
 - stop the old application.
 
How you execute these various steps depends on the production infrastructure, which is unknown to me. You could
use your own development server to start the new application and do the harvest, and then stop the production application,
deploy the new one and start it. Or you could have a reverse proxy in front of the application, and change its 
configuration to route to the new application once the harvest is done, for example.

Here are curl commands illustrating the above scenario:
```
# create a new physical index
curl -X PUT "localhost:9200/rare-dev-resource-new-physical-index" -H 'Content-Type: application/json' -d'
{
  "settings": ...
}
'

# delete the `rare-dev-resource-harvest-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-harvest-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-harvest-index" } }
    ]
}
'

# once the harvest is finished, delete the `rare-dev-resource-index` alias, and recreate it so that it refers to `rare-dev-resource-new-physical-index`
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
    "actions" : [
        { "remove" : { "index" : "rare-dev-resource-physical-index", "alias" : "rare-dev-resource-index" } },
        { "add" : { "index" : "rare-dev-resource-new-physical-index", "alias" : "rare-dev-resource-index" } }
    ]
}
'
```
    
## Spring Cloud config

On bootstrap, the application will try to connect to a remote Spring Cloud config server to fetch its configuration.
The details of this remote server are set in the `bootstrap.yml` file.
By default, it tries to connect to the local server on http://localhost:8888,
but this can of course be changed, or even configured via the `SPRING_CONFIG_URI` environment variable.

It will try to fetch the configuration for the application name `rare`, and the default profile.
If such a configuration is not found, it will then fall back to the local `application.yml` properties.
To avoid running the Spring Cloud config server every time when developing the application,
all the properties are still available in `application.yml` even if they are configured on the remote Spring Cloud server as well.
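
To see what the config server would return for a given application and profile, one option is to query it directly. A Spring Cloud Config Server serves configurations under `/{application}/{profile}[/{label}]`; the sketch below only builds that URL. The function name `config_url` is an assumption for illustration and is not part of this project.

```shell
# Build the Spring Cloud Config Server URL for an application and a profile.
# Usage: config_url <server_uri> <application> [profile] [label]
config_url() {
  local server_uri="${1%/}"     # drop a trailing slash, if any
  local application="$2"
  local profile="${3:-default}"
  local label="${4:-}"

  if [ -n "$label" ]; then
    echo "$server_uri/$application/$profile/$label"
  else
    echo "$server_uri/$application/$profile"
  fi
}

config_url "${SPRING_CONFIG_URI:-http://localhost:8888}" rare

# Example (not executed here): fetch the configuration when the server is running.
#   curl "$(config_url http://localhost:8888 rare)"
```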

If you want to use the Spring Cloud config app locally, see https://forgemia.inra.fr/urgi-is/data-discovery-config

The configuration is currently only read on startup,
meaning the application has to be restarted if the configuration is changed on the Spring Cloud server.
For a dynamic reload without restarting the application, 
see http://cloud.spring.io/spring-cloud-static/Finchley.SR1/single/spring-cloud.html#refresh-scope
to check what has to be changed.

To test a configuration from the config server, you can use a dedicated branch of the `data-discovery-config` project 
and append the `--spring.cloud.config.label=<branch name to test>` parameter when starting the application's executable jar.
More info on how to pass a parameter to a Spring Boot app: 
https://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html#boot-features-external-config
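
As an illustration, the command line could be wrapped as follows. This is a minimal sketch: the function name `run_with_config_label` and the branch name `my-test-branch` in the example are hypothetical.

```shell
# Start an executable jar against a given branch of the configuration repository.
# Usage: run_with_config_label <jar_path> <branch>
run_with_config_label() {
  local jar="$1"
  local label="$2"
  java -jar "$jar" --spring.cloud.config.label="$label"
}

# Example (not executed here):
#   run_with_config_label backend/build/libs/rare.jar my-test-branch
```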

## Building other apps

By default, the built application is RARe. But this project actually allows building other
applications (WheatIS, for the moment, but more could come).

To build a different app, specify an `app` property when building. For example, to assemble
the WheatIS app, run the following command:

    ./gradlew assemble -Papp=wheatis
    
You can also run the backend WheatIS application using:

    ./gradlew bootRun -Papp=wheatis
    
Adding this property has the following consequences:

 - the generated jar file (in `backend/build/libs`) is named `wheatis.jar` instead of `rare.jar`;
 - the Spring active profile in `bootstrap.yml` is `wheatis-app` instead of `rare-app`;
 - the frontend application built and embedded inside the jar file is the WheatIS frontend application instead of the
 RARe frontend application, i.e. the frontend command `yarn build:wheatis` is executed instead of the command `yarn build:rare`.
 
Since the active Spring profile is different, all the properties specific to this profile
are applied. In particular:
 
 - the context path of the application is `/wheatis-dev` instead of `/rare-dev`; 
 - the Elasticsearch prefix used for the index aliases is different.

See the `backend/src/main/resources/application.yml` file for details.