urgi-is / data-discovery
Commit e9787b0e authored Aug 09, 2018 by Jean-Baptiste Nizet
fix: avoid adding English stopwords to the suggestions
fix #10
parent 07e8ad09
Changes: 2 files
backend/src/main/java/fr/inra/urgi/rare/domain/IndexedGeneticResource.java
@@ -7,10 +7,12 @@ import java.util.ArrayList;
 import java.util.Collection;
 import java.util.Collections;
 import java.util.List;
+import java.util.Locale;
 import java.util.Objects;
 import java.util.stream.Stream;

 import com.fasterxml.jackson.annotation.JsonUnwrapped;
+import org.apache.lucene.analysis.en.EnglishAnalyzer;
 import org.apache.lucene.analysis.standard.StandardTokenizer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.springframework.data.elasticsearch.annotations.Document;
@@ -103,8 +105,11 @@ public final class IndexedGeneticResource {
      * Uses the standard tokenizer of Lucene (which is itself used by ElasticSearch) to tokenize the description.
      * This makes sure that words in the index used by the full-text search are the same as the ones in the suggestions,
      * used to autocomplete terms. Otherwise, we could have suggestions that lead to no search result.
+     *
      * Note that words that are less than 3 characters-long are excluded from the suggestions, since it doesn't make
      * much sense to suggest those words, and since the UI only starts suggesting after 2 characters anyway.
+     *
+     * Words which, after being lowercased, belong to the set of English stopwords, are also excluded.
      */
     private Stream<String> extractTokensOutOfDescription(String description) {
         if (description == null) {
@@ -120,7 +125,7 @@ public final class IndexedGeneticResource {
         List<String> terms = new ArrayList<>();
         while (tokenizer.incrementToken()) {
             String word = termAttribute.toString();
-            if (word.length() > 2) {
+            if (word.length() > 2 && !EnglishAnalyzer.getDefaultStopSet().contains(word.toLowerCase(Locale.ENGLISH))) {
                 terms.add(word);
             }
         }
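The filtering rule introduced by this change (keep a token only if it is longer than 2 characters and its lowercased form is not an English stopword) can be sketched without Lucene. This is a minimal, self-contained illustration and not the project's code. The class name SuggestionFilterSketch, the method suggestionTerms, the regex-based split (a simplified stand-in for Lucene's StandardTokenizer) and the hand-written stopword subset (standing in for EnglishAnalyzer.getDefaultStopSet()) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SuggestionFilterSketch {

    // Hypothetical, hand-written subset of English stopwords. The commit itself
    // relies on Lucene's EnglishAnalyzer.getDefaultStopSet() instead.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "and", "are", "but", "for", "not", "that", "the", "this", "was", "will", "with"));

    // Returns the words of the description that would be kept as suggestions.
    static List<String> suggestionTerms(String description) {
        List<String> terms = new ArrayList<>();
        if (description == null) {
            return terms;
        }
        // Assumption: splitting on anything that is not a letter, a digit or an
        // underscore only approximates what Lucene's StandardTokenizer does.
        for (String word : description.split("[^\\p{L}\\p{N}_]+")) {
            // Same condition as the changed line: long enough, and not a stopword
            // once lowercased with a fixed locale.
            if (word.length() > 2 && !STOPWORDS.contains(word.toLowerCase(Locale.ENGLISH))) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // "the" and "With" are dropped as stopwords, "is" and "he" as too short.
        System.out.println(suggestionTerms(
            "Hello the world! How is he/she doing? With GrapeReSeq_Illumina_20K_experiment?"));
    }
}
```

Lowercasing with Locale.ENGLISH rather than the default locale keeps the stopword lookup stable on JVMs whose default locale has special casing rules (for example Turkish), which is presumably why the commit passes the locale explicitly.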
backend/src/test/java/fr/inra/urgi/rare/domain/IndexedGeneticResourceTest.java
@@ -26,7 +26,7 @@ class IndexedGeneticResourceTest {
             .withMaterialType(Arrays.asList("materialType"))
             .withPillarName("pillarName")
             .withSpecies(Arrays.asList("species"))
-            .withDescription("Hello world! How\n is he/she doing? Très bien. GrapeReSeq_Illumina_20K_experiment?")
+            .withDescription("Hello the world! How\n is he/she doing? Très bien. With GrapeReSeq_Illumina_20K_experiment?")
             .build();

         IndexedGeneticResource result = new IndexedGeneticResource(resource);