Maintenance - Mise à jour mensuelle Lundi 7 Décembre 2021 entre 7h00 et 9h00

README.md 36.2 KB
Newer Older
Helene Rimbert's avatar
Helene Rimbert committed
1
2
3
4
5
6
7
8
9
10
11
---
Date: July 6, 2018
Title: BWGS - BreedWheat Genomic Selection pipeline
Version: 1.12
Date: 2018-07-06
Author:
  - Gilles CHARMET
  - Louis Gautier TRAN
Description: Package for Breed Wheat Genomic Selection pipeline
---

Helene Rimbert's avatar
Helene Rimbert committed
12
# Package «  BWGS » for Windows/Linux  
Helene Rimbert's avatar
Helene Rimbert committed
13

Helene Rimbert's avatar
Helene Rimbert committed
14
15
**Maintainer:** <gilles.charmet@inra.fr>

Helene Rimbert's avatar
Helene Rimbert committed
16
17
18
19
**Depends:** R (>= 3.1.2)

**Imports:** BGLR, e1071, glmnet, randomForest, rrBLUP

Helene Rimbert's avatar
Helene Rimbert committed
20
**License:** [GPL(>=2)](./LICENSE.md)
Helene Rimbert's avatar
Helene Rimbert committed
21
22
23
24
25
26
27

**NeedsCompilation:** no

**Repository:** no

**Date of Publication:**

28
29
30
31
**Additional documents:**

+ [Tutorial](./BWGS_tutorial.pptx)
+ [Package Guide](./Package_BWGS_2.0_guide.docx)
Helene Rimbert's avatar
Helene Rimbert committed
32

33
34
## R topics documented:  
0. [Tutorial](#tuto)
Helene Rimbert's avatar
Helene Rimbert committed
35
36
37
38
39
40
41
42
43
44
45
46
47
1. [AM - Additive relationship matrix](#am)
2. [ANO - Selectio of N markers with the lowest Pvalues in GWAS](#ano)
3. [bwgs.cv - BreedWheat Genomic Selection Cross Validation](#bwgscv)
4. [bwgs.predict - BreedWheat Genomic Selection Prediction](#bwgspredict)
5. [CHROMLD - Marker selection by LD pruning within each chromosome](#chromld)
6. [EMI - Expectation-Maximisation Imputation](#emi)
7. [inra - data from INRA breeding data, genotyping and phenotyping](#inra)
8. [MNI - MeaN Inpute](#mni)
9. [optiTRAIN - Optimization of Training set by CDmean](#optitrain)
10. [qtlSIM - Simulation of QTL](#qtlsim)
11. [RMR - Random Marker Recontruction](#rmr)
12. [RPS - Random Pop Size](#rps)
 
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
## <a name="tuto"></a> Tutorial

```r
# script for BWGS tutorial



#YieldGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "gblup", nFolds=10, nTimes=10 )
#YieldLASSO <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "LASSO", nFolds=10, nTimes=10 )
#YieldBA <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "BA", nFolds=10, nTimes=10 )
#YieldRKHS <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "RKHS", nFolds=10, nTimes=10 )
#YieldEGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "EGBLUP", nFolds=10, nTimes=10 )

#compareM=cbind(YieldGBLUP$cv, YieldLASSO$cv, YieldBA$cv, YieldRKHS$cv, YieldEGBLUP$cv)
#colnames(compareM) = c("GBLUP","LASSO","BayesA","RKHS","EGBLUP")
#boxplot(compareM,xlab="Prediction method",ylab="predictive ability",main="Predictive ability of 5 methods. Yield with 47K markers")


#YieldGBLUP100 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=100, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 ) 
#YieldGBLUP300 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=300, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 ) 
#YieldGBLUP500 <-bwgs.cv (TRAIN47K, YieldBLUE, pop.reduct.method="RANDOM",sample.pop.size=500, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 )
 
#boxplot(cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv))

#CompareSize=cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv)
#colnames(CompareSize)=c("N=100","N=300","N=500","N=700")
#boxplot(CompareSize,xlab="Training POP size",ylab="Predictive avility",main="Effect of TRAINING POPULATION SIZE")

#testPREDICT_GBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP") 

#testPREDICT_EGBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="EGBLUP") 

#testPREDICT_LASSO=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="LASSO") 

#testPREDICT_RKHS=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="RKHS") 

#testPREDICT_BayesA=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="BA") 

#ComparePRED=cbind(testPREDICT_GBLUP[,1] ,testPREDICT_BayesA[,1] ,testPREDICT_LASSO[,1], testPREDICT_RKHS[,1], testPREDICT_EGBLUP[,1])
#colnames(ComparePRED=)c("GBLEP","BauesA","LASSO","RKHS","EGBLUP")
#pairs(ComparePRED,lower.panel = panel.smooth,upper.panel = panel.cor,diag.panel=panel.hist)

TRAIN47K_NO_NA=MNI(TRAIN47K)

datasim03 <- qtlSIM (TRAIN47K_NO_NA, NQTL=100,h2=0.3)
datasim05 <- qtlSIM (TRAIN47K_NO_NA,NQTL=100,h2=0.5)
datasim08 <- qtlSIM(TRAIN47K_NO_NA,NQTL=100,h2=0.8)

cbind(rownames(datasim03$newSNP),names(datasim03$pheno))

SIM03 <- bwgs.cv (datasim03$newSNP, datasim03$pheno, geno.impute.method="MNI", predict.method ="gblup", nTimes=20, nFolds=5) #
SIM05 <- bwgs.cv (datasim05$newSNP, datasim05$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes=20, nFolds=5) #
SIM08 <- bwgs.cv (datasim08$newSNP, datasim08$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes20, nFolds=5) #
CompareH2=cbind (SIM03,SIM05,SIM08)

colnames(CompareH2)=c("h²=0.3","h²=0.5","h²=0.8")
boxplot(CompareH2,xlab="Simulated Trait heritability",ylab="Predictive avility",main="Effect of TRAIT heritability")

```

Helene Rimbert's avatar
Helene Rimbert committed
113
## <a name="am"></a> AM - Additive relationship matrix
Helene Rimbert's avatar
Helene Rimbert committed
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153

**Description**

The `AM()` function calculates the realized additive relationship matrix
with the A.mat function of library rrBLUP.

**Usage**

```r
AM(geno)
```

**Arguments**


+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} or The missing data are allowed or coded as NA.|

**Value**

  AM returns a n x n matrix of additive relationships matrix estimated by Endelman's formula (Endelman et al 2012)


**Examples**

```r
library (bwgs)

data (inra)
geno47K_AM <- AM(geno47K)
```

**References**

Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the
realized relationship matrix.

G3:Genes, Genomes, Genetics. 2:1405-1413. doi: 10.1534/g3.112.004259
 
Helene Rimbert's avatar
Helene Rimbert committed
154
## <a name="ano"></a> ANO -  selection of N markers with the lowest Pvalue in GWAS
Helene Rimbert's avatar
Helene Rimbert committed
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
 

**Description**

The `ANO()` function tests every marker in turn with ONE-way ANOVA

**Usage**

```r
ANO (pheno , geno, pval)
```

**Arguments**

+ pheno

A vector of trait phenotype, possibly adjusted from other factors 

+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} or Missing data are allowed and  coded as NA.

+ pval

The threshold value for selecting markers whose pvalue is \<pval.

**Value**

ANO returns a n x m' matrix of markers which are significantly associated to pheno at the declared pval. m' \< m unless pval=1.
 
**Examples**

```r
Library (bwgs)

Data (inra)
geno_impote <- MNI (geno47K)
geno_shrink001 <- ANO(pheno, geno_impute, pval=0.001)
```

Helene Rimbert's avatar
Helene Rimbert committed
195
## <a name="bwgscv"></a>  bwgs.cv -  Breed Wheat Genomic Selection Cross Validation
Helene Rimbert's avatar
Helene Rimbert committed
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347

**Description**

The bwgs.cv function carries out cross-validation using genotypic and
phenotypic data from a reference population, with options for genotypic
matrix processing and genomic breeding value estimation.

**Usage**

```r
bwgs.cv (geno, pheno, MAXNA=0.2, MAF=0.05, pop.reduct.method="NULL",
sample.pop.size="NULL", geno.reduct.method="NULL",
reduct.marker.size="NULL", pval="NULL", r2="NULL", MAP="NULL",
geno.impute.method="NULL", predict.method="NULL", nFolds, nTimes)
```

**Arguments**

+ geno

Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded {-1, 0, 1}. Missing data are allowed and coded as NA.

+ pheno

Vector (n x 1) of "phenotypes", i.e. observations or pre-processed, corrected values. This vector should have no missing values, otherwise missing values (NA) will be omitted in both pheno and geno. In a first step, bwgs.cv checks  whether rownames(geno) match with  names(pheno). If not the case, the common elements (intersect)  are selected in both geno and  pheno for further analyses. If a  MAP file is provided, the  selected set of markers are also  sorted out in MAP.

+ MAXNA

The maximum proportion of missing  value which is admitted for filtering marker columns in geno. Default value is 0.2

+ MAF

The minimum allele frequency for filtering marker colums in geno;default value is 0.05

+ pop.reduct.method

Method for reducing the size of the training population. Can be used for teaching purposes, no real interest in real life if the entire population is already genotyped and phenotyped. Default value is NULL (all training set used).  
Proposed methods are:  
   - __RANDOM__: a subset of sample.pop.size is randomly selected for training the model, and the unselected part of the population is used for validation. The process is repeated nFolds \*  nTimes to have the same number of replicates than with cross-validation.  
   - __OPTI__: the optimization algorithm based on CDmean (Rincent et al 2012) to select a subset which maximizes average CD (coefficient of determination) in the validation set. Since the process is long and has some  stochastic component, it is  repeated only nTimes.  

+ sample.pop.size

The size of the subset of individuals in the training set (both geno and pheno) selected by pop.reduct.method if not NULL.

+ geno.impute.method 

Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:  
  - __MNI__: missing data are replaced by the mean allele frequency of the marker (column in geno)
  - __EMI__: missing data are replaced using an expectation-maximization methods described in function A.mat (Endelman & Janninck 2012).  
Default value is NULL  
Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS. More sophisticated software (e.g. Beagles, Browning & Browning 2016) should be used before BWGS.  

+ geno.reduct.method 

Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are:             |
 -   __RMR__ (with reduct.size):  random sampling of markers (columns in geno)  
 -   __LD__ (with r2 and MAP): enables "pruning" of markers which are in LD \> r2. Only the marker with the least missing value is kept for each pair in LD \>r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.  
 -   __ANO__ (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only markers with pvalue \< pval are kept for GEBV prediction
 -   __ANO+LD__ (with pval and r2, MAP is facultative) : combines a first step of maker selection with ANO, then a second step of pruning using LD option.
 
+ reduct.marker.size

Specifies the number of markers for the genotypic reduction using RMR (reduct.size \< m).

+ pval

p value for ANO method, 0 \< pval \< 1.

+ r2

Coefficient of linkage disequilibrium (LD). Setting 0\<r2\<1 if the genotypic reduction method is in \{LD or ANO+LD \}.

+ MAP

A file with markers in rows and at least ONE columns with colnames= "chrom". Used for computing r2 within linkage groups.

+ predict.method

The options for genomic breeding value prediction methods. The available options are:  
  -   __GBLUP__: performs G-BLUP using a marker-based relationship matrix, implemented through BGLR R-library. Equivalent to ridge regression (RR-BLUP) of marker effects.  
  -   __EGBLUP__: performs EG-BLUP, i.e. BLUP using a "squared" relationship matrix to model epistatic 2x2 interactions, as described by Jiang & Reif (2015), using BGLR library  
  -   __RR__: ridge regression, using package glmnet. In theory, strictly equivalent to gblup.  
  -   __LASSO__: Least Absolute Shrinkage and Selection Operator is another penalized regression methods which yield more shrinked estimates than RR. Run by glmnet library.  
  -   __EN__: Elastic Net \[Zou and  Hastie, 2005\], which is a weighted combination of RR and LASSO, using glmnet library  
  
Several Bayesian methods, using the BGLR library:  
  -   __BRR__: Bayesian ridge regression: same as rr-blup, but bayesian resolution. Induces homogeneous shrinkage of all markers effects towards zero with Gaussian distribution (de los Campos et al, 2013)  
  -   __BL__: Bayesian LASSO: uses an exponential prior on marker variances priors, leading to double exponential distribution of marker effects (Park & Casella 2008)  
  -   __BA__: Bayes A uses a scaled-t prior distribution of marker effects. (Meuwissen et al 2001).  
  -   __BB__: Bayes B, uses a mixture of distribution with a point mass at zero and with a slab of non-zero marker effects with a scaled-t distribution (Habier et al 2011).  
  -   __BC__: Bayes C same as Bayes B with a slab with Gaussian distribution.  

A more detailed description of these methods can be found in Perez & de los Campos 2014 ([[http://genomics.cimmyt.org/BGLR-extdoc.pdf]{.underline}](http://genomics.cimmyt.org/BGLR-extdoc.pdf).)  

Three semi-parametric methods:  
  -   __RKHS__: reproductive kernel Hilbert space and multiple kernel MRKHS, using BGLR (Gianola and van Kaam 2008).  Based on genetic distance and a kernel function to regulate the distribution of marker effects. This methods is claimed to be effective for detecting non additive effects.  
  -   __RF__: Random forest regression, using randomForest library (Breiman, 2001, Breiman and  Cutler 2013). This methods uses regression models on tree nodes which are rooted in bootstrapping data. Supposed to be able to capture interactions between markers  
  -   __SVM__: support vector machine, run by e1071 library. For details, see Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines [http://www.csie.ntu.edu.tw/\~cjlin/libsvm](http://www.csie.ntu.edu.tw/~cjlin/libsvm)  
  -   __BRNN__: Bayesian Regularization for feed-forward Neural Network, with the R-package BRNN (Gianola et al 2011). To  keep computing time in reasonable limits, the parameters for the brnn function are neurons=2 and epochs = 20.  

+ nFolds

Number of folds for the cross-validation. Smallest value recommended is nFolds = 3.

+ nTimes

Number of independent replicates for the cross-validation. Smallest value recommended is nTimes = 3.

**Value**

The class `bwgs.cv` returns a list containing:

| list item| Description|
|----------|------------|
| `$summary`                         | Summary of cross-validation, including mean and standard deviation of predictive ability (i.e. correlation between phenotype and GEBV, estimated on the validation fold, then averaged over replicates (nTimes), Time taken by the computation and number of markers |
| `$cv`                              | Vector of predictive abilities averaged over nFolds, for each of the nTimes replicates|
| `$sd`                              | Standard deviation of the nTimes predictive abilities|
| `$MSEP`                            |Square root of the mean-squared error of prediction, averaged over Ntimes |
| `$SDMSEP`                          |Standard deviation of the Square root of the mean-squared error of prediction, averaged over Ntimes  |
| `$bv`                              | Matrix of dimension nx4. Columns are: -   Real BV, i.e. pheno vector -   Predict BV: the nx1 vector of GEBVs -   gpreSD: Standart deviation of estimated GEBV -   CD: coefficient of determination for each GEBV, estimated as sqrt  ((1-stdev(GEBVi))²/σ^2^~g~) Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.

**Examples**

```R
library(bwgs)
                                                                       
data(inra)
                                                                       
 #GPLUB with RMR (reduct.marker.size = 5000)
                                                                       
testGBLUPRMR5000=bwgs.cv(geno47K,pheno,random.pop.size="NULL",geno. reduct.method="RMR",reduct.size=5000, geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTimes= 50)
                                                                       
#RKHS with marker selection by ANOVA (pval=0.001)
                                                                       
testRKHSANO<bwgs.cv(geno47K,pheno,geno.reduct.method="ANO",pval=0.001,geno.impute.method="mni",predict.method="rkhs", nFolds=10, nTimes=50)
                                                                       
 #Boxplot to compare prediction methods
                                                                       
 boxplot(cbind(testGBLUPRMR3000$cv, testRKHSANO$cv))
                                                                       
 # sampling training population
                                                                       
 testBWGSRPS300=bwgs.cv(geno_shrink0001,pheno,MAXNA=0.2,MAF=0.05,pop. reduct.method="RANDOM",sample.pop.size=300,geno.reduct.method="NUL L",reduct.marker.size="NULL",pval="NULL",r2="NULL",MAP="NULL ",geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTime s=50)
                                                                       
 # optimizing a subset for training
                                                                       
 testBWGSOPT300=bwgs.cv(geno_shrink0001,pheno,MAXNA=0.2,MAF=0.05,pop. reduct.method="OPTI",sample.pop.size=300,geno.reduct.method="NULL ",reduct.marker.size="NULL",pval="NULL",r2="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTimes= 5)
```

**References**

Helene Rimbert's avatar
Helene Rimbert committed
348
Breiman, L. 2001. "Random Forests". Machine Learning 45 (1): 5--32.
Helene Rimbert's avatar
Helene Rimbert committed
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397

Breiman L. and Cutler A. (2013) Breiman and Cutler's random forests for
classification and regression.

Package 'randomForest'.
http://stat-www.berkeley.edu/users/breiman/RandomForest

De los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., &
Calus, M. P. L. (2013). Whole-Genome Regression and Prediction Methods
Applied to Plant and Animal Breeding. Genetics, 193(2), 327--345.
[[http://doi.org/10.1534/genetics.112.143313]{.underline}](http://doi.org/10.1534/genetics.112.143313)

Gianola, D., & van Kaam, J. B. C. H. M. (2008). Reproducing Kernel
Hilbert Spaces Regression Methods for Genomic Assisted Prediction of
Quantitative Traits. Genetics, 178(4), 2289--2303.
[[http://doi.org/10.1534/genetics.107.084285]{.underline}](http://doi.org/10.1534/genetics.107.084285)

Gianola D, Okut H, Weigel KA, Rosa GJ. Predicting complex quantitative
traits with Bayesian neural networks: a case study with Jersey cows and
wheat. BMC Genetics. 2011;12:87. doi:10.1186/1471-2156-12-87.

Habier, DRL Fernando RL, Kizilkaya K and Garrick DJ(2011) Extension of
the bayesian alphabet for genomic selection. BMC Bioinformatics2011.
12:186.

Jiang, Y., & Reif, J. C. (2015). Modeling Epistasis in Genomic
Selection. Genetics, 201(2), 759--768.
http://doi.org/10.1534/genetics.115.177907

Meuwissen THE, Hayes B, Goddard ME (2001) Prediction of total genetic
value using genome-wide dense marker maps; Genetics 157:1819-1829

Pérez, P., & de los Campos, G. (2014). Genome-Wide Regression and
Prediction with the BGLR Statistical Package. Genetics, 198(2),
483--495. http://doi.org/10.1534/genetics.114.164442

Rincent, R., D. Laloe, S. Nicolas, T. Altmann, D. Brunel, P. Revilla,
V., M Rodriguez J. Moreno-Gonzalez, A. Melchinger, E. Bauer, C.C.
Schoen, N. Meyer, C. Giauffret, C. Bauland, P. Jamin, J. Laborde, H.
Monod, P. Flament, A. Charcosset, and L. Moreau. 2012. Maximizing the
reliability of genomic selection by optimizing the calibration set of
reference individuals: Comparison of methods in two diverse groups of
maize inbreds (Zea may L.) Genetics 192:715--728
doi:10.1534/genetics.112.141473

Zou H and Hastie T (2005) Regularization and variable selection via the
elastic net. J. R. Statist. Soc. 67: 301--320


Helene Rimbert's avatar
Helene Rimbert committed
398
## <a name="bwgspredict"></a>  bwgs.predict   Breed Wheat Genomic Selection Prediction
Helene Rimbert's avatar
Helene Rimbert committed
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522

**Description**

The `bwgs.predict()` function computes the GEBV prediction for the target
population with only genotypic Data using the options for model
selection.

**Usage**

```r
bwgs.predict (geno\_train, pheno\_train, geno\_valid, MAXNA=0.2,MAF=0.05, geno.impute.method="NULL", geno.reduct.method="NULL",reduct.size="NULL", pval="NULL", r2="NULL", MAP="NULL",predict.method="GBLUP")
```
**Arguments**

+ geno_train

Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA.

+ pheno_train

Vector (n x 1) of phenotype for the training phenotypes. This vector should have no missing values. Otherwise, missing values (NA) will be omitted in both pheno\_train and geno_train.

+ geno_valid

Matrix (z x m) of genotypes for the target population: z lines with the same m markers as in geno_train. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA. Other arguments are identical to those of bwgs.cv, except pop_reduct_method, nTimes and nFolds, since the prediction is run only once, using the whole training population for model estimation, then applied to the target population.

+ MAXNA

The maximum proportion of missing value which is admitted for filtering marker columns in geno. Default value is 0.2

+ MAF

The minimum allele frequency for filtering marker colums in geno;  
default value is 0.05

+ random.pop.size

Allow randomly sampling a subset of individuals in the training set (both geno and pheno). Can be used for didactic purposes. Not useful in real life. Default value is NULL (all training set used) 

+ geno.impute.method

Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:  
  -   MNI: missing data are replaced by the mean allele frequency of the marker (column in geno)  
  -   EMI: missing data are replaced using an expectation-maximization methods described in function A.mat. Default value is NULL  

Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically  provided by GBS. More sophisticated software (e.g. Beagles) should be used before bwgs.

+ geno.reduct.method  

Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are:  
 -   RMR (with reduct.size): random sampling of markers (columns in geno)   
 -   LD (with r2 and MAP): enables "pruning" of markers which are in LD \> r2. Only the marker with the least missing value is kept for each pair in LD\>r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.  
 -   ANO (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only  markers with pvalue\<pval are kept for GEBV prediction  
 -   ANO+LD (with pval and r2, MAP is facultative) : combines a first step of maker selection with ANO, then a second step of pruning using LD option.

+ reduct.size

Specifies the number of markers for the genotypic reduction using RMR (reduct.size \< m).

+ pval

p value for ANO method, 0 \< pval \< 1.

+ r2

Coefficient of linkage disequilibrium (LD). Setting 0\<r2\<1 if the genotypic reduction method is in \{LD or ANO+LD \}.

+ MAP 

A file with markers in rows ane at least ONE columns with colnames= "chrom". Used for computing r2 within linkage groups.

+ predict.method

The options for genomic breeding value prediction methods. The available options are:  
 -   GBLUP: performs G-BLUP using a marker-based relationship matrix, implemented through BGLR R-library. Equivalent to  ridge regression (RR-BLUP) of marker effects.  
 -   EGBLUP: performs EG-BLUP, i.e. BLUP using a "squared" relationship matrix to model epistatic 2x2 interactions, as described by Jiang & Reif (2015), using BGLR library   
 -   RR: ridge regression, using package glmnet. In theory, strictly equivalent to gblup.  
 -   LASSO: Least Absolute Shrinkage and Selection Operator is another penalized regression methods which yield more shrinked estimates than RR. Run by glmnet library.  
 -   EN: Elastic Net \[Zou and Hastie, 2005\], which is a weighted combination of RR and LASSO, using glmnet library  
 
 Several Bayesian methods, using the BGLR library  
 -   RR: Bayesian ridge regression: same as rr-blup, but bayesian resolution. Induces homogeneous shrinkage of all markers effects towards zero with Gaussian distribution (de los Campos et al, 2013)  
 -   BL: Bayesian LASSO: uses an exponential prior on marker variances priors, leading to double exponential distribution of marker effects (Park & Casella 2008)  
 -   BA: Bayes A uses a scaled-t prior distribution of marker effects. (MEeuwissen 2001).  
 -   BB: Bayes B, uses a mixture of distribution with a point mass at zero and with a slab of non-zero marker effects with a scaled-t distribution (Habier et al 2011).  
 -   BC: Bayes C same as Bayes B with a slab with Gaussian distribution.  
 
A more detailed description of these methods can be found in Perez & de los Campos 2014 ([[http://genomics.cimmyt.org/BGLR-extdoc.pdf]{.underline}](http://genomics.cimmyt.org/BGLR-extdoc.pdf).)

Three semi-parametric methods  

 -   RKHS: reproductive kernel Hilbert space and multiple kernel MRKHS, using BGLR (Gianola and van Kaam 2008). Based on genetic distance and a kernel function to regulate the distribution of marker effects. This methods is claimed to be effective for detecting non additive effects.  
 -   RF: Random forest regression, using randomForest library \[Breimain, 2001\]. This methods uses regression models on tree nodes which are rooted in bootstrapping data. Supposed to be able to capture interactions between markers  
 -   SVM: support vector machine, run by e1071 library. For details, seeee Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsv  

**Value**

The object `bwgs.predict` returns Matrix of dimension nx3. Columns are:

+ Predict BV: the nx1 vector of GEBVs for the validation set (rows of geno_valid)  
+ gpredSD: Standart deviation of estimated GEBV  
+ CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/2g)   

Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS.  
These two columns contain NA for methods RF, RR, LASSO, EN and SVM.  

**Examples**

```r
Library (bwgs)

Data (inra)

testPREDICT_GBLUP=bwgs.predict(geno_train=geno_shrink0001,pheno_train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP")

testPREDICT_EGBLUP=bwgs.predict(geno_train=geno_shrink0001,pheno\train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="EGBLUP")
testPREDICT_BA=bwgs.predict(geno_train=geno_shrink0001,pheno_train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="BA")

# correlation between prediction method

cor
cbind((testPREDICT_GBLUP[,1],testPREDICT_EGBLUP[,1],testPREDICT_BA[,1]))
```

Helene Rimbert's avatar
Helene Rimbert committed
523
## <a name="chromld"></a> CHROMLD -  Marker selection by LD pruning within each chromosome
Helene Rimbert's avatar
Helene Rimbert committed
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570


**Description**

The CHROMLD computes a correlation matrix (r²) for all marker pairs, chromosome by chromosome. Then, in an iterative process, one marker for the pair with maximum r² is removed, until no r²\>R2seuil remains.

**Usage**

```r
CHROMLD (geno,R2seuil,MAP)
```

**Arguments**

+ geno

 Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or coded as NA.  
+ R2seuil

the threshold value for removing pairs of markers with r²\>R2seuil  

+ MAP

a matrix with at least one columns entitled "chrom"

**Value**

The object `bwgs.predict` returns Matrix of dimension nx3. Columns are:

+ Predict BV: the nx1 vector of GEBVs for the validation set (rows of  
+ gpredSD: Standart deviation of estimated GEBV  
+ CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/2g)  

Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.

**Examples**

```r
library(bwgs)

data(inra)

# Impute using EMI:

genoLD95 <- CHROMLD(geno47K, R2seuil=0.95, MAP)
```

Helene Rimbert's avatar
Helene Rimbert committed
571
## <a name="emi"></a>  EMI -  Expectation-Maximization Imputation
Helene Rimbert's avatar
Helene Rimbert committed
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627


**Description**

The `EMI()` function is used to impute the missing data points using  

Expectation-Maximization algorithm. It is an interative procedure in which it uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization). Uses A.mat function of library rrBLUP (Endelman and Janninck 2012).  

Note that this method has been developed to impute missing values that are evenly distributed in the dataset, as typically produced by GBS (genotyping by sequencing e.g. Poland et al 2012). They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.  

More sophisticated software (e.g. Beagles, Browning and Browning 2016 ) should be used before bwgs.

**Usage**

```r
EMI (geno)
```

**Arguments**

+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}  
The missing data are allowed or coded as NA.

**Value**

EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "EM".

**Examples**

```r
library(bwgs)

data(inra)

# Impute using EMI:

geno_EMI <- EMI(geno47K)
```

**References**

Browning, B. L., & Browning, S. R. (2016). Genotype Imputation with
Millions of Reference Samples. American Journal of Human Genetics,
98(1), 116--126. http://doi.org/10.1016/j.ajhg.2015.11.020

Endelman, J. B., & Jannink, J.-L. (2012). Shrinkage Estimation of the
Realized Relationship Matrix. G3: Genes\|Genomes\|Genetics, 2(11),
1405--1413. http://doi.org/10.1534/g3.112.004259

Poland, J., Endelman, J.B., Dawson, J., Rutkoski, J.E., Wu, S., Manès,
Y., Dreisigacker, S., Crossa, J., Sanchez-Villeda, H., Sorrells, M.E., &
Jannink, J. (2012). Genomic Selection in Wheat Breeding using
Genotyping-by-Sequencing.

Helene Rimbert's avatar
Helene Rimbert committed
628
## <a name="inra"></a> inra -  data from INRA breeding data, genotype and phenotype
Helene Rimbert's avatar
Helene Rimbert committed
629
630
631
632


**Description**

633
inra data contains a set of geno47K(760 x 47839), pheno (760 x 1) and MAP47K (47839 x 3). The phenotype pheno contains adjusted genotype means for yield trait (YLD) over multi-year/location trials.  
Helene Rimbert's avatar
Helene Rimbert committed
634
635
636
637
638
639
640

**Usage**

```r
data(inra)
```

Helene Rimbert's avatar
Helene Rimbert committed
641
## <a name="mni"></a>  MNI -  MeaN Impute
Helene Rimbert's avatar
Helene Rimbert committed
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677


**Description**

The `MNI()` function replaces each missing data point by the mean of all non-missing data points in the same column.  

Note that this imputation method is only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.  

More sophisticated software (e.g. Beagles) should be used before BWGS  

**Usage**

```r
MNI(geno)
```

**Arguments**

+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or coded as NA.

**Value**

EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "mean".

**Examples**

```r
library(bwgs)

data(inra)

geno_MNI <- MNI(geno47K)
```

Helene Rimbert's avatar
Helene Rimbert committed
678
## <a name="optitrain"></a> optiTRAIN -  Optimization of TRAINING set by CDmean
Helene Rimbert's avatar
Helene Rimbert committed
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726

**Description**

The `optiTRAIN()` function select a subset of lines using an iterative sampling algorithm to maximize the its predictive ability, tested through cross-validation. Methods is based on the CD-mean method of Rincent et al (2012)

**Usage**

```r
optiTRAIN=function(geno, NSample=100,Nopti=1000)
```

**Arguments**

+ Geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or  coded as NA.

+ Nsample

Size of desired subsample

+ Nopti

Number of iterations to achieve optimized sample. Similar and simulated annealing. Recommended value \>1000. Lead to long computing time

**Value**

EMI returns a list with:

 -   CDmax is the maximum CD after optimization alrorith has run out.  
 -   samplOptimiz is the list of the Nsample individuals which led to CDmax according to Rincent el al (2012)  
 -   genoOptimiz is a Nsample x m matrix of genotypes for the sampleOptimiz selected individuals.

**Examples**
```r
library(bwgs)

data(inra)

# 50% missing:

Train_opti300 <- optiTRAIN (geno47K, 300, 1000)
```

**Reference**

Rincent, R., D. Laloe, S. Nicolas, T. Altmann, D. Brunel, P. Revilla, V., M Rodriguez J. Moreno-Gonzalez, A. Melchinger, E. Bauer, C.C. Schoen, N. Meyer, C. Giauffret, C. Bauland, P. Jamin, J. Laborde, H. Monod, P. Flament, A. Charcosset, and L. Moreau. 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea may L.) Genetics 192:715--728 doi:10.1534/genetics.112.141473

Helene Rimbert's avatar
Helene Rimbert committed
727
## <a name="qtlsim"></a> qtlSIM -  Simulation of QTL
Helene Rimbert's avatar
Helene Rimbert committed
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776


**Description**

qtlSIM simulate quantitative trait loci on actual markers sampled from the genotypic data, with options of number of QTL and heritability value. NQTL markers are randomly sampled from the genotypic matrix, then given an effect drown from a Gaussian distribution with 0 mean and 1 variance. The quantitative trait is generated by summing the NQTL effects, then adding a random noise from a Gaussian distribution with 0 mean and variance equal to (V(QTLeffetcs) \* ((1-h2)/h2)

**Usage**

```r
qtlSIM (geno, NQLT, h2)
```

**Arguments**

+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed or coded as NA.

+ NQLT

Number of desired QTL

+ h2

Desired heritability value of the simulated quantitative trait 0 $leq$ h2 $leq$ 1

**Value**

qtlSIM returns a list with:

```r
result <- list(newSNP=X,pheno=QT, TBV=TBV, Effects=Effects, h2QTL=h2QTL )
```
 -   newSNP is a new geno matrix without makers used as QTL  
 -   pheno is the simulated phenotype.  
 -   Effects is the vector of simulated effects of the NQTL sampled markers  
 -   H2QTL is the heritability of individual QTLS, i.e. their squared Effects divided by Var (pheno)

**Examples**
```r
library(bwgs)

data(inra)

# QTL simulation:

TestQTL = qtlSIM (geno47K, NQLT=100, h2=0.3)
```

Helene Rimbert's avatar
Helene Rimbert committed
777
## <a name="rmr"></a>  RMR -  Random Marker Reconstruction
Helene Rimbert's avatar
Helene Rimbert committed
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
 

**Description**

The `RMR()` function selects randomly a number of markers in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of individuals.

**Usage**

```r
RMR (geno, N)
```

**Arguments**

+ geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} Missing data are allowed and coded as NA.

+  N

Number of random markers, 1 $leq$ N $leq$ m

**Value**

RMR returns a n x N genotypic matrix of N randomly selected markers


**Examples**
```r
library(bwgs)

data(inra)

# Select 5000 random markers:

Geno5K <- RMR(geno47K, 5000)
```

Helene Rimbert's avatar
Helene Rimbert committed
816
## <a name="rps"></a>  RPS -  Random Pop Size
Helene Rimbert's avatar
Helene Rimbert committed
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849

**Description**

The `RPS()` function selects randomly a number of lines (or individuals) in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of markers.

**Usage**
```r
RPS(geno, N)
```

**Arguments**

+  geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA.
  
+  N

Number of random lines, 1 $leq$ N $leq$ n

**Value**

RPS return a N x m genotypic matrix of N randomly selected individuals

**Examples**
```r
library(bwgs)

data(inra)

# Select 200 random lines:

Sample200 <- RPS(geno47K, 200)
850
```