---
Date: July 6, 2018
Title: BWGS - BreedWheat Genomic Selection pipeline
Version: 1.12
Date: 2018-07-06
Author:
- Gilles CHARMET
- Louis Gautier TRAN
Description: Package for Breed Wheat Genomic Selection pipeline
---
# Package « BWGS » for Windows/Linux
**Maintainer:**
**Depends:** R (>= 3.1.2)
**Imports:** BGLR, e1071, glmnet, randomForest, rrBLUP
**License:** [GPL(>=2)](./LICENSE.md)
**NeedsCompilation:** no
**Repository:** no
**Date of Publication:**
**Additional documents:**
+ [Tutorial](./BWGS_tutorial.pptx)
+ [Package Guide](./Package_BWGS_2.0_guide.docx)
## R topics documented:
0. [Tutorial](#tuto)
1. [AM - Additive relationship matrix](#am)
2. [ANO - Selectio of N markers with the lowest Pvalues in GWAS](#ano)
3. [bwgs.cv - BreedWheat Genomic Selection Cross Validation](#bwgscv)
4. [bwgs.predict - BreedWheat Genomic Selection Prediction](#bwgspredict)
5. [CHROMLD - Marker selection by LD pruning within each chromosome](#chromld)
6. [EMI - Expectation-Maximisation Imputation](#emi)
7. [inra - data from INRA breeding data, genotyping and phenotyping](#inra)
8. [MNI - MeaN Inpute](#mni)
9. [optiTRAIN - Optimization of Training set by CDmean](#optitrain)
10. [qtlSIM - Simulation of QTL](#qtlsim)
11. [RMR - Random Marker Recontruction](#rmr)
12. [RPS - Random Pop Size](#rps)
## Tutorial
```r
# script for BWGS tutorial
#YieldGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "gblup", nFolds=10, nTimes=10 )
#YieldLASSO <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "LASSO", nFolds=10, nTimes=10 )
#YieldBA <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "BA", nFolds=10, nTimes=10 )
#YieldRKHS <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "RKHS", nFolds=10, nTimes=10 )
#YieldEGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "EGBLUP", nFolds=10, nTimes=10 )
#compareM=cbind(YieldGBLUP$cv, YieldLASSO$cv, YieldBA$cv, YieldRKHS$cv, YieldEGBLUP$cv)
#colnames(compareM) = c("GBLUP","LASSO","BayesA","RKHS","EGBLUP")
#boxplot(compareM,xlab="Prediction method",ylab="predictive ability",main="Predictive ability of 5 methods. Yield with 47K markers")
#YieldGBLUP100 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=100, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 )
#YieldGBLUP300 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=300, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 )
#YieldGBLUP500 <-bwgs.cv (TRAIN47K, YieldBLUE, pop.reduct.method="RANDOM",sample.pop.size=500, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 )
#boxplot(cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv))
#CompareSize=cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv)
#colnames(CompareSize)=c("N=100","N=300","N=500","N=700")
#boxplot(CompareSize,xlab="Training POP size",ylab="Predictive avility",main="Effect of TRAINING POPULATION SIZE")
#testPREDICT_GBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP")
#testPREDICT_EGBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="EGBLUP")
#testPREDICT_LASSO=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="LASSO")
#testPREDICT_RKHS=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="RKHS")
#testPREDICT_BayesA=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="BA")
#ComparePRED=cbind(testPREDICT_GBLUP[,1] ,testPREDICT_BayesA[,1] ,testPREDICT_LASSO[,1], testPREDICT_RKHS[,1], testPREDICT_EGBLUP[,1])
#colnames(ComparePRED=)c("GBLEP","BauesA","LASSO","RKHS","EGBLUP")
#pairs(ComparePRED,lower.panel = panel.smooth,upper.panel = panel.cor,diag.panel=panel.hist)
TRAIN47K_NO_NA=MNI(TRAIN47K)
datasim03 <- qtlSIM (TRAIN47K_NO_NA, NQTL=100,h2=0.3)
datasim05 <- qtlSIM (TRAIN47K_NO_NA,NQTL=100,h2=0.5)
datasim08 <- qtlSIM(TRAIN47K_NO_NA,NQTL=100,h2=0.8)
cbind(rownames(datasim03$newSNP),names(datasim03$pheno))
SIM03 <- bwgs.cv (datasim03$newSNP, datasim03$pheno, geno.impute.method="MNI", predict.method ="gblup", nTimes=20, nFolds=5) #
SIM05 <- bwgs.cv (datasim05$newSNP, datasim05$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes=20, nFolds=5) #
SIM08 <- bwgs.cv (datasim08$newSNP, datasim08$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes20, nFolds=5) #
CompareH2=cbind (SIM03,SIM05,SIM08)
colnames(CompareH2)=c("h²=0.3","h²=0.5","h²=0.8")
boxplot(CompareH2,xlab="Simulated Trait heritability",ylab="Predictive avility",main="Effect of TRAIT heritability")
```
## AM - Additive relationship matrix
**Description**
The `AM()` function calculates the realized additive relationship matrix
with the A.mat function of library rrBLUP.
**Usage**
```r
AM(geno)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} or The missing data are allowed or coded as NA.|
**Value**
AM returns a n x n matrix of additive relationships matrix estimated by Endelman's formula (Endelman et al 2012)
**Examples**
```r
library (bwgs)
data (inra)
geno47K_AM <- AM(geno47K)
```
**References**
Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the
realized relationship matrix.
G3:Genes, Genomes, Genetics. 2:1405-1413. doi: 10.1534/g3.112.004259
## ANO - selection of N markers with the lowest Pvalue in GWAS
**Description**
The `ANO()` function tests every marker in turn with ONE-way ANOVA
**Usage**
```r
ANO (pheno , geno, pval)
```
**Arguments**
+ pheno
A vector of trait phenotype, possibly adjusted from other factors
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} or Missing data are allowed and coded as NA.
+ pval
The threshold value for selecting markers whose pvalue is \ bwgs.cv - Breed Wheat Genomic Selection Cross Validation
**Description**
The bwgs.cv function carries out cross-validation using genotypic and
phenotypic data from a reference population, with options for genotypic
matrix processing and genomic breeding value estimation.
**Usage**
```r
bwgs.cv (geno, pheno, MAXNA=0.2, MAF=0.05, pop.reduct.method="NULL",
sample.pop.size="NULL", geno.reduct.method="NULL",
reduct.marker.size="NULL", pval="NULL", r2="NULL", MAP="NULL",
geno.impute.method="NULL", predict.method="NULL", nFolds, nTimes)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded {-1, 0, 1}. Missing data are allowed and coded as NA.
+ pheno
Vector (n x 1) of "phenotypes", i.e. observations or pre-processed, corrected values. This vector should have no missing values, otherwise missing values (NA) will be omitted in both pheno and geno. In a first step, bwgs.cv checks whether rownames(geno) match with names(pheno). If not the case, the common elements (intersect) are selected in both geno and pheno for further analyses. If a MAP file is provided, the selected set of markers are also sorted out in MAP.
+ MAXNA
The maximum proportion of missing value which is admitted for filtering marker columns in geno. Default value is 0.2
+ MAF
The minimum allele frequency for filtering marker colums in geno;default value is 0.05
+ pop.reduct.method
Method for reducing the size of the training population. Can be used for teaching purposes, no real interest in real life if the entire population is already genotyped and phenotyped. Default value is NULL (all training set used).
Proposed methods are:
- __RANDOM__: a subset of sample.pop.size is randomly selected for training the model, and the unselected part of the population is used for validation. The process is repeated nFolds \* nTimes to have the same number of replicates than with cross-validation.
- __OPTI__: the optimization algorithm based on CDmean (Rincent et al 2012) to select a subset which maximizes average CD (coefficient of determination) in the validation set. Since the process is long and has some stochastic component, it is repeated only nTimes.
+ sample.pop.size
The size of the subset of individuals in the training set (both geno and pheno) selected by pop.reduct.method if not NULL.
+ geno.impute.method
Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:
- __MNI__: missing data are replaced by the mean allele frequency of the marker (column in geno)
- __EMI__: missing data are replaced using an expectation-maximization methods described in function A.mat (Endelman & Janninck 2012).
Default value is NULL
Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS. More sophisticated software (e.g. Beagles, Browning & Browning 2016) should be used before BWGS.
+ geno.reduct.method
Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are: |
- __RMR__ (with reduct.size): random sampling of markers (columns in geno)
- __LD__ (with r2 and MAP): enables "pruning" of markers which are in LD \> r2. Only the marker with the least missing value is kept for each pair in LD \>r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.
- __ANO__ (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only markers with pvalue \< pval are kept for GEBV prediction
- __ANO+LD__ (with pval and r2, MAP is facultative) : combines a first step of maker selection with ANO, then a second step of pruning using LD option.
+ reduct.marker.size
Specifies the number of markers for the genotypic reduction using RMR (reduct.size \< m).
+ pval
p value for ANO method, 0 \< pval \< 1.
+ r2
Coefficient of linkage disequilibrium (LD). Setting 0\ bwgs.predict Breed Wheat Genomic Selection Prediction
**Description**
The `bwgs.predict()` function computes the GEBV prediction for the target
population with only genotypic Data using the options for model
selection.
**Usage**
```r
bwgs.predict (geno\_train, pheno\_train, geno\_valid, MAXNA=0.2,MAF=0.05, geno.impute.method="NULL", geno.reduct.method="NULL",reduct.size="NULL", pval="NULL", r2="NULL", MAP="NULL",predict.method="GBLUP")
```
**Arguments**
+ geno_train
Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA.
+ pheno_train
Vector (n x 1) of phenotype for the training phenotypes. This vector should have no missing values. Otherwise, missing values (NA) will be omitted in both pheno\_train and geno_train.
+ geno_valid
Matrix (z x m) of genotypes for the target population: z lines with the same m markers as in geno_train. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA. Other arguments are identical to those of bwgs.cv, except pop_reduct_method, nTimes and nFolds, since the prediction is run only once, using the whole training population for model estimation, then applied to the target population.
+ MAXNA
The maximum proportion of missing value which is admitted for filtering marker columns in geno. Default value is 0.2
+ MAF
The minimum allele frequency for filtering marker colums in geno;
default value is 0.05
+ random.pop.size
Allow randomly sampling a subset of individuals in the training set (both geno and pheno). Can be used for didactic purposes. Not useful in real life. Default value is NULL (all training set used)
+ geno.impute.method
Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:
- MNI: missing data are replaced by the mean allele frequency of the marker (column in geno)
- EMI: missing data are replaced using an expectation-maximization methods described in function A.mat. Default value is NULL
Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS. More sophisticated software (e.g. Beagles) should be used before bwgs.
+ geno.reduct.method
Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are:
- RMR (with reduct.size): random sampling of markers (columns in geno)
- LD (with r2 and MAP): enables "pruning" of markers which are in LD \> r2. Only the marker with the least missing value is kept for each pair in LD\>r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.
- ANO (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only markers with pvalue\ CHROMLD - Marker selection by LD pruning within each chromosome
**Description**
The CHROMLD computes a correlation matrix (r²) for all marker pairs, chromosome by chromosome. Then, in an iterative process, one marker for the pair with maximum r² is removed, until no r²\>R2seuil remains.
**Usage**
```r
CHROMLD (geno,R2seuil,MAP)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or coded as NA.
+ R2seuil
the threshold value for removing pairs of markers with r²\>R2seuil
+ MAP
a matrix with at least one columns entitled "chrom"
**Value**
The object `bwgs.predict` returns Matrix of dimension nx3. Columns are:
+ Predict BV: the nx1 vector of GEBVs for the validation set (rows of
+ gpredSD: Standart deviation of estimated GEBV
+ CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/2g)
Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.
**Examples**
```r
library(bwgs)
data(inra)
# Impute using EMI:
genoLD95 <- CHROMLD(geno47K, R2seuil=0.95, MAP)
```
## EMI - Expectation-Maximization Imputation
**Description**
The `EMI()` function is used to impute the missing data points using
Expectation-Maximization algorithm. It is an interative procedure in which it uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization). Uses A.mat function of library rrBLUP (Endelman and Janninck 2012).
Note that this method has been developed to impute missing values that are evenly distributed in the dataset, as typically produced by GBS (genotyping by sequencing e.g. Poland et al 2012). They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.
More sophisticated software (e.g. Beagles, Browning and Browning 2016 ) should be used before bwgs.
**Usage**
```r
EMI (geno)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}
The missing data are allowed or coded as NA.
**Value**
EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "EM".
**Examples**
```r
library(bwgs)
data(inra)
# Impute using EMI:
geno_EMI <- EMI(geno47K)
```
**References**
Browning, B. L., & Browning, S. R. (2016). Genotype Imputation with
Millions of Reference Samples. American Journal of Human Genetics,
98(1), 116--126. http://doi.org/10.1016/j.ajhg.2015.11.020
Endelman, J. B., & Jannink, J.-L. (2012). Shrinkage Estimation of the
Realized Relationship Matrix. G3: Genes\|Genomes\|Genetics, 2(11),
1405--1413. http://doi.org/10.1534/g3.112.004259
Poland, J., Endelman, J.B., Dawson, J., Rutkoski, J.E., Wu, S., Manès,
Y., Dreisigacker, S., Crossa, J., Sanchez-Villeda, H., Sorrells, M.E., &
Jannink, J. (2012). Genomic Selection in Wheat Breeding using
Genotyping-by-Sequencing.
## inra - data from INRA breeding data, genotype and phenotype
**Description**
inra data contains a set of geno47K(760 x 47839), pheno (760 x 1) and MAP47K (47839 x 3). The phenotype pheno contains adjusted genotype means for yield trait (YLD) over multi-year/location trials.
**Usage**
```r
data(inra)
```
## MNI - MeaN Impute
**Description**
The `MNI()` function replaces each missing data point by the mean of all non-missing data points in the same column.
Note that this imputation method is only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.
More sophisticated software (e.g. Beagles) should be used before BWGS
**Usage**
```r
MNI(geno)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or coded as NA.
**Value**
EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "mean".
**Examples**
```r
library(bwgs)
data(inra)
geno_MNI <- MNI(geno47K)
```
## optiTRAIN - Optimization of TRAINING set by CDmean
**Description**
The `optiTRAIN()` function select a subset of lines using an iterative sampling algorithm to maximize the its predictive ability, tested through cross-validation. Methods is based on the CD-mean method of Rincent et al (2012)
**Usage**
```r
optiTRAIN=function(geno, NSample=100,Nopti=1000)
```
**Arguments**
+ Geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} The missing data are allowed or coded as NA.
+ Nsample
Size of desired subsample
+ Nopti
Number of iterations to achieve optimized sample. Similar and simulated annealing. Recommended value \>1000. Lead to long computing time
**Value**
EMI returns a list with:
- CDmax is the maximum CD after optimization alrorith has run out.
- samplOptimiz is the list of the Nsample individuals which led to CDmax according to Rincent el al (2012)
- genoOptimiz is a Nsample x m matrix of genotypes for the sampleOptimiz selected individuals.
**Examples**
```r
library(bwgs)
data(inra)
# 50% missing:
Train_opti300 <- optiTRAIN (geno47K, 300, 1000)
```
**Reference**
Rincent, R., D. Laloe, S. Nicolas, T. Altmann, D. Brunel, P. Revilla, V., M Rodriguez J. Moreno-Gonzalez, A. Melchinger, E. Bauer, C.C. Schoen, N. Meyer, C. Giauffret, C. Bauland, P. Jamin, J. Laborde, H. Monod, P. Flament, A. Charcosset, and L. Moreau. 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea may L.) Genetics 192:715--728 doi:10.1534/genetics.112.141473
## qtlSIM - Simulation of QTL
**Description**
qtlSIM simulate quantitative trait loci on actual markers sampled from the genotypic data, with options of number of QTL and heritability value. NQTL markers are randomly sampled from the genotypic matrix, then given an effect drown from a Gaussian distribution with 0 mean and 1 variance. The quantitative trait is generated by summing the NQTL effects, then adding a random noise from a Gaussian distribution with 0 mean and variance equal to (V(QTLeffetcs) \* ((1-h2)/h2)
**Usage**
```r
qtlSIM (geno, NQLT, h2)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed or coded as NA.
+ NQLT
Number of desired QTL
+ h2
Desired heritability value of the simulated quantitative trait 0 $leq$ h2 $leq$ 1
**Value**
qtlSIM returns a list with:
```r
result <- list(newSNP=X,pheno=QT, TBV=TBV, Effects=Effects, h2QTL=h2QTL )
```
- newSNP is a new geno matrix without makers used as QTL
- pheno is the simulated phenotype.
- Effects is the vector of simulated effects of the NQTL sampled markers
- H2QTL is the heritability of individual QTLS, i.e. their squared Effects divided by Var (pheno)
**Examples**
```r
library(bwgs)
data(inra)
# QTL simulation:
TestQTL = qtlSIM (geno47K, NQLT=100, h2=0.3)
```
## RMR - Random Marker Reconstruction
**Description**
The `RMR()` function selects randomly a number of markers in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of individuals.
**Usage**
```r
RMR (geno, N)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\} Missing data are allowed and coded as NA.
+ N
Number of random markers, 1 $leq$ N $leq$ m
**Value**
RMR returns a n x N genotypic matrix of N randomly selected markers
**Examples**
```r
library(bwgs)
data(inra)
# Select 5000 random markers:
Geno5K <- RMR(geno47K, 5000)
```
## RPS - Random Pop Size
**Description**
The `RPS()` function selects randomly a number of lines (or individuals) in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of markers.
**Usage**
```r
RPS(geno, N)
```
**Arguments**
+ geno
Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as \{-1, 0, 1, NA\}. Missing data are allowed and coded as NA.
+ N
Number of random lines, 1 $leq$ N $leq$ n
**Value**
RPS return a N x m genotypic matrix of N randomly selected individuals
**Examples**
```r
library(bwgs)
data(inra)
# Select 200 random lines:
Sample200 <- RPS(geno47K, 200)
```