Skip to content
Snippets Groups Projects
Date: July 6, 2018
Title: BWGS - BreedWheat Genomic Selection pipeline
Version: 1.12
Date: 2018-07-06
Author:
  - Gilles CHARMET
  - Louis Gautier TRAN
Description: Package for Breed Wheat Genomic Selection pipeline

Package «  BWGS » for Windows/Linux

Maintainer: gilles.charmet@inra.fr

Depends: R (>= 3.1.2)

Imports: BGLR, e1071, glmnet, randomForest, rrBLUP

License: GPL(>=2)

NeedsCompilation: no

Repository: no

Date of Publication:

Additional documents:

R topics documented:

  1. Tutorial
  2. AM - Additive relationship matrix
  3. ANO - Selectio of N markers with the lowest Pvalues in GWAS
  4. bwgs.cv - BreedWheat Genomic Selection Cross Validation
  5. bwgs.predict - BreedWheat Genomic Selection Prediction
  6. CHROMLD - Marker selection by LD pruning within each chromosome
  7. EMI - Expectation-Maximisation Imputation
  8. inra - data from INRA breeding data, genotyping and phenotyping
  9. MNI - MeaN Inpute
  10. optiTRAIN - Optimization of Training set by CDmean
  11. qtlSIM - Simulation of QTL
  12. RMR - Random Marker Recontruction
  13. RPS - Random Pop Size

Tutorial

# script for BWGS tutorial



#YieldGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "gblup", nFolds=10, nTimes=10 )
#YieldLASSO <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "LASSO", nFolds=10, nTimes=10 )
#YieldBA <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "BA", nFolds=10, nTimes=10 )
#YieldRKHS <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "RKHS", nFolds=10, nTimes=10 )
#YieldEGBLUP <-bwgs.cv (TRAIN47K, YieldBLUE, geno.impute.method="mni", predict.method= "EGBLUP", nFolds=10, nTimes=10 )

#compareM=cbind(YieldGBLUP$cv, YieldLASSO$cv, YieldBA$cv, YieldRKHS$cv, YieldEGBLUP$cv)
#colnames(compareM) = c("GBLUP","LASSO","BayesA","RKHS","EGBLUP")
#boxplot(compareM,xlab="Prediction method",ylab="predictive ability",main="Predictive ability of 5 methods. Yield with 47K markers")


#YieldGBLUP100 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=100, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 ) 
#YieldGBLUP300 <-bwgs.cv (TRAIN47K, YieldBLUE,pop.reduct.method="RANDOM", sample.pop.size=300, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 ) 
#YieldGBLUP500 <-bwgs.cv (TRAIN47K, YieldBLUE, pop.reduct.method="RANDOM",sample.pop.size=500, geno.impute.method="mni", predict.method="gblup", nFolds=10, nTimes=10 )
 
#boxplot(cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv))

#CompareSize=cbind(YieldGBLUP100$cv, YieldGBLUP300$cv, YieldGBLUP500$cv, YieldGBLUP$cv)
#colnames(CompareSize)=c("N=100","N=300","N=500","N=700")
#boxplot(CompareSize,xlab="Training POP size",ylab="Predictive avility",main="Effect of TRAINING POPULATION SIZE")

#testPREDICT_GBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP") 

#testPREDICT_EGBLUP=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="EGBLUP") 

#testPREDICT_LASSO=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="LASSO") 

#testPREDICT_RKHS=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="RKHS") 

#testPREDICT_BayesA=bwgs.predict(geno_train=TRAIN47K,pheno_train=YieldBLUE,geno_target=TARGET47K,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",
#MAP="NULL",geno.impute.method="MNI",predict.method="BA") 

#ComparePRED=cbind(testPREDICT_GBLUP[,1] ,testPREDICT_BayesA[,1] ,testPREDICT_LASSO[,1], testPREDICT_RKHS[,1], testPREDICT_EGBLUP[,1])
#colnames(ComparePRED=)c("GBLEP","BauesA","LASSO","RKHS","EGBLUP")
#pairs(ComparePRED,lower.panel = panel.smooth,upper.panel = panel.cor,diag.panel=panel.hist)

TRAIN47K_NO_NA=MNI(TRAIN47K)

datasim03 <- qtlSIM (TRAIN47K_NO_NA, NQTL=100,h2=0.3)
datasim05 <- qtlSIM (TRAIN47K_NO_NA,NQTL=100,h2=0.5)
datasim08 <- qtlSIM(TRAIN47K_NO_NA,NQTL=100,h2=0.8)

cbind(rownames(datasim03$newSNP),names(datasim03$pheno))

SIM03 <- bwgs.cv (datasim03$newSNP, datasim03$pheno, geno.impute.method="MNI", predict.method ="gblup", nTimes=20, nFolds=5) #
SIM05 <- bwgs.cv (datasim05$newSNP, datasim05$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes=20, nFolds=5) #
SIM08 <- bwgs.cv (datasim08$newSNP, datasim08$pheno, geno.impute.method="MNI", predict.method="gblup", nTimes20, nFolds=5) #
CompareH2=cbind (SIM03,SIM05,SIM08)

colnames(CompareH2)=c("h²=0.3","h²=0.5","h²=0.8")
boxplot(CompareH2,xlab="Simulated Trait heritability",ylab="Predictive avility",main="Effect of TRAIT heritability")

AM - Additive relationship matrix

Description

The AM() function calculates the realized additive relationship matrix with the A.mat function of library rrBLUP.

Usage

AM(geno)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} or The missing data are allowed or coded as NA.|

Value

AM returns a n x n matrix of additive relationships matrix estimated by Endelman's formula (Endelman et al 2012)

Examples

library (bwgs)

data (inra)
geno47K_AM <- AM(geno47K)

References

Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix.

G3:Genes, Genomes, Genetics. 2:1405-1413. doi: 10.1534/g3.112.004259

ANO - selection of N markers with the lowest Pvalue in GWAS

Description

The ANO() function tests every marker in turn with ONE-way ANOVA

Usage

ANO (pheno , geno, pval)

Arguments

  • pheno

A vector of trait phenotype, possibly adjusted from other factors

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} or Missing data are allowed and coded as NA.

  • pval

The threshold value for selecting markers whose pvalue is <pval.

Value

ANO returns a n x m' matrix of markers which are significantly associated to pheno at the declared pval. m' < m unless pval=1.

Examples

Library (bwgs)

Data (inra)
geno_impote <- MNI (geno47K)
geno_shrink001 <- ANO(pheno, geno_impute, pval=0.001)

bwgs.cv - Breed Wheat Genomic Selection Cross Validation

Description

The bwgs.cv function carries out cross-validation using genotypic and phenotypic data from a reference population, with options for genotypic matrix processing and genomic breeding value estimation.

Usage

bwgs.cv (geno, pheno, MAXNA=0.2, MAF=0.05, pop.reduct.method="NULL",
sample.pop.size="NULL", geno.reduct.method="NULL",
reduct.marker.size="NULL", pval="NULL", r2="NULL", MAP="NULL",
geno.impute.method="NULL", predict.method="NULL", nFolds, nTimes)

Arguments

  • geno

Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded {-1, 0, 1}. Missing data are allowed and coded as NA.

  • pheno

Vector (n x 1) of "phenotypes", i.e. observations or pre-processed, corrected values. This vector should have no missing values, otherwise missing values (NA) will be omitted in both pheno and geno. In a first step, bwgs.cv checks whether rownames(geno) match with names(pheno). If not the case, the common elements (intersect) are selected in both geno and pheno for further analyses. If a MAP file is provided, the selected set of markers are also sorted out in MAP.

  • MAXNA

The maximum proportion of missing value which is admitted for filtering marker columns in geno. Default value is 0.2

  • MAF

The minimum allele frequency for filtering marker colums in geno;default value is 0.05

  • pop.reduct.method

Method for reducing the size of the training population. Can be used for teaching purposes, no real interest in real life if the entire population is already genotyped and phenotyped. Default value is NULL (all training set used).
Proposed methods are:

  • RANDOM: a subset of sample.pop.size is randomly selected for training the model, and the unselected part of the population is used for validation. The process is repeated nFolds * nTimes to have the same number of replicates than with cross-validation.
  • OPTI: the optimization algorithm based on CDmean (Rincent et al 2012) to select a subset which maximizes average CD (coefficient of determination) in the validation set. Since the process is long and has some stochastic component, it is repeated only nTimes.
  • sample.pop.size

The size of the subset of individuals in the training set (both geno and pheno) selected by pop.reduct.method if not NULL.

  • geno.impute.method

Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:

  • MNI: missing data are replaced by the mean allele frequency of the marker (column in geno)
  • EMI: missing data are replaced using an expectation-maximization methods described in function A.mat (Endelman & Janninck 2012).
    Default value is NULL
    Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS. More sophisticated software (e.g. Beagles, Browning & Browning 2016) should be used before BWGS.
  • geno.reduct.method

Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are: |

  • RMR (with reduct.size): random sampling of markers (columns in geno)
  • LD (with r2 and MAP): enables "pruning" of markers which are in LD > r2. Only the marker with the least missing value is kept for each pair in LD >r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.
  • ANO (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only markers with pvalue < pval are kept for GEBV prediction
  • ANO+LD (with pval and r2, MAP is facultative) : combines a first step of maker selection with ANO, then a second step of pruning using LD option.
  • reduct.marker.size

Specifies the number of markers for the genotypic reduction using RMR (reduct.size < m).

  • pval

p value for ANO method, 0 < pval < 1.

  • r2

Coefficient of linkage disequilibrium (LD). Setting 0<r2<1 if the genotypic reduction method is in {LD or ANO+LD }.

  • MAP

A file with markers in rows and at least ONE columns with colnames= "chrom". Used for computing r2 within linkage groups.

  • predict.method

The options for genomic breeding value prediction methods. The available options are:

  • GBLUP: performs G-BLUP using a marker-based relationship matrix, implemented through BGLR R-library. Equivalent to ridge regression (RR-BLUP) of marker effects.
  • EGBLUP: performs EG-BLUP, i.e. BLUP using a "squared" relationship matrix to model epistatic 2x2 interactions, as described by Jiang & Reif (2015), using BGLR library
  • RR: ridge regression, using package glmnet. In theory, strictly equivalent to gblup.
  • LASSO: Least Absolute Shrinkage and Selection Operator is another penalized regression methods which yield more shrinked estimates than RR. Run by glmnet library.
  • EN: Elastic Net [Zou and Hastie, 2005], which is a weighted combination of RR and LASSO, using glmnet library

Several Bayesian methods, using the BGLR library:

  • BRR: Bayesian ridge regression: same as rr-blup, but bayesian resolution. Induces homogeneous shrinkage of all markers effects towards zero with Gaussian distribution (de los Campos et al, 2013)
  • BL: Bayesian LASSO: uses an exponential prior on marker variances priors, leading to double exponential distribution of marker effects (Park & Casella 2008)
  • BA: Bayes A uses a scaled-t prior distribution of marker effects. (Meuwissen et al 2001).
  • BB: Bayes B, uses a mixture of distribution with a point mass at zero and with a slab of non-zero marker effects with a scaled-t distribution (Habier et al 2011).
  • BC: Bayes C same as Bayes B with a slab with Gaussian distribution.

A more detailed description of these methods can be found in Perez & de los Campos 2014 ([http://genomics.cimmyt.org/BGLR-extdoc.pdf]{.underline}.)

Three semi-parametric methods:

  • RKHS: reproductive kernel Hilbert space and multiple kernel MRKHS, using BGLR (Gianola and van Kaam 2008). Based on genetic distance and a kernel function to regulate the distribution of marker effects. This methods is claimed to be effective for detecting non additive effects.
  • RF: Random forest regression, using randomForest library (Breiman, 2001, Breiman and Cutler 2013). This methods uses regression models on tree nodes which are rooted in bootstrapping data. Supposed to be able to capture interactions between markers
  • SVM: support vector machine, run by e1071 library. For details, see Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsvm
  • BRNN: Bayesian Regularization for feed-forward Neural Network, with the R-package BRNN (Gianola et al 2011). To keep computing time in reasonable limits, the parameters for the brnn function are neurons=2 and epochs = 20.
  • nFolds

Number of folds for the cross-validation. Smallest value recommended is nFolds = 3.

  • nTimes

Number of independent replicates for the cross-validation. Smallest value recommended is nTimes = 3.

Value

The class bwgs.cv returns a list containing:

list item Description
$summary Summary of cross-validation, including mean and standard deviation of predictive ability (i.e. correlation between phenotype and GEBV, estimated on the validation fold, then averaged over replicates (nTimes), Time taken by the computation and number of markers
$cv Vector of predictive abilities averaged over nFolds, for each of the nTimes replicates
$sd Standard deviation of the nTimes predictive abilities
$MSEP Square root of the mean-squared error of prediction, averaged over Ntimes
$SDMSEP Standard deviation of the Square root of the mean-squared error of prediction, averaged over Ntimes
$bv Matrix of dimension nx4. Columns are: - Real BV, i.e. pheno vector - Predict BV: the nx1 vector of GEBVs - gpreSD: Standart deviation of estimated GEBV - CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/σ^2^g) Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.

Examples

library(bwgs)
                                                                       
data(inra)
                                                                       
 #GPLUB with RMR (reduct.marker.size = 5000)
                                                                       
testGBLUPRMR5000=bwgs.cv(geno47K,pheno,random.pop.size="NULL",geno. reduct.method="RMR",reduct.size=5000, geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTimes= 50)
                                                                       
#RKHS with marker selection by ANOVA (pval=0.001)
                                                                       
testRKHSANO<bwgs.cv(geno47K,pheno,geno.reduct.method="ANO",pval=0.001,geno.impute.method="mni",predict.method="rkhs", nFolds=10, nTimes=50)
                                                                       
 #Boxplot to compare prediction methods
                                                                       
 boxplot(cbind(testGBLUPRMR3000$cv, testRKHSANO$cv))
                                                                       
 # sampling training population
                                                                       
 testBWGSRPS300=bwgs.cv(geno_shrink0001,pheno,MAXNA=0.2,MAF=0.05,pop. reduct.method="RANDOM",sample.pop.size=300,geno.reduct.method="NUL L",reduct.marker.size="NULL",pval="NULL",r2="NULL",MAP="NULL ",geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTime s=50)
                                                                       
 # optimizing a subset for training
                                                                       
 testBWGSOPT300=bwgs.cv(geno_shrink0001,pheno,MAXNA=0.2,MAF=0.05,pop. reduct.method="OPTI",sample.pop.size=300,geno.reduct.method="NULL ",reduct.marker.size="NULL",pval="NULL",r2="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP",nFolds=10,nTimes= 5)

References

Breiman, L. 2001. "Random Forests". Machine Learning 45 (1): 5--32.

Breiman L. and Cutler A. (2013) Breiman and Cutler's random forests for classification and regression.

Package 'randomForest'. http://stat-www.berkeley.edu/users/breiman/RandomForest

De los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., & Calus, M. P. L. (2013). Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics, 193(2), 327--345. [http://doi.org/10.1534/genetics.112.143313]{.underline}

Gianola, D., & van Kaam, J. B. C. H. M. (2008). Reproducing Kernel Hilbert Spaces Regression Methods for Genomic Assisted Prediction of Quantitative Traits. Genetics, 178(4), 2289--2303. [http://doi.org/10.1534/genetics.107.084285]{.underline}

Gianola D, Okut H, Weigel KA, Rosa GJ. Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics. 2011;12:87. doi:10.1186/1471-2156-12-87.

Habier, DRL Fernando RL, Kizilkaya K and Garrick DJ(2011) Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics2011. 12:186.

Jiang, Y., & Reif, J. C. (2015). Modeling Epistasis in Genomic Selection. Genetics, 201(2), 759--768. http://doi.org/10.1534/genetics.115.177907

Meuwissen THE, Hayes B, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps; Genetics 157:1819-1829

Pérez, P., & de los Campos, G. (2014). Genome-Wide Regression and Prediction with the BGLR Statistical Package. Genetics, 198(2), 483--495. http://doi.org/10.1534/genetics.114.164442

Rincent, R., D. Laloe, S. Nicolas, T. Altmann, D. Brunel, P. Revilla, V., M Rodriguez J. Moreno-Gonzalez, A. Melchinger, E. Bauer, C.C. Schoen, N. Meyer, C. Giauffret, C. Bauland, P. Jamin, J. Laborde, H. Monod, P. Flament, A. Charcosset, and L. Moreau. 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea may L.) Genetics 192:715--728 doi:10.1534/genetics.112.141473

Zou H and Hastie T (2005) Regularization and variable selection via the elastic net. J. R. Statist. Soc. 67: 301--320

bwgs.predict Breed Wheat Genomic Selection Prediction

Description

The bwgs.predict() function computes the GEBV prediction for the target population with only genotypic Data using the options for model selection.

Usage

bwgs.predict (geno\_train, pheno\_train, geno\_valid, MAXNA=0.2,MAF=0.05, geno.impute.method="NULL", geno.reduct.method="NULL",reduct.size="NULL", pval="NULL", r2="NULL", MAP="NULL",predict.method="GBLUP")

Arguments

  • geno_train

Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA}. Missing data are allowed and coded as NA.

  • pheno_train

Vector (n x 1) of phenotype for the training phenotypes. This vector should have no missing values. Otherwise, missing values (NA) will be omitted in both pheno_train and geno_train.

  • geno_valid

Matrix (z x m) of genotypes for the target population: z lines with the same m markers as in geno_train. Genotypes should be coded as {-1, 0, 1, NA}. Missing data are allowed and coded as NA. Other arguments are identical to those of bwgs.cv, except pop_reduct_method, nTimes and nFolds, since the prediction is run only once, using the whole training population for model estimation, then applied to the target population.

  • MAXNA

The maximum proportion of missing value which is admitted for filtering marker columns in geno. Default value is 0.2

  • MAF

The minimum allele frequency for filtering marker colums in geno;
default value is 0.05

  • random.pop.size

Allow randomly sampling a subset of individuals in the training set (both geno and pheno). Can be used for didactic purposes. Not useful in real life. Default value is NULL (all training set used)

  • geno.impute.method

Allow missing marker data imputation using the two methods proposed in function A.mat of package rrBLUP, namely:

  • MNI: missing data are replaced by the mean allele frequency of the marker (column in geno)
  • EMI: missing data are replaced using an expectation-maximization methods described in function A.mat. Default value is NULL

Note that these imputation methods are only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS. More sophisticated software (e.g. Beagles) should be used before bwgs.

  • geno.reduct.method

Allows sampling a subset of markers for speeding up computing time and/or avoid introducing more noise than informative markers. Options are:

  • RMR (with reduct.size): random sampling of markers (columns in geno)
  • LD (with r2 and MAP): enables "pruning" of markers which are in LD > r2. Only the marker with the least missing value is kept for each pair in LD>r2. To allow faster computation, r2 is estimated using chromosome by chromosome, so a MAP file is required.
  • ANO (with pval): one-way ANOVA are carried out with function lm on pheno with one marker at a time, and only markers with pvalue<pval are kept for GEBV prediction
  • ANO+LD (with pval and r2, MAP is facultative) : combines a first step of maker selection with ANO, then a second step of pruning using LD option.
  • reduct.size

Specifies the number of markers for the genotypic reduction using RMR (reduct.size < m).

  • pval

p value for ANO method, 0 < pval < 1.

  • r2

Coefficient of linkage disequilibrium (LD). Setting 0<r2<1 if the genotypic reduction method is in {LD or ANO+LD }.

  • MAP

A file with markers in rows ane at least ONE columns with colnames= "chrom". Used for computing r2 within linkage groups.

  • predict.method

The options for genomic breeding value prediction methods. The available options are:

  • GBLUP: performs G-BLUP using a marker-based relationship matrix, implemented through BGLR R-library. Equivalent to ridge regression (RR-BLUP) of marker effects.
  • EGBLUP: performs EG-BLUP, i.e. BLUP using a "squared" relationship matrix to model epistatic 2x2 interactions, as described by Jiang & Reif (2015), using BGLR library
  • RR: ridge regression, using package glmnet. In theory, strictly equivalent to gblup.
  • LASSO: Least Absolute Shrinkage and Selection Operator is another penalized regression methods which yield more shrinked estimates than RR. Run by glmnet library.
  • EN: Elastic Net [Zou and Hastie, 2005], which is a weighted combination of RR and LASSO, using glmnet library

Several Bayesian methods, using the BGLR library

  • RR: Bayesian ridge regression: same as rr-blup, but bayesian resolution. Induces homogeneous shrinkage of all markers effects towards zero with Gaussian distribution (de los Campos et al, 2013)
  • BL: Bayesian LASSO: uses an exponential prior on marker variances priors, leading to double exponential distribution of marker effects (Park & Casella 2008)
  • BA: Bayes A uses a scaled-t prior distribution of marker effects. (MEeuwissen 2001).
  • BB: Bayes B, uses a mixture of distribution with a point mass at zero and with a slab of non-zero marker effects with a scaled-t distribution (Habier et al 2011).
  • BC: Bayes C same as Bayes B with a slab with Gaussian distribution.

A more detailed description of these methods can be found in Perez & de los Campos 2014 ([http://genomics.cimmyt.org/BGLR-extdoc.pdf]{.underline}.)

Three semi-parametric methods

  • RKHS: reproductive kernel Hilbert space and multiple kernel MRKHS, using BGLR (Gianola and van Kaam 2008). Based on genetic distance and a kernel function to regulate the distribution of marker effects. This methods is claimed to be effective for detecting non additive effects.
  • RF: Random forest regression, using randomForest library [Breimain, 2001]. This methods uses regression models on tree nodes which are rooted in bootstrapping data. Supposed to be able to capture interactions between markers
  • SVM: support vector machine, run by e1071 library. For details, seeee Chang, Chih-Chung and Lin, Chih-Jen: LIBSVM: a library for Support Vector Machines http://www.csie.ntu.edu.tw/~cjlin/libsv

Value

The object bwgs.predict returns Matrix of dimension nx3. Columns are:

  • Predict BV: the nx1 vector of GEBVs for the validation set (rows of geno_valid)
  • gpredSD: Standart deviation of estimated GEBV
  • CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/2g)

Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS.
These two columns contain NA for methods RF, RR, LASSO, EN and SVM.

Examples

Library (bwgs)

Data (inra)

testPREDICT_GBLUP=bwgs.predict(geno_train=geno_shrink0001,pheno_train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="GBLUP")

testPREDICT_EGBLUP=bwgs.predict(geno_train=geno_shrink0001,pheno\train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="EGBLUP")
testPREDICT_BA=bwgs.predict(geno_train=geno_shrink0001,pheno_train=pheno,geno_target=XTARGET,MAXNA=0.2,MAF=0.05,geno.reduct.method="NULL",reduct.size="NULL",r2="NULL",pval="NULL",MAP="NULL",geno.impute.method="MNI",predict.method="BA")

# correlation between prediction method

cor
cbind((testPREDICT_GBLUP[,1],testPREDICT_EGBLUP[,1],testPREDICT_BA[,1]))

CHROMLD - Marker selection by LD pruning within each chromosome

Description

The CHROMLD computes a correlation matrix (r²) for all marker pairs, chromosome by chromosome. Then, in an iterative process, one marker for the pair with maximum r² is removed, until no r²>R2seuil remains.

Usage

CHROMLD (geno,R2seuil,MAP)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} The missing data are allowed or coded as NA.

  • R2seuil

the threshold value for removing pairs of markers with r²>R2seuil

  • MAP

a matrix with at least one columns entitled "chrom"

Value

The object bwgs.predict returns Matrix of dimension nx3. Columns are:

  • Predict BV: the nx1 vector of GEBVs for the validation set (rows of
  • gpredSD: Standart deviation of estimated GEBV
  • CD: coefficient of determination for each GEBV, estimated as sqrt ((1-stdev(GEBVi))²/2g)

Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA,BB,BC,BL,RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.

Examples

library(bwgs)

data(inra)

# Impute using EMI:

genoLD95 <- CHROMLD(geno47K, R2seuil=0.95, MAP)

EMI - Expectation-Maximization Imputation

Description

The EMI() function is used to impute the missing data points using

Expectation-Maximization algorithm. It is an interative procedure in which it uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization). Uses A.mat function of library rrBLUP (Endelman and Janninck 2012).

Note that this method has been developed to impute missing values that are evenly distributed in the dataset, as typically produced by GBS (genotyping by sequencing e.g. Poland et al 2012). They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.

More sophisticated software (e.g. Beagles, Browning and Browning 2016 ) should be used before bwgs.

Usage

EMI (geno)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA}
The missing data are allowed or coded as NA.

Value

EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "EM".

Examples

library(bwgs)

data(inra)

# Impute using EMI:

geno_EMI <- EMI(geno47K)

References

Browning, B. L., & Browning, S. R. (2016). Genotype Imputation with Millions of Reference Samples. American Journal of Human Genetics, 98(1), 116--126. http://doi.org/10.1016/j.ajhg.2015.11.020

Endelman, J. B., & Jannink, J.-L. (2012). Shrinkage Estimation of the Realized Relationship Matrix. G3: Genes|Genomes|Genetics, 2(11), 1405--1413. http://doi.org/10.1534/g3.112.004259

Poland, J., Endelman, J.B., Dawson, J., Rutkoski, J.E., Wu, S., Manès, Y., Dreisigacker, S., Crossa, J., Sanchez-Villeda, H., Sorrells, M.E., & Jannink, J. (2012). Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing.

inra - data from INRA breeding data, genotype and phenotype

Description

inra data contains a set of geno47K(760 x 47839), pheno (760 x 1) and MAP47K (47839 x 3). The phenotype pheno contains adjusted genotype means for yield trait (YLD) over multi-year/location trials.

Usage

data(inra)

MNI - MeaN Impute

Description

The MNI() function replaces each missing data point by the mean of all non-missing data points in the same column.

Note that this imputation method is only suited when there are a few missing value, typically in marker data from SNP chips of KasPAR. They are NOT suited for imputing marker data from low density to high density designs, and when there are MANY missing Data as typically provided by GBS.

More sophisticated software (e.g. Beagles) should be used before BWGS

Usage

MNI(geno)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} The missing data are allowed or coded as NA.

Value

EMI returns a n x m matrix with missing marker data imputed by the A.mat function of package rrBLUP with impute.method = "mean".

Examples

library(bwgs)

data(inra)

geno_MNI <- MNI(geno47K)

optiTRAIN - Optimization of TRAINING set by CDmean

Description

The optiTRAIN() function select a subset of lines using an iterative sampling algorithm to maximize the its predictive ability, tested through cross-validation. Methods is based on the CD-mean method of Rincent et al (2012)

Usage

optiTRAIN=function(geno, NSample=100,Nopti=1000)

Arguments

  • Geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} The missing data are allowed or coded as NA.

  • Nsample

Size of desired subsample

  • Nopti

Number of iterations to achieve optimized sample. Similar and simulated annealing. Recommended value >1000. Lead to long computing time

Value

EMI returns a list with:

  • CDmax is the maximum CD after optimization alrorith has run out.
  • samplOptimiz is the list of the Nsample individuals which led to CDmax according to Rincent el al (2012)
  • genoOptimiz is a Nsample x m matrix of genotypes for the sampleOptimiz selected individuals.

Examples

library(bwgs)

data(inra)

# 50% missing:

Train_opti300 <- optiTRAIN (geno47K, 300, 1000)

Reference

Rincent, R., D. Laloe, S. Nicolas, T. Altmann, D. Brunel, P. Revilla, V., M Rodriguez J. Moreno-Gonzalez, A. Melchinger, E. Bauer, C.C. Schoen, N. Meyer, C. Giauffret, C. Bauland, P. Jamin, J. Laborde, H. Monod, P. Flament, A. Charcosset, and L. Moreau. 2012. Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea may L.) Genetics 192:715--728 doi:10.1534/genetics.112.141473

qtlSIM - Simulation of QTL

Description

qtlSIM simulate quantitative trait loci on actual markers sampled from the genotypic data, with options of number of QTL and heritability value. NQTL markers are randomly sampled from the genotypic matrix, then given an effect drown from a Gaussian distribution with 0 mean and 1 variance. The quantitative trait is generated by summing the NQTL effects, then adding a random noise from a Gaussian distribution with 0 mean and variance equal to (V(QTLeffetcs) * ((1-h2)/h2)

Usage

qtlSIM (geno, NQLT, h2)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA}. Missing data are allowed or coded as NA.

  • NQLT

Number of desired QTL

  • h2

Desired heritability value of the simulated quantitative trait 0

leqleq
h2
leqleq
1

Value

qtlSIM returns a list with:

result <- list(newSNP=X,pheno=QT, TBV=TBV, Effects=Effects, h2QTL=h2QTL )
  • newSNP is a new geno matrix without makers used as QTL
  • pheno is the simulated phenotype.
  • Effects is the vector of simulated effects of the NQTL sampled markers
  • H2QTL is the heritability of individual QTLS, i.e. their squared Effects divided by Var (pheno)

Examples

library(bwgs)

data(inra)

# QTL simulation:

TestQTL = qtlSIM (geno47K, NQLT=100, h2=0.3)

RMR - Random Marker Reconstruction

Description

The RMR() function selects randomly a number of markers in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of individuals.

Usage

RMR (geno, N)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA} Missing data are allowed and coded as NA.

  • N

Number of random markers, 1

leqleq
N
leqleq
m

Value

RMR returns a n x N genotypic matrix of N randomly selected markers

Examples

library(bwgs)

data(inra)

# Select 5000 random markers:

Geno5K <- RMR(geno47K, 5000)

RPS - Random Pop Size

Description

The RPS() function selects randomly a number of lines (or individuals) in the original genotypic matrix to generate a new genotypic matrix with keeping the orders and names of markers.

Usage

RPS(geno, N)

Arguments

  • geno

Matrix (n x m) of genotypes: n lines with m markers. Genotypes should be coded as {-1, 0, 1, NA}. Missing data are allowed and coded as NA.

  • N

Number of random lines, 1

leqleq
N
leqleq
n

Value

RPS return a N x m genotypic matrix of N randomly selected individuals

Examples

library(bwgs)

data(inra)

# Select 200 random lines:

Sample200 <- RPS(geno47K, 200)