This repository contains the condensed workflow of the ML_APT team for the [Genomes to Fields (G2F) Genotype by Environment Prediction Competition (2022)](https://www.maizegxeprediction2022.org/). The goal was to predict corn yield in 2022 at different locations in the U.S.A. and Canada, based on corn yields from 2014 to 2021. Alongside the yields, we also used genomic data from the hybrids as well as environmental data (weather, soil, and metadata).
To use this repository and predict the yields on your own computer, you first need to store the data from the G2F challenge in the right places.
Copy the following files into the `Training_Data` and `Testing_Data` folders according to this table. Files marked as unused are optional.
Data | `Training_Data` | `Testing_Data`
-----|-----------------|---------------
1 - Hybrid x Location | `1_Training_Trait_Data_2014_2021.csv` | `1_Submission_Template_2022.csv`
2 - Meta Data | `2_Training_Meta_Data_2014_2021.csv` | `2_Testing_Meta_Data_2022.csv`
3 - Soil Data | `3_Training_Soil_Data_2015_2021.csv` | `3_Testing_Soil_Data_2022.csv`
4 - Weather Data | `4_Training_Weather_Data_2014_2021.csv` | `4_Testing_Weather_Data_2022.csv`
5 - Genomic Data | `5_Genotype_Data_All_Years.vcf` |
6 - EC Data (unused) | `6_Training_EC_Data_2014_2021.csv` | `6_Testing_EC_Data_2022.csv`
7 - Hybrid names (unused) | `All_hybrid_names_info.csv` |
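
Before rendering, it can help to confirm that the required files from the table are in place. The following is a minimal sketch, not part of the workflow itself: `find_missing_inputs()` is a hypothetical helper, and it assumes that `Training_Data` and `Testing_Data` sit directly under the directory passed as `root`.

```r
# Required inputs, taken from the table above (the unused EC and hybrid-name
# files are deliberately left out).
required_files <- list(
  Training_Data = c(
    "1_Training_Trait_Data_2014_2021.csv",
    "2_Training_Meta_Data_2014_2021.csv",
    "3_Training_Soil_Data_2015_2021.csv",
    "4_Training_Weather_Data_2014_2021.csv",
    "5_Genotype_Data_All_Years.vcf"
  ),
  Testing_Data = c(
    "1_Submission_Template_2022.csv",
    "2_Testing_Meta_Data_2022.csv",
    "3_Testing_Soil_Data_2022.csv",
    "4_Testing_Weather_Data_2022.csv"
  )
)

# Hypothetical helper: returns the relative paths of the required files that
# are missing under `root` (assumed here to be the project root).
find_missing_inputs <- function(root = ".") {
  paths <- unlist(lapply(names(required_files), function(folder) {
    file.path(folder, required_files[[folder]])
  }))
  paths[!file.exists(file.path(root, paths))]
}

missing_inputs <- find_missing_inputs()
if (length(missing_inputs) > 0) {
  message("Missing input files:\n", paste0("  ", missing_inputs, collapse = "\n"))
} else {
  message("All required input files are in place.")
}
```

If the check reports missing files, copy them into the matching folder before going further.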
To perform the analysis, render the file `Simplified_Workflow/Main.qmd`. You can do this with the Render option in RStudio or by running this line from within the project (`g2f_challenge.Rproj`):
```r
quarto::quarto_render("Simplified_Workflow/Main.qmd")
```
### Machine Capacity
Because the data files are large and R has difficulty freeing memory, you may run out of RAM. In that case, you can still run the R scripts called by the workflow one by one, as shown below, and clean the environment in between if needed:
```r
# Set up the environment
library(SpATS)
library(emmeans)
library(lme4)
library(data.table)
library(tidyverse)
library(future)
library(purrr)
library(furrr)
library(tictoc)
library(lubridate)
library(rsample)
library(randomForest)
library(randomForestExplainer)
library(FactoMineR)
library(factoextra)
library(glmnet)
# Create the output directories if they do not exist yet
if (!dir.exists("../Filtered_Training_Data")) {
  dir.create(path = "../Filtered_Training_Data")
}
if (!dir.exists("../Filtered_Testing_Data")) {
  dir.create(path = "../Filtered_Testing_Data")
}
if (!dir.exists("../Results")) {
  dir.create(path = "../Results")
}
```
```r
source("Simplified_Workflow/frequences_alleliques.R")
source("Simplified_Workflow/GenomicPCA.R")
source("Simplified_Workflow/AnalysePheno.R")
source("Simplified_Workflow/Meta_Data_Treatment_Theo.R")
source("Simplified_Workflow/script_get_pheno.R")
source("Simplified_Workflow/main_merging_meta_soil.R")
source("Simplified_Workflow/Weather_data_transformation.R")
source("Simplified_Workflow/FlowringDateAnalysis.R")
source("Simplified_Workflow/PredictPlantingDates.R")
source("Simplified_Workflow/Weather_by_environnement_analysis.R")
source("Simplified_Workflow/IterativeRF.R")
```
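
One way to clean up between steps is to source each script in its own throwaway environment and then trigger garbage collection. This is only a sketch under a strong assumption: that each script reads its inputs from disk and writes its outputs to disk, rather than relying on objects left in memory by the previous script. If that does not hold for a given step, keep sourcing that script in the global environment. `run_isolated()` is a hypothetical helper, not a function from this repository.

```r
# Hypothetical helper: source each script in a fresh environment so the objects
# it creates are dropped once it finishes, then ask R to release unused memory.
run_isolated <- function(scripts) {
  for (script in scripts) {
    message("Running ", script)
    # Packages attached with library() stay available because the fresh
    # environment inherits from the global environment.
    source(script, local = new.env(parent = globalenv()))
    gc()
  }
  invisible(scripts)
}

# Example call (run from the project root, after the setup block above):
# run_isolated(c(
#   "Simplified_Workflow/frequences_alleliques.R",
#   "Simplified_Workflow/GenomicPCA.R"
# ))
```

A simpler alternative is to call `rm(list = ls())` followed by `gc()` by hand between two `source()` calls.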
Running the workflow produces the following files:
1. `Results/MSE_iterative_random_forest.csv`: MSE from the cross-validation test, used to select the final result
2. `Results/Prediction_l.csv`: prediction based only on locations (mean by location)
3. `Results/Prediction_le.csv`: random forest based on location means and environmental data
4. `Results/Prediction_lg.csv`: random forest based on location means and genomic data
5. `Results/Prediction_leg.csv`: random forest based on location means, environmental and genomic data
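
Selecting a result from cross-validation MSE can be illustrated on toy numbers. The sketch below assumes that selection means keeping the variant with the lowest MSE. All values are invented for illustration and are not results from the challenge. The names `l`, `le`, `lg` and `leg` only mirror the file suffixes above, and nothing is assumed about the column layout of `MSE_iterative_random_forest.csv`.

```r
# Mean squared error between observed and predicted yields.
mse <- function(observed, predicted) mean((observed - predicted)^2)

# Toy values, invented purely for illustration (not challenge data).
observed <- c(10, 12, 9, 11)
predictions <- list(
  l   = c(11, 11, 11, 11),
  le  = c(10, 11, 10, 11),
  lg  = c(11, 12, 10, 12),
  leg = c(10, 12, 10, 11)
)

# One MSE per prediction variant, then keep the variant with the lowest error.
cv_mse <- sapply(predictions, mse, observed = observed)
best_model <- names(which.min(cv_mse))

print(cv_mse)
message("Variant with the lowest toy MSE: ", best_model)
```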