# G2F Challenge
## Objectives

This repository contains the condensed workflow of the ML_APT team for the [Genomes to Fields (G2F) Genotype by Environment Prediction Competition (2022)](https://www.maizegxeprediction2022.org/). The goal was to predict corn yield in 2022 at different locations in the U.S.A. and Canada, based on corn yields from 2014 to 2021. Along with the yields, we also used genomic data from the hybrids as well as environmental data (weather, soil, and metadata).
## How to use our workflow
### Clone the repository

### Copy the datasets
To use this repository and predict the yields on your computer, you first need to store the G2F challenge data in the right places.
Copy the following files into the `Training_Data` and `Testing_Data` folders according to this table. Files marked as unused are optional.

Folder | `Training_Data` | `Testing_Data` 
-------|-----------------|----------------------
1 - Hybrid x Location | `1_Training_Trait_Data_2014_2021.csv` | `1_Submission_Template_2022.csv`
2 - Meta Data | `2_Training_Meta_Data_2014_2021.csv` | `2_Testing_Meta_Data_2022.csv`
3 - Soil Data | `3_Training_Soil_Data_2015_2021.csv` | `3_Testing_Soil_Data_2022.csv`
4 - Weather Data | `4_Training_Weather_Data_2014_2021.csv` | `4_Testing_Weather_Data_2022.csv`
5 - Genomic Data | `5_Genotype_Data_All_Years.vcf` | 
6 - EC Data (unused) |`6_Training_EC_Data_2014_2021.csv` | `6_Testing_EC_Data_2022.csv`
7 - Hybrid names (unused) | `All_hybrid_names_info.csv`|
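
Before rendering the workflow, it can help to check that the required files from the table are in place. The snippet below is a minimal sketch, not part of the team's scripts: `find_missing_files` and `required_files` are hypothetical names, and `base_dir` is assumed to be the folder that holds `Training_Data` and `Testing_Data`.

```r
# Required input files, taken from the table above (unused files left out).
required_files <- list(
  Training_Data = c(
    "1_Training_Trait_Data_2014_2021.csv",
    "2_Training_Meta_Data_2014_2021.csv",
    "3_Training_Soil_Data_2015_2021.csv",
    "4_Training_Weather_Data_2014_2021.csv",
    "5_Genotype_Data_All_Years.vcf"
  ),
  Testing_Data = c(
    "1_Submission_Template_2022.csv",
    "2_Testing_Meta_Data_2022.csv",
    "3_Testing_Soil_Data_2022.csv",
    "4_Testing_Weather_Data_2022.csv"
  )
)

# Hypothetical helper: returns the paths of the required files that are missing.
# Assumption: `base_dir` contains the `Training_Data` and `Testing_Data` folders.
find_missing_files <- function(base_dir = ".", files = required_files) {
  paths <- unlist(lapply(names(files), function(folder) {
    file.path(base_dir, folder, files[[folder]])
  }))
  paths[!file.exists(paths)]
}

missing_files <- find_missing_files()
if (length(missing_files) > 0) {
  message("Missing input files:\n", paste(" -", missing_files, collapse = "\n"))
} else {
  message("All required input files are in place.")
}
```

If your data folders live elsewhere, pass that location as `base_dir`.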

### Run the complete workflow
To perform the analysis, render the file `Simplified_Workflow/Main.qmd`. You can do this with the RStudio Render option or by running this line from within the project (`g2f_challenge.Rproj`):
```r
quarto::quarto_render("Simplified_Workflow/Main.qmd")
```

### Machine Capacity 

Because the input files are large and R has difficulty freeing memory, you may run into RAM issues. In that case, you can run the R scripts called by the workflow one by one, as shown here, and clean the environment in between if needed:

1. **Main Settings**

```r
# Setting environment
library(SpATS)
library(emmeans)
library(lme4)
library(data.table)
library(tidyverse)
library(future)
library(purrr)
library(furrr)
library(tictoc)
library(lubridate)
library(rsample)
library(randomForest)
library(randomForestExplainer)
library(FactoMineR)
library(factoextra)
library(glmnet)

# Setting directories
if (!dir.exists("../Filtered_Training_Data")) {
  dir.create(path = "../Filtered_Training_Data")
}
if (!dir.exists("../Filtered_Testing_Data")) {
  dir.create(path = "../Filtered_Testing_Data")
}
if (!dir.exists("../Results")) {
  dir.create(path = "../Results")
}
```
2. **Run files**

```r
source("Simplified_Workflow/frequences_alleliques.R")
source("Simplified_Workflow/GenomicPCA.R")
source("Simplified_Workflow/AnalysePheno.R")
source("Simplified_Workflow/Meta_Data_Treatment_Theo.R")
source("Simplified_Workflow/script_get_pheno.R")
source("Simplified_Workflow/main_merging_meta_soil.R")
source("Simplified_Workflow/Weather_data_transformation.R")
source("Simplified_Workflow/FlowringDateAnalysis.R")
source("Simplified_Workflow/PredictPlantingDates.R")
source("Simplified_Workflow/Weather_by_environnement_analysis.R")
source("Simplified_Workflow/IterativeRF.R")
```
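
One way to keep memory use down between scripts is to source each one in a throwaway environment and trigger garbage collection afterwards. The helper below is a sketch under a strong assumption that is not confirmed by this repository: each script reads its inputs from disk and writes its outputs to disk, so nothing has to stay in memory between steps. `run_scripts_isolated` is a hypothetical name.

```r
# Hypothetical helper: source each script in its own temporary environment,
# then drop that environment and trigger garbage collection.
# Assumption: scripts exchange data through files, not through global variables.
run_scripts_isolated <- function(scripts) {
  for (script in scripts) {
    message("Running ", script)
    script_env <- new.env(parent = globalenv())
    source(script, local = script_env)
    rm(script_env)   # drop the only reference to the script's objects
    invisible(gc())  # ask R to reclaim the freed memory
  }
  invisible(scripts)
}

# Example call (commented out, since it needs the repository and the data):
# run_scripts_isolated(c(
#   "Simplified_Workflow/frequences_alleliques.R",
#   "Simplified_Workflow/GenomicPCA.R"
# ))
```

If a script relies on objects created by an earlier one, this isolation will break it. In that case, source the scripts as listed above and remove large objects by hand with `rm()` followed by `gc()`.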
## Results 
The `Results` directory will then contain 5 files:
1. `Results/MSE_iterative_random_forest.csv`: MSE from the cross-validation test used for result selection
2. `Results/Prediction_l.csv` : Prediction based only on locations (mean by locations)
3. `Results/Prediction_le.csv` : Random Forest based on location means and environmental data
4. `Results/Prediction_lg.csv` : Random Forest based on location means and genomic data
5. `Results/Prediction_leg.csv` : Random Forest based on location means, environmental and genomic data
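
To illustrate how an MSE can be used to choose among the four predictions, here is a small sketch. It is not the team's cross-validation code, and the numbers are toy values picked arbitrarily, not results from the challenge. The names `mse`, `observed`, and `predictions` are hypothetical.

```r
# Mean squared error between observed and predicted values.
mse <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Toy numbers for illustration only; they are NOT results from the challenge.
observed <- c(10, 12, 9, 11)
predictions <- list(
  l   = c(11, 11, 11, 11),
  le  = c(10, 13, 9, 10),
  lg  = c(9, 12, 10, 12),
  leg = c(10, 12, 9, 12)
)

# One MSE per candidate model; the lowest value identifies the selected model.
scores <- vapply(predictions, mse, numeric(1), observed = observed)
best_model <- names(scores)[which.min(scores)]
print(scores)
print(best_model)
```

In practice the observed values would come from held-out data in the cross-validation, and the scores would be those stored in `Results/MSE_iterative_random_forest.csv`.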