
# G2F Challenge

## Objectives

This repository is the condensed workflow of the ML_APT team for the challenge [Genomes to Fields (G2F) Genotype by Environment Prediction Competition (2022)](https://www.maizegxeprediction2022.org/). The goal was to predict corn yield in 2022 at different locations in the U.S.A. and Canada, based on corn yields from 2014 to 2021. Alongside the yields, we also used genomic data from the hybrids as well as environmental data (weather, soil, and metadata).

## How to use our workflow

### Clone the repository

### Copy the datasets

To use this repository and predict the yields on your own computer, you first need to store the G2F challenge data in the right places.
Copy the following files into the `Training_Data` and `Testing_Data` folders according to this table. Files marked as unused are optional.

Dataset | `Training_Data` | `Testing_Data`
-------|-----------------|----------------------
1 - Hybrid x Location | `1_Training_Trait_Data_2014_2021.csv` | `1_Submission_Template_2022.csv`
2 - Meta Data | `2_Training_Meta_Data_2014_2021.csv` | `2_Testing_Meta_Data_2022.csv`
3 - Soil Data | `3_Training_Soil_Data_2015_2021.csv` | `3_Testing_Soil_Data_2022.csv`
4 - Weather Data | `4_Training_Weather_Data_2014_2021.csv` | `4_Testing_Weather_Data_2022.csv`
5 - Genomic Data | `5_Genotype_Data_All_Years.vcf` | 
6 - EC Data (unused) |`6_Training_EC_Data_2014_2021.csv` | `6_Testing_EC_Data_2022.csv`
7 - Hybrid names (unused) | `All_hybrid_names_info.csv`|
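
Before running the workflow, it may help to confirm that the mandatory files are in place. The following is a minimal sketch, not part of the repository: the helper name `check_g2f_files` is hypothetical, and it assumes that the `Training_Data` and `Testing_Data` folders sit directly under the directory passed as `root` (by default, the working directory).

```r
# Hypothetical helper (not part of the repository): returns the mandatory
# files from the table above that are missing under `root`.
check_g2f_files <- function(root = ".") {
  required <- c(
    file.path("Training_Data", c(
      "1_Training_Trait_Data_2014_2021.csv",
      "2_Training_Meta_Data_2014_2021.csv",
      "3_Training_Soil_Data_2015_2021.csv",
      "4_Training_Weather_Data_2014_2021.csv",
      "5_Genotype_Data_All_Years.vcf"
    )),
    file.path("Testing_Data", c(
      "1_Submission_Template_2022.csv",
      "2_Testing_Meta_Data_2022.csv",
      "3_Testing_Soil_Data_2022.csv",
      "4_Testing_Weather_Data_2022.csv"
    ))
  )
  # Keep only the relative paths that do not exist (empty if all are present)
  required[!file.exists(file.path(root, required))]
}

missing_files <- check_g2f_files()
if (length(missing_files) > 0) {
  message("Missing files:\n", paste(missing_files, collapse = "\n"))
}
```

An empty result means every mandatory file was found; the unused files (EC data and hybrid names) are deliberately not checked.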

### Run the complete workflow

To perform the analysis, render the file `Simplified_Workflow/Main.qmd`. You can do this with the RStudio Render option or by running this line from within the project (`g2f_challenge.Rproj`).

```r
quarto::quarto_render("Simplified_Workflow/Main.qmd")
```

### Machine Capacity 

Because the data files are large and R does not readily release memory, you may run out of RAM. In that case, you can still run the R scripts called by the workflow one at a time, as shown below, and clean the environment in between if needed:

1. **Main Settings**

```r
# Setting up the environment
library(SpATS)
library(emmeans)
library(lme4)
library(data.table)
library(tidyverse)
library(future)
library(purrr)
library(furrr)
library(tictoc)
library(lubridate)
library(rsample)
library(randomForest)
library(randomForestExplainer)
library(FactoMineR)
library(factoextra)
library(glmnet)

# Setting directories
if (!dir.exists("../Filtered_Training_Data")) {
  dir.create(path = "../Filtered_Training_Data")
}
if (!dir.exists("../Filtered_Testing_Data")) {
  dir.create(path = "../Filtered_Testing_Data")
}
if (!dir.exists("../Results")) {
  dir.create(path = "../Results")
}
```

2. **Run files**

```r
source("Simplified_Workflow/frequences_alleliques.R")
source("Simplified_Workflow/GenomicPCA.R")
source("Simplified_Workflow/AnalysePheno.R")
source("Simplified_Workflow/Meta_Data_Treatment_Theo.R")
source("Simplified_Workflow/script_get_pheno.R")
source("Simplified_Workflow/main_merging_meta_soil.R")
source("Simplified_Workflow/Weather_data_transformation.R")
source("Simplified_Workflow/FlowringDateAnalysis.R")
source("Simplified_Workflow/PredictPlantingDates.R")
source("Simplified_Workflow/Weather_by_environnement_analysis.R")
source("Simplified_Workflow/IterativeRF.R")
```
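
One way to clean the environment between steps is sketched below. This is not part of the repository: the helper name `run_and_clean` is hypothetical, and the sketch assumes that each script reads its inputs from disk rather than relying on objects left in the global environment by a previous script, which may not hold for every step. Objects that a later step needs can be listed in `keep`. Note that it removes every other object from the global environment, including any you created yourself.

```r
# Hypothetical helper (not part of the repository): source one script,
# then drop the objects left in the global environment and trigger
# garbage collection so memory can be reclaimed.
run_and_clean <- function(script, keep = character()) {
  source(script)  # evaluated in the global environment by default
  leftovers <- setdiff(ls(envir = globalenv()), c("run_and_clean", keep))
  rm(list = leftovers, envir = globalenv())
  invisible(gc())
}

# Example usage (uncomment to run one step of the workflow):
# run_and_clean("Simplified_Workflow/GenomicPCA.R")
```

Loaded packages are not affected by `rm()`, so the libraries from the main settings stay attached between steps.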

## Results 

The `Results` directory will then contain five files:

1. `Results/MSE_iterative_random_forest.csv`: MSE from the cross-validation test, used for result selection
2. `Results/Prediction_l.csv`: Prediction based only on locations (mean by location)
3. `Results/Prediction_le.csv`: Random Forest based on location means and environmental data
4. `Results/Prediction_lg.csv`: Random Forest based on location means and genomic data
5. `Results/Prediction_leg.csv`: Random Forest based on location means, environmental and genomic data
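
To inspect the outputs from R, the prediction files can be loaded into a named list. This is a minimal sketch, not part of the repository: the helper name `read_predictions` is hypothetical, the default path assumes the working directory is the project root, and no assumption is made about the columns of the files.

```r
# Hypothetical helper (not part of the repository): read the prediction
# files that exist in `results_dir` into a named list of data frames.
read_predictions <- function(results_dir = "Results") {
  models <- c("l", "le", "lg", "leg")
  paths <- file.path(results_dir, paste0("Prediction_", models, ".csv"))
  names(paths) <- models
  paths <- paths[file.exists(paths)]  # skip files that were not produced
  lapply(paths, read.csv)
}

predictions <- read_predictions()
str(predictions, max.level = 1)
```

The list is named after the model suffixes (`l`, `le`, `lg`, `leg`), so a single set of predictions can be accessed with, for example, `predictions$leg`.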