Commit a7c16b39 authored by Jean-Benoist Leger

Merge branch 'docgen'

parents d3520109 e7fdd8cb
......@@ -4,4 +4,5 @@
.*.swo
LineImputer*.tar.gz
LineImputer*.Rcheck
man/
......@@ -7,6 +7,9 @@ stages:
compile:
stage: build
script:
- apt-get update
- apt-get install -y r-cran-roxygen2
- echo 'roxygen2::roxygenise()' | R --no-save
- ./.install-deps
- R CMD build .
- R CMD check LineImputer*.tar.gz
......
......@@ -10,5 +10,4 @@ Encoding: UTF-8
LazyData: true
Imports: Rcpp (>= 0.12.11), stringr (>= 1.2.0), tools
LinkingTo: Rcpp
RoxygenNote: 6.0.1
NeedsCompilation: yes
\name{LineImputer-internal}
\alias{ImputatorCpp}
\alias{ImpThreshold_cpp}
\alias{txttovcfCpp}
\alias{fusionCpp}
\alias{fusionlist}
\title{Internal LineImputer Functions}
\description{
Internal LineImputer functions
}
\details{
These are not to be called by the user.
}
\keyword{ internal }
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/impThreshold.R
\name{impThreshold}
\alias{impThreshold}
\title{Impute the file by threshold}
\usage{
impThreshold(probas_filename, probas_path, output_filename,
output_path, threshold = 0.5, nb_supp_elements = 9)
}
\arguments{
\item{probas_filename}{name of a posterior probability file, as created by \code{\link{imputation}}.}
\item{probas_path}{name of directory where \code{probas_filename} is located.}
\item{output_filename}{name of the output file. By default, the same as \code{probas_filename} (with an additional extension).}
\item{output_path}{name of the output directory.}
\item{threshold}{a numeric value between 0.5 and 1.}
\item{nb_supp_elements}{a numeric value indicating the number of metadata columns.}
}
\value{An imputed file with the same format as \code{probas_filename}. By default the output file is created in the current working directory.
}
\description{This function creates an imputed data file from a posterior probability file by applying a thresholded classification rule to each posterior probability.
}
\details{This function outputs a file of imputed data obtained by applying a thresholded classification rule to the posterior probabilities read from \code{probas_filename}. The thresholded rule performs classification as follows:
\itemize{
\item if the posterior probability is higher than \code{threshold}, the imputed value is '1/1',
\item if the posterior probability is lower than 1-\code{threshold}, the imputed value is '0/0',
\item if the posterior probability belongs to [1-\code{threshold},\code{threshold}] then no imputation is performed.
}
When applied with \code{threshold}=0.5 the classification rule boils down to the classical Maximum a Posteriori classification rule.
}
\examples{
## Merging of two VCF files,
## imputation of the merged file using both the Viterbi and Forward Backward algorithms,
## and imputation with a different threshold.
file_path_1 <- system.file('extdata', 'a.vcf', package = 'LineImputer')
file_path_2 <- system.file('extdata', 'b.vcf', package = 'LineImputer')
mergevcf(filelist = c(file_path_1, file_path_2))
imputation(markov_order = 3, input_filename = 'merged_file.vcf', do_viterbi = TRUE)
impThreshold(probas_filename = 'merged_file_MarkOrd_3_PosteriorProbabilities.vcf', threshold = 0.6)
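## Illustrative sketch only (not the implementation used by the package): the
## thresholded rule described in the details, written in base R. The helper
## name 'classify_posterior' is hypothetical. 'p' is assumed to be a vector of
## posterior probabilities of observing '1/1', and './.' is assumed to be the
## value kept when no imputation is performed.
classify_posterior <- function(p, threshold = 0.5) {
  out <- rep('./.', length(p))      # default: no imputation
  out[p > threshold] <- '1/1'       # strictly above the threshold
  out[p < 1 - threshold] <- '0/0'   # strictly below 1 - threshold
  out
}
classify_posterior(c(0.95, 0.55, 0.10), threshold = 0.6)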
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/imputation.R
\name{imputation}
\alias{imputation}
\title{Impute a VCF file}
\usage{
imputation(markov_order = 2, input_path, input_filename,
output_path, nb_supp_elements = 9, threshold = 0.5,
initial_count = 0.5, do_viterbi = FALSE, do_probas = TRUE)
}
\arguments{
\item{markov_order}{a numeric value between 2 and 20 indicating the Markov chain order.}
\item{input_path}{path name of the directory containing the file to be imputed.}
\item{input_filename}{name of the file to be imputed.}
\item{output_path}{path name where the output files should be saved. Default is the current working directory.}
\item{nb_supp_elements}{a numeric value indicating the number of columns corresponding to metadata.}
\item{threshold}{a numeric value between 0.5 and 1.}
\item{initial_count}{initial probability of each datum to be '0/0' or '1/1'. Default is 0.5.}
\item{do_viterbi}{a boolean indicating whether the Viterbi algorithm should be used for imputation. Default is \code{FALSE}.}
\item{do_probas}{a boolean indicating whether the posterior probabilities should be computed. Default is \code{TRUE}.}
}
\value{
If \code{do_probas=TRUE} (default), the function outputs two files:
\itemize{
\item "\code{input_filename}_MarkOrd_\code{markov_order}_PosteriorProbabilities.vcf" that corresponds to the posterior probability file obtained from the Forward-Backward algorithm,
\item "\code{input_filename}_MarkOrd_\code{markov_order}_PosteriorProbabilities_Thres_\code{threshold}_Imputed.vcf" that corresponds to the imputation based on the MAP rule applied to the posterior probabilities.
}
If \code{do_viterbi=TRUE}, the function outputs a .\code{vcf} file of (imputed) data named "\code{input_filename}_MarkOrd_\code{markov_order}_Imputed_Viterbi.vcf".
All output files have a structure identical to the initial file \code{input_filename} provided to the function (i.e. identical metadata). All files are created in the \code{output_path} directory.
}
\description{
This function performs imputation of missing genotypic data based on a Markov chain model.
}
\details{
The function performs imputation (prediction) of missing data based on a non-homogeneous Markov Chain (MC) model. The data to be imputed are assumed to be lines (i.e. pure homozygous individuals) described by bi-allelic markers (SNP), available from the \code{input_filename}.vcf file.
The .\code{vcf} format assumes that lines are displayed in columns and markers in rows. Additionally, the \code{input_filename} file may contain comment lines (starting with two or more "#" characters), and a header line starting with a single "#".
The metadata, i.e. additional information about each marker, should be displayed in the first \code{nb_supp_elements} columns. These metadata are copied in all output files.
The Markov model assumes that, at a given locus, the observation of value '0/0' or '1/1' depends on the observed values at the \code{markov_order} previous loci. One first needs to compute the transition matrix at each locus, then to infer the missing data either by inferring the most probable allelic sequence for each individual, or by inferring at each locus and for each individual the most probable allele.
\itemize{
\item Transition matrices are computed as follows. All values of all transition matrices are initialized at \code{initial_count}. The transition probabilities are then updated by counting (for each locus and each allele) how many times each sequence of size \code{markov_order} is followed by '0/0' or '1/1'.
\item The most probable sequence can be found using the Viterbi algorithm through the \code{do_viterbi} option. The most probable allelic sequence is then output.
\item Obtaining for each individual and at each locus the most probable allelic value can be done using the Forward Backward algorithm. The output is then the posterior probability (for each individual and each locus) of observing '1/1'.
Obtaining an imputed value requires an additional thresholding step: depending on whether the posterior probability is higher than \code{threshold}, lower than 1 - \code{threshold} or in the interval [1-\code{threshold}, \code{threshold}], the output value will be '1/1', '0/0' or './.', respectively.
If \code{threshold=0.5} this boils down to the classical Maximum a Posteriori classification rule, and all missing data are imputed. If \code{threshold>0.5}, only some of the missing values will be imputed.
}
}
\examples{
## Imputation of a VCF file
file_path_folder <- system.file('extdata', package = 'LineImputer')
imputation(markov_order = 3, input_path = file_path_folder, input_filename = 'a.vcf', do_viterbi = TRUE)
## Merging of two VCF files, then imputation of the merged file.
file_path_1 <- system.file('extdata', 'a.vcf', package = 'LineImputer')
file_path_2 <- system.file('extdata', 'b.vcf', package = 'LineImputer')
mergevcf(filelist = c(file_path_1, file_path_2))
imputation(markov_order = 3, input_filename = 'merged_file.vcf', do_viterbi = TRUE)
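## Illustrative sketch only (not the C++ implementation used by the package):
## estimating the transition probabilities at one locus by counting, as
## described in the details. 'transition_at_locus' is a hypothetical helper.
## 'geno' is assumed to be a character matrix with markers in rows and lines
## in columns, without missing values in the rows used. Normalizing the counts
## by row to obtain probabilities is also an assumption of this sketch.
transition_at_locus <- function(geno, locus, markov_order = 2, initial_count = 0.5) {
  stopifnot(locus > markov_order)
  # Alleles observed at the 'markov_order' previous loci, one string per line
  prev <- geno[(locus - markov_order):(locus - 1), , drop = FALSE]
  context <- apply(prev, 2, paste, collapse = ' ')
  # Enumerate all possible contexts so that unseen ones keep 'initial_count'
  alleles <- rep(list(c('0/0', '1/1')), markov_order)
  all_contexts <- apply(expand.grid(alleles, stringsAsFactors = FALSE), 1,
                        paste, collapse = ' ')
  counts <- table(factor(context, levels = all_contexts),
                  factor(geno[locus, ], levels = c('0/0', '1/1'))) + initial_count
  counts / rowSums(counts)   # each row sums to 1
}
geno <- rbind(c('0/0', '0/0', '1/1', '1/1'),
              c('0/0', '0/0', '1/1', '1/1'),
              c('0/0', '1/1', '1/1', '1/1'))
tm <- transition_at_locus(geno, locus = 3, markov_order = 2)
tm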
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mergevcf.R
\name{mergevcf}
\alias{mergevcf}
\title{Merge two or more files according to merging rules}
\usage{
mergevcf(output_path, output_filename = "merged_file",
filelist = NULL, dirlist = NULL, ext = "vcf", unk_sign = "./.",
comment_sign = "#", separator = "\t", nb_supp_elements = 9,
value_length = 3, recursive = TRUE)
}
\arguments{
\item{output_path}{path name where the output file should be saved. Default is the current working directory.}
\item{output_filename}{name of the output file.}
\item{filelist}{a list/vector containing the names of the files to be merged.}
\item{dirlist}{a list/vector of names of directories that contain the files to be merged.}
\item{ext}{a string indicating the extension for the files to be merged (only files in filelist or dirlist with the given extension will be merged). The output merged file will have the same extension. Default value is 'vcf'. }
\item{unk_sign}{a single string which is to be interpreted as the NA value. Note that this string should be of size \code{value_length}.}
\item{comment_sign}{a single character to be interpreted as the comment sign. Following the vcf format convention, file lines beginning with two comment signs will be treated as comment lines. A line beginning with one comment sign will be interpreted as the header.}
\item{separator}{the field separator character. Default is tabulation.}
\item{nb_supp_elements}{a numeric value indicating the number of columns corresponding to metadata.}
\item{value_length}{a numeric value indicating the length (i.e. the number of characters) of a datum in each file.}
\item{recursive}{logical. Whether the listing should recurse into directories or not. This is of use only if a \code{dirlist} is specified. Default is \code{TRUE}.}
\item{num_col_sort}{a numeric value indicating which column should be used to sort the rows in the merged file.}
}
\value{A file with the extension given in the argument \code{ext}, corresponding to the merging of all files listed in \code{filelist} and/or belonging to directories listed in \code{dirlist}. Additionally, special files 'Incoherences.txt' and 'log.txt' are created in the same directory as the merged file.
The file 'Incoherences.txt' contains the names of columns and rows corresponding to mismatches (a mismatch occurs when two data sharing the same column and row names in two different files do not have the same value).
The file 'log.txt' contains the number of columns and rows of each merged file including intermediate merged files which are removed once the merging is completed. This file also contains the list of merged files. The same list appears in the console at the end of merging.
}
\description{
This function merges two or more VCF files by common markers and individuals. Additionally, it can be applied to merge files with other (highly structured) formats.
}
\details{
The \code{mergevcf} function merges two or more VCF files by common markers and individuals. All files are assumed to have a classical \code{.vcf} format: rows correspond to markers, and columns correspond to biological samples. The first lines of the files correspond to comment lines starting with two or more \code{comment_sign}s, and the headers are displayed in a line starting with a single \code{comment_sign}. Each line contains additional information about the marker, contained in the first \code{nb_supp_elements} columns, that constitute the metadata. These metadata are pasted in the output file.
The files are merged according to their column and row names. The merging rules are:
\enumerate{
\item for an unobserved combination (rowname,colname), the resulting coded value is \code{unk_sign},
\item for a combination (rowname, colname) for which either \code{unk_sign} or an identical value V is observed in all files, the resulting coded value is V,
\item for mismatches, the resulting coded value is \code{unk_sign}.
}
The default settings of the arguments \code{ext}, \code{unk_sign}, \code{comment_sign}, \code{separator}, \code{nb_supp_elements} and \code{value_length} correspond to those of the VCF format. Alternatively, \code{mergevcf} may be used to merge other types of structured files, but with severe restrictions regarding their format.
}
\examples{
## merge two VCF files
file_path_1 <- system.file('extdata', 'a.vcf', package = 'LineImputer')
file_path_2 <- system.file('extdata', 'b.vcf', package = 'LineImputer')
mergevcf(filelist = c(file_path_1, file_path_2))
## merge all VCF files in a directory
folder_path <- system.file('extdata', package = 'LineImputer')
mergevcf(dirlist = folder_path)
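## Illustrative sketch only (not the implementation used by the package): the
## three merging rules from the details, applied to one (rowname, colname)
## cell. 'merge_cell' is a hypothetical helper. 'values' is assumed to hold
## the value found for that cell in each file where the combination exists
## (possibly none).
merge_cell <- function(values, unk_sign = './.') {
  observed <- unique(values[values != unk_sign])
  # Exactly one distinct known value: keep it. Otherwise: unknown.
  if (length(observed) == 1) observed else unk_sign
}
merge_cell(character(0))             # rule 1: unobserved combination
merge_cell(c('./.', '1/1', '1/1'))   # rule 2: unknown or identical value
merge_cell(c('0/0', '1/1'))          # rule 3: mismatch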
}