Commit 988d8c75 authored by Facundo Muñoz

Convert into RMarkdown.

Make reports, use bibliography, include text.
parent b1dc7e2a
@article{greenland_statistical_2016,
title = {Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations},
volume = {31},
issn = {0393-2990},
url = {http://dx.doi.org/10.1007/s10654-016-0149-3},
doi = {10.1007/s10654-016-0149-3},
abstract = {Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting.},
pages = {337--350},
number = {4},
journaltitle = {European Journal of Epidemiology},
shortjournal = {European Journal of Epidemiology},
author = {Greenland, Sander and Senn, Stephen J. and Rothman, Kenneth J. and Carlin, John B. and Poole, Charles and Goodman, Steven N. and Altman, Douglas G.},
date = {2016-05},
keywords = {2016, hypothesis-testing, significance, confidence\_intervals, pvalues},
}
@article{kruschke_bayesian_2017,
title = {The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective},
volume = {In press},
url = {http://dx.doi.org/10.3758/s13423-016-1221-4},
doi = {10.3758/s13423-016-1221-4},
abstract = {In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty on the other. Among frequentists in psychology, a shift of emphasis from hypothesis testing to estimation has been dubbed ” the New Statistics” (Cumming 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.},
journaltitle = {Psychonomic Bulletin \& Review},
shortjournal = {Psychonomic Bulletin \& Review},
author = {Kruschke, John K. and Liddell, Torrin M.},
date = {2017},
keywords = {bayesian, bayesian-frequentist-relationship, cultura\_general, 2017, hypothesis-testing, meta-analysis, significance, confidence\_intervals, bayes-factor, false\_discovery\_rate\_talk, inference, power-analysis},
}
@article{schweiger_fast_2016,
title = {Fast and Accurate Construction of Confidence Intervals for Heritability},
volume = {98},
issn = {00029297},
url = {http://dx.doi.org/10.1016/j.ajhg.2016.04.016},
doi = {10.1016/j.ajhg.2016.04.016},
abstract = {Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models ({LMMs}) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under {LMMs} uses the restricted maximum likelihood ({REML}) approach. Existing methods for the construction of confidence intervals and estimators of {SEs} for {REML} rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. We also show that often the probability that the genetic component is estimated as 0 is high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, {ALBI} (accurate {LMM}-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as {GCTA}, {FaST}-{LMM}, {GEMMA}, or {EMMAX}.},
pages = {1181--1192},
number = {6},
journaltitle = {The American Journal of Human Genetics},
author = {Schweiger, Regev and Kaufman, Shachar and Laaksonen, Reijo and Kleber, Marcus E. and März, Winfried and Eskin, Eleazar and Rosset, Saharon and Halperin, Eran},
date = {2016-06},
keywords = {2016, heritability, confidence\_intervals, reml, bootstrap}
}
@article{gelman_p_2013,
title = {P Values and Statistical Practice},
volume = {24},
issn = {1044-3983},
url = {http://dx.doi.org/10.1097/ede.0b013e31827886f7},
doi = {10.1097/ede.0b013e31827886f7},
pages = {69--72},
number = {1},
journaltitle = {Epidemiology},
author = {Gelman, Andrew},
date = {2013-01},
pmid = {23232612},
keywords = {bayesian-frequentist-relationship, 2013, confidence\_intervals, pvalues, false\_discovery\_rate\_talk},
}
@report{morey_fallacy_2014,
title = {The Fallacy of Placing Confidence in Confidence Intervals},
url = {http://dx.doi.org/10.5281/zenodo.16991},
abstract = {Interval estimates —estimates of parameters that include an allowance for sampling uncertainty— have long been touted as a key component of statistical analyses. There are several kinds of interval estimates, but the most popular are confidence intervals ({CIs}): intervals that contain the true parameter value in some known proportion of repeated samples, on average. The width of confidence intervals is thought to index the precision of an estimate; the parameter values contained within a {CI} are thought to be more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95\%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that {CIs} do not necessarily have any of these properties, and generally lead to incoherent inferences. For this reason, we recommend against the use of the method of {CIs} for inference.},
author = {Morey, Richard D. and Hoekstra, Rink and Lee, Michael D. and Rouder, Jeffrey N. and Wagenmakers, Eric J.},
date = {2014},
doi = {10.5281/zenodo.16991},
keywords = {bayesian-frequentist-relationship, cultura\_general, chalara-project, 2015, frequentist, confidence\_intervals, false\_discovery\_rate\_talk},
}
@article{agresti_approximate_1998,
title = {Approximate is Better than “Exact” for Interval Estimation of Binomial Proportions},
volume = {52},
issn = {0003-1305},
url = {https://doi.org/10.1080/00031305.1998.10480550},
doi = {10.1080/00031305.1998.10480550},
abstract = {For interval estimation of a proportion, coverage probabilities tend to be too large for “exact” confidence intervals based on inverting the binomial test and too small for the interval based on inverting the Wald large-sample normal test (i.e., sample proportion ± z-score × estimated standard error). Wilson's suggestion of inverting the related score test with null rather than estimated standard error yields coverage probabilities close to nominal confidence levels, even for very small sample sizes. The 95\% score interval has similar behavior as the adjusted Wald interval obtained after adding two “successes” and two “failures” to the sample. In elementary courses, with the score and adjusted Wald methods it is unnecessary to provide students with awkward sample size guidelines.},
pages = {119--126},
number = {2},
journaltitle = {The American Statistician},
author = {Agresti, Alan and Coull, Brent A.},
urldate = {2019-10-18},
date = {1998-05-01},
keywords = {binary-data, confidence\_intervals, 1998, proportions, multivacc-project},
}
@article{morey_fallacy_2016,
title = {The fallacy of placing confidence in confidence intervals},
volume = {23},
issn = {1531-5320},
url = {https://doi.org/10.3758/s13423-015-0947-8},
doi = {10.3758/s13423-015-0947-8},
abstract = {Interval estimates – estimates of parameters that include an allowance for sampling uncertainty – have long been touted as a key component of statistical analyses. There are several kinds of interval estimates, but the most popular are confidence intervals ({CIs}): intervals that contain the true parameter value in some known proportion of repeated samples, on average. The width of confidence intervals is thought to index the precision of an estimate; {CIs} are thought to be a guide to which parameter values are plausible or reasonable; and the confidence coefficient of the interval (e.g., 95 \%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that {CIs} do not necessarily have any of these properties, and can lead to unjustified or arbitrary inferences. For this reason, we caution against relying upon confidence interval theory to justify interval estimates, and suggest that other theories of interval estimation should be used instead.},
pages = {103--123},
number = {1},
journaltitle = {Psychonomic Bulletin \& Review},
shortjournal = {Psychon Bull Rev},
author = {Morey, Richard D. and Hoekstra, Rink and Rouder, Jeffrey N. and Lee, Michael D. and Wagenmakers, Eric-Jan},
urldate = {2020-06-19},
date = {2016-02-01},
langid = {english},
keywords = {bayesian-frequentist-relationship, cultura\_general, 2016, frequentist, confidence\_intervals, false\_discovery\_rate\_talk},
}
@article{nalborczyk_pragmatism_2019,
title = {Pragmatism should Not be a Substitute for Statistical Literacy, a Commentary on Albers, Kiers, and Van Ravenzwaaij (2018)},
volume = {5},
issn = {2474-7394},
url = {https://online.ucpress.edu/collabra/article/doi/10.1525/collabra.197/112982/Pragmatism-should-Not-be-a-Substitute-for},
doi = {10.1525/collabra.197},
abstract = {Based on the observation that frequentist confidence intervals and Bayesian credible intervals sometimes happen to have the same numerical boundaries (under very specific conditions), Albers et al. (2018) proposed to adopt the heuristic according to which they can usually be treated as equivalent. We argue that this heuristic can be misleading by showing that it does not generalise well to more complex (realistic) situations and models. Instead of pragmatism, we advocate for the use of parsimony in deciding which statistics to report. In a word, we recommend that a researcher interested in the Bayesian interpretation simply reports credible intervals.},
pages = {13},
number = {1},
journaltitle = {Collabra: Psychology},
author = {Nalborczyk, Ladislas and Bürkner, Paul-Christian and Williams, Donald R.},
editor = {Savalei, Victoria},
urldate = {2021-05-12},
date = {2019-01-01},
langid = {english},
keywords = {2019, bayesian, bayesian-frequentist-relationship, reference, confidence\_intervals, philosophy, credible-interval},
}
@article{wang_confidence_2001,
title = {Confidence interval for the mean of non-normal data},
volume = {17},
issn = {1099-1638},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.400},
doi = {10.1002/qre.400},
abstract = {The problem of constructing a confidence interval for the mean of non-normal data is considered. The Bootstrap method and the Box–Cox transformation method of constructing the confidence interval are compared with the normal theory method. Simulation studies are used to evaluate the performance of these different methods of constructing confidence intervals. The result is not surprising; the Bootstrap method is more effective and efficient than the Box–Cox transformation method and the normal theory method in this simulation study. A real example demonstrates the ability of these methods to construct a confidence interval for the mean of audit accounting data. Copyright © 2001 John Wiley \& Sons, Ltd.},
pages = {257--267},
number = {4},
journaltitle = {Quality and Reliability Engineering International},
author = {Wang, F. K.},
urldate = {2021-08-19},
date = {2001},
langid = {english},
keywords = {2001, bootstrap, confidence\_intervals},
}
---
title: "Confidence Interval for a ratio"
author: "Facundo Muñoz"
date: "`r format(Sys.Date(), '%e %B, %Y')`"
output:
  bookdown::pdf_document2:
    template: null
documentclass: cirad
bibliography: confidence_intervals.bib
editor_options:
  markdown:
    wrap: 72
---
```{r packages, include = FALSE}
pacman::p_load(
  "boot",
  "furrr",
  # ...
  "readxl",
  "tidyverse"
)
```
```{r setup, include=FALSE, cache = FALSE}
knitr::opts_chunk$set(
  echo = FALSE,
  cache = FALSE
  # ...
)
theme_set(theme_ipsum(grid = "Y"))
```
```{r parameters}
sim_pars <- tribble(
  ~param, ~value,
  "N", 334,
  # ...
  "b1", -.2
)
```
```{r data}
data_file <- here::here("data/HumanDogRatioData.xlsx")
raw_data <- read_excel(data_file)

# ...
clean_data <- raw_data |>
  select(Idmenage, zone = ru_urb, Nbh, Ndog) |>
  mutate(dh_ratio = Ndog/Nbh)
```
# Introduction
We surveyed two variables ($X$ and $Y$ counts) from a population and we
are interested in their ratio $Y/X$. We want to make inference on the
**average ratio** in the population.

We need to make sure that $X > 0$ to avoid infinities in the ratio.
Typically, we survey one variable ($Y$) given positive values of the
other, so the variables must be chosen accordingly. Specifically, we
survey households inhabited by at least one person and, conditional on
that, count the number of dogs. Not the converse.
\clearpage
# Data description

```{r data-description, fig.cap = cap}
cap <- "Sample distributions of survey data (dog-human ratio, number of humans and number of dogs) by zone."
clean_data |>
  pivot_longer(
    Nbh:dh_ratio,
    names_to = "variable",
    values_to = "value"
  ) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_grid(zone ~ variable, scales = "free")
```
# Method 1: classical CI for a population mean
This is the most standard and classical method, which is based on
asymptotic normality of the sample mean of any distribution
[@wang_confidence_2001].
Let $R = Y/X$ be the quantity of interest. Consider the sample mean
$\bar R = \sum_{i=1}^n r_i/n$ and variance
$S^2 = \sum_{i=1}^n (r_i - \bar R)^2 / (n-1)$.
The theory says that, when $n \to \infty$,
$(\bar R - \mu_R) / (S/\sqrt{n}) \sim t_{n-1}$. Thus, a confidence
interval for the population mean $\mu_R$ is $$
\bar R \pm t_{\alpha/2}\,S/\sqrt{n}
$$ where $1-\alpha$ is the confidence level and $t_{\alpha/2}$ is the
upper tail $\alpha/2$ percentile of the Student $t$ distribution with
$n-1$ degrees of freedom.
```{r confint-ratio-normal, echo = TRUE}
confint_ratio_normal <- function(x, alpha = 0.05) {
  hatx <- mean(x)                                # sample mean
  s2 <- var(x)                                   # sample variance
  n <- length(x)
  ta2 <- qt(alpha/2, n - 1, lower.tail = FALSE)  # upper-tail t percentile
  return(hatx + ta2*sqrt(s2/n) * c(-1, 1))       # CI endpoints
}
```
```{r res-normal}
res_normal <-
  clean_data |>
  select(zone, y = dh_ratio) |>
  group_by(zone) |>
  summarise(
    Mean = mean(y),
    Variance = var(y),
    CI95_a = confint_ratio_normal(y)[1],
    CI95_b = confint_ratio_normal(y)[2],
    .groups = "drop"
  )
```
```{r table-confint-normal}
cap <- "Results using the Normal asymptotic approximation."
kbl(
  res_normal,
  booktabs = TRUE,
  digits = 3,
  caption = cap
)
```
If all that is required is a rough estimate of the average dog-human ratio in urban and rural areas, this method is good enough.
However, we cannot make inferences about the __difference__ between the urban and rural ratios.
In particular, drawing conclusions from the overlap of the confidence intervals is incorrect.
For that we need something else, like bootstrap estimates.
# Method 2: Bootstrapping

The _bootstrap_ is a non-parametric computational approach based on resampling the observed data with replacement in order to simulate a large number of replications of the experiment.
There are several variations of the method for calculating confidence intervals. I will use the basic version here; a sketch comparing some alternatives follows the next chunk.
```{r confint-ratio-boot, echo = TRUE}
get_estimates <- function(x, i) {
  # resampled indices falling in each zone
  i_u <- i[x$zone[i] == "URBAIN"]
  i_r <- i[x$zone[i] == "RURAL"]
  m_u <- mean(x$dh_ratio[i_u])
  m_r <- mean(x$dh_ratio[i_r])
  c(RURAL = m_r, URBAIN = m_u, `U-R` = m_u - m_r)
}
clean_data_boot <- boot(clean_data, get_estimates, R = 1e4)
```
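The `boot` package implements several of these variants from the same bootstrap object. A quick sketch for comparing them on the urban-rural difference (`index = 3`, the third component returned by `get_estimates`); the chunk label and option choices are illustrative:

```{r confint-ratio-boot-types, echo = TRUE, eval = FALSE}
## Basic, percentile and BCa intervals from the same replicates.
boot.ci(clean_data_boot, index = 3, type = c("basic", "perc", "bca"))
```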
```{r res-boot}
res_boot <-
  clean_data_boot$t0 |>
  enframe(name = "zone", value = "Mean") |>
  left_join(
    map_dfr(
      setNames(seq_along(clean_data_boot$t0), names(clean_data_boot$t0)),
      ~ boot.ci(clean_data_boot, index = ., type = "basic")$basic[1, 4:5]
    ) |>
      add_column(side = c("CI95_a", "CI95_b")) |>
      pivot_longer(
        cols = -side,
        names_to = "zone",
        values_to = "value"
      ) |>
      pivot_wider(
        names_from = "side",
        values_from = "value"
      ),
    by = "zone"
  )
```
```{r table-confint-boot}
cap <- "Results using the basic Bootstrap method."
kbl(
  res_boot,
  booktabs = TRUE,
  digits = 3,
  caption = cap
)
```
We get similar confidence intervals for the dog-human ratio in rural and urban areas, with a slight downward shift, especially in rural areas.
More importantly, we also get inference on the difference, which is an order of magnitude smaller and non-significant.
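As a visual complement, the bootstrap replicates of the difference (the third component returned by `get_estimates`) can be inspected directly; a minimal sketch:

```{r boot-diff-dist, fig.cap = cap}
cap <- "Bootstrap distribution of the difference in average dog-human ratio (urban minus rural); the dashed line marks zero."
tibble(`U-R` = clean_data_boot$t[, 3]) |>
  ggplot(aes(`U-R`)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 0, linetype = "dashed")
```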
Yet, if we need something _more_ than the population mean, such as predictive statements like _what is the chance of a household having more dogs than expected?_ or _what is the chance of having at least one dog?_, then we need a proper model.
# Method 3: Statistical modelling
Since we are working with counts, the simplest model is Poisson:
\begin{equation}
\begin{aligned}
\label{eq:model}
Y_i & \sim \text{Po}(X_i\cdot\lambda_i) \\
\log(\lambda_i) & = \beta_0 + \beta_U \mathbb{I}_U
\end{aligned}
\end{equation}
where $\mathbb{I}_U$ is an indicator variable for urban areas.
In this model, the number of dogs $Y_i$ in household $i$ is a Poisson
random variable with mean proportional to the number of human
inhabitants $X_i$, where the proportionality constant $\lambda_i$ is
the parameter of interest. It represents the average dog-human ratio,
and depends on the zone (rural/urban) of the household.
Specifically, the average dog-human log-ratio is $\beta_0$ in rural
areas and $\beta_0 + \beta_U$ in urban areas.
```{r fm1}
fm1 <- glm(
  formula = Ndog ~ zone,
  family = "poisson",
  # the Poisson log link requires the offset on the log scale
  offset = log(clean_data$Nbh),
  data = clean_data
)
```
```{r fm1-summary}
summary(fm1)
```
```{r table-estimates-model}
## Estimates on the ratio scale: rural, urban, and their relative factor
exp(c(coef(fm1), sum(coef(fm1)))[c(1, 3, 2)]) |>
  setNames(c("RURAL", "URBAIN", "U/R")) |>
  enframe(name = "zone", value = "Mean")
```
This gives much smaller ratios, and a factor of 6 in favour of urban areas.
Note that we no longer talk of differences but of a relative factor, as a consequence of the model formulation.
Computing confidence intervals here would again involve the Bootstrap, but using the model estimates instead of empirical averages.
However, there is evidence of over-dispersion in the data, so a more appropriate model should first be developed before continuing the analysis.
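A minimal sketch of how that over-dispersion can be checked, assuming `fm1` as fitted above; the quasi-Poisson refit is just one standard option, not necessarily the model to be developed:

```{r fm1-overdispersion, echo = TRUE}
## Residual deviance much larger than its degrees of freedom
## suggests extra-Poisson variation.
deviance(fm1) / df.residual(fm1)

## A quasi-Poisson refit estimates this dispersion and inflates
## standard errors accordingly.
fm1q <- update(fm1, family = "quasipoisson")
summary(fm1q)$dispersion
```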
\clearpage
# Simulation study
```{r sim-pars}
cap <- "Parameters used for simulating data."
sim_pars |>
  kbl(
    booktabs = TRUE,
    caption = cap
  )
```
In order to evaluate the accuracy of the alternative methods, I simulated data from model \@ref(eq:model) with parameters (Table \@ref(tab:sim-pars)) that mimic the real observed data (Figure \@ref(fig:sim-data-description)).
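As an illustration of the simulation step, here is a minimal sketch; the intercept `b0`, the 50/50 zone split, and the household-size distribution are assumptions made for illustration, not the actual generating code, so the chunk is not evaluated:

```{r sim-data-sketch, echo = TRUE, eval = FALSE}
pars <- deframe(sim_pars)  # named vector: N, b0 (assumed), b1, ...
n <- pars[["N"]]
sim_data <- tibble(
  zone = sample(c("RURAL", "URBAIN"), n, replace = TRUE),
  ## household size: at least one inhabitant (assumed distribution)
  x = 1 + rpois(n, lambda = 3),
  ## dog counts from the Poisson model, with assumed coefficients
  y = rpois(n, lambda = x * exp(pars[["b0"]] + pars[["b1"]] * (zone == "URBAIN"))),
  r = y / x
)
```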
```{r sim-data-description, fig.cap = cap}
cap <- "Sample distributions of __simulated__ survey data (dog-human ratio (r), number of humans (x) and number of dogs (y)) by zone."
sim_data |>
  pivot_longer(
    x:r,
    names_to = "variable",
    values_to = "value"
  ) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_grid(zone ~ variable, scales = "free")
```
# Conclusions
If simple confidence intervals for the urban and rural areas are all that is required for reporting, the Normal approximation will suffice.
If the difference between the urban and rural ratios is of interest as well, then a Bootstrap method can be used.
However, if the actual interest is in the number of dogs that can be expected in a household, then a proper statistical model is needed.
In particular, modelling reveals that considering the difference of ratios can be misleading, since it is determined in part by the distribution of the number of inhabitants in urban and rural areas.
It might make more sense to consider the ratio of the ratios instead.
# References