If you want to run this code locally, look at the last two sections
Here we aim at testing whether the prophylactic use of antibiotics (exposure) has an effect on the occurence of clinical signs (outcome). The main challenge in this analysis is the preparation of the data in such a way that AMU exposure can be considered prophylactic AMU exposure. The overall strategy for data preparation is described in Figure 1.
Figure 1: Preparation of data before analysis. The horizontal arrow represents the time line of a flock, divided into weeks that are represented by rectangles, starting on week 1 (on the left). For any given week w (represented here by the hashed rectangle), we compute (i) an exposure variable based on the use or not of antibiotics on week w (step 1, in green) and (ii) an outcome variable based on the occurence or not of clinical signs over an observation period of x weeks after week w (step 2, in red). Statistical analyses will consist in investigating whether AMU on week w (exposure) affects the occurence of clinical signs over the observation period (outcome). In order to make sure that AMU exposure on week w does correspond to AMU that we can consider as prophylactic, we need to filter out from the analysis all the weeks w preceded by (i) the presence of clinical signs over the y weeks before week w (including week w) and (ii) AMU over the z weeks before week w (naturally excluding week w since this information is used to compute the exposure variable). This step 0 is shown in blue on the figure. Finally, the analysis may include potential confounding factors (shown in orange) such as the age of the chicken (i.e. week w) as well as AMU during the first a weeks of the flock’s life (in grey) and during the x weeks of the observation period.
Note that the duration y over which the filtering based on clinical signs is done and the duration z over which the filtering based on AMU is done do not have to be identical, although it probably makes sense if they are. On Figure 1, CS corresponds to a given set of clinical signs and AMU corresponds to a given set of antimicrobials. Note that the sets of clinical signs used for the filtering process and the computation of the outcome variable do not have to be identical, although, again, it probably makes sense if they are. Same remark for the sets of antimicrobials used for the filtering process, and the computations of the exposure variable and the confounding variables over the first weeks of the flock and the observation period. In the code below, we create the following 4 functions allowing us to prepare the data according to the strategy outlined in Figure 1:
make_exposure(df, antimicrobials)
computes a boolean exposure variable using the set of antimicrobials
antimicrobials of the data frame df
.
make_outcome(df, outcome, loc)
computes a boolean outcome variable for the loc
weeks that follow week w, using the set outcome
of clinical signs of the data frame df
. Note that this function will also be used to compute the AMU covariable during the observation period, in which case outcome
will then contain the set of antimicrobials to consider.
prophylactic(df, clinical_signs, antimicrobials, lcs, lam)
computes a boolean variable expressing whether any given week of the data df
can be included in the analysis or not. This is based on the filtering process explained in figure 1 and considering the clinical_signs
set of clinical signs over the lcs
weeks before week w (including week w) and the antimicrobials
set of antimicrobials over lam
weeks before week w (excluding week w).
first_weeks(df, antimicrobials, wk)
: computes the boolean AMU covariable during the first wk
weeks of the flock’ life, using the antimicrobials
set of antimicrobials.
In these 4 functions, df
is a data frame of the structure of viparc data, as described here. These 4 functions are designed so that they assemble perfectly well into pipelines. Once the data are prepared, the analysis can be performed by different statistical tests / models:
the simplest one is certainely a Fisher exact test. However, it appears that both prophylactic AMU and the occurence of clinical signs strongly depends on the age of the flock (as well as, to a lesser extend, on the AMU during the first weeks of the flock and the observation period). For that reason, we considered more sophisticatted analyses allowing to correct for confounding effects. These are
logistic regression. Here, we compared correction by partial likelihood ratio tests and stratification. It produces similar results. Because the effect of age on the occurence of clinical signs may be non-linear we also considered
general additive models in order to model the age - outcome relationship in an as flexible way as possible. After the preliminary analyses shown below, it is this option that we selected for full analysis.
We then developped a number of pipelines combining both data preparation and data analysis. pipeline1()
is based on the Fisher exact test, pipeline3()
is based on the GLM and pipeline5()
is based on the GAM. pipeline2()
, pipeline4()
and pipeline6()
are the same as pipeline1()
, pipeline3()
and pipeline5()
respectively, with simplified interfaces. At the end, we created 3 functions that allows to explore the exposure - outcome relationships using various choices for the definitions of the duration of the first week, the filtering periods and the observation period, as well as the definition of the sets of antimicrobials and clinical signs to consider:
run_plan(pipeline, df, ...)
runs the pipeline pipeline
on the data frame df
using the sets of parameters as defined by ...
.
develop()
and simplify()
allow to manipulate the output of run_plan()
.
Installing the required packages:
installed_packages <- row.names(installed.packages())
required <- c("dplyr", "magrittr", "mgcv", "purrr", "readr")
to_install <- setdiff(required, installed_packages)
if (length(to_install)) install.packages(to_install)
if (! require("mgcv.helper")) {
if (! require("remotes")) install.packages("remotes")
remotes::install_github("samclifford/mgcv.helper")
}
Loading magrittr
for interactive use:
library(magrittr)
viparc <- readr::read_csv("https://raw.githubusercontent.com/viparc/prophylactic/master/data/viparc_qualitative.csv",
col_types = paste(c("ciiiddd", rep("l", 51)), collapse = "")) %>%
dplyr::select(-nb_chicken, -chicken_disease_death, -chicken_sudden_death, -nb_chicken_sold)
We have 3 types of variables in this data frame:
farm
, flock
and week
;respiratory
, diarrhoea
, cns
, malaise
, leg_lesions
and sudden_death
;amoxicillin_use
to unknown_use
.For the purpose of the examples used below, we will used the following sets of antimicrobials and clinical signs:
am_set <- c("colistin_use", "oxytetracycline_use")
cs_set <- c("malaise", "diarrhoea")
The week we are interested in are the weeks for which
Here, we build the function prophylactic()
that helps performing this filtering. This function makes use of the function lagging()
defined below and that lags the variables var
of the data frame df
. Note that this later must have at least 3 columns named respectively farm
, flock
and week
.
lagging <- function(df, var, lag) {
require(magrittr)
stopifnot(c("farm", "flock", "week", var) %in% names(df))
df %>%
dplyr::select(!! c("farm", "flock", "week", var)) %>%
dplyr::mutate(week = week + lag) %>%
dplyr::rename_at(var, paste0, lag)
}
Let’s try it:
lapply(0:2, lagging, df = viparc, var = am_set) %>%
purrr::reduce(dplyr::full_join, c("farm", "flock", "week")) %>%
dplyr::left_join(viparc, ., c("farm", "flock", "week")) %>%
dplyr::select(farm, flock, week, dplyr::matches("colistin|oxytetracycline"))
## # A tibble: 5,566 x 11
## farm flock week colistin_use oxytetracycline… colistin_use0 oxytetracycline… colistin_use1 oxytetracycline… colistin_use2
## <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 75-0… 1 1 TRUE TRUE TRUE TRUE NA NA NA
## 2 75-0… 1 2 FALSE FALSE FALSE FALSE TRUE TRUE NA
## 3 75-0… 1 3 FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 4 75-0… 1 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 75-0… 1 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 75-0… 1 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 7 75-0… 1 7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 8 75-0… 1 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 9 75-0… 1 9 FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## 10 75-0… 1 10 FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## # … with 5,556 more rows, and 1 more variable: oxytetracycline_use2 <lgl>
It works as expected (note that we could have used the dplyr::lag()
function too). Now we can define the prophylactic()
function that uses the above-defined lagging()
function:
prophylactic <- function(df, clinical_signs, antimicrobials, lcs, lam) {
require(magrittr)
week_id <- c("farm", "flock", "week")
purrr::map2(list(c(0, seq_len(lcs)), seq_len(lam)), # doing the lagging on clinical signs and AMU
list(clinical_signs, antimicrobials),
~ purrr::map(.x, lagging, df = df, var = .y)) %>%
unlist(FALSE) %>%
purrr::reduce(dplyr::left_join, week_id) %>% # putting together into a single data frames
# note: we don't need to join with the original data because the original data
# is already here thanks to the lag of 0 in the above lagging() call to the clinical signs.
dplyr::select(-week_id) %>%
dplyr::mutate_all(dplyr::coalesce, FALSE) %>% # we want to get rid off FALSEs and also NAs
rowSums() %>%
not()
}
Let’s try it:
viparc %>%
dplyr::mutate(prphlctc = prophylactic(., cs_set, am_set, 2, 2)) %>%
dplyr::select(farm, flock, week, am_set, cs_set, prphlctc) %>%
head(20)
## # A tibble: 20 x 8
## farm flock week colistin_use oxytetracycline_use malaise diarrhoea prphlctc
## <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 75-001 1 1 TRUE TRUE FALSE FALSE TRUE
## 2 75-001 1 2 FALSE FALSE FALSE FALSE FALSE
## 3 75-001 1 3 FALSE FALSE FALSE FALSE FALSE
## 4 75-001 1 4 FALSE FALSE FALSE FALSE TRUE
## 5 75-001 1 5 FALSE FALSE FALSE FALSE TRUE
## 6 75-001 1 6 FALSE FALSE FALSE FALSE TRUE
## 7 75-001 1 7 FALSE FALSE FALSE FALSE TRUE
## 8 75-001 1 8 FALSE FALSE FALSE FALSE TRUE
## 9 75-001 1 9 FALSE TRUE FALSE FALSE TRUE
## 10 75-001 1 10 FALSE FALSE FALSE FALSE FALSE
## 11 75-001 1 11 TRUE FALSE FALSE FALSE FALSE
## 12 75-001 1 12 FALSE FALSE FALSE FALSE FALSE
## 13 75-001 1 13 FALSE FALSE FALSE FALSE FALSE
## 14 75-001 1 14 FALSE FALSE FALSE FALSE TRUE
## 15 75-001 1 15 FALSE FALSE FALSE FALSE TRUE
## 16 75-001 1 16 TRUE FALSE FALSE FALSE TRUE
## 17 75-001 1 17 FALSE FALSE FALSE FALSE FALSE
## 18 75-001 1 18 FALSE FALSE FALSE FALSE FALSE
## 19 75-001 1 19 FALSE FALSE FALSE FALSE TRUE
## 20 75-001 1 20 FALSE FALSE FALSE FALSE TRUE
The filtering can be performed in one step:
viparc %>%
dplyr::filter(prophylactic(., cs_set, am_set, 2, 2))
## # A tibble: 2,942 x 54
## farm flock week respiratory diarrhoea cns malaise leg_lesions sudden_death amoxicillin_use ampicillin_use apramycin_use
## <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 75-0… 1 1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 75-0… 1 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 75-0… 1 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 75-0… 1 6 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 5 75-0… 1 7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 75-0… 1 8 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 7 75-0… 1 9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 8 75-0… 1 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 9 75-0… 1 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 75-0… 1 16 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## # … with 2,932 more rows, and 42 more variables: cefadroxil_use <lgl>, cefotaxime_use <lgl>, ceftiofur_use <lgl>, cephalexin_use <lgl>,
## # chloramphenicol_use <lgl>, ciprofloxacin_use <lgl>, colistin_use <lgl>, doxycycline_use <lgl>, enramycin_use <lgl>,
## # enrofloxacin_use <lgl>, erythromycin_use <lgl>, florfenicol_use <lgl>, flumequine_use <lgl>, gentamicin_use <lgl>,
## # josamycin_use <lgl>, kitasamycin_use <lgl>, lincomycin_use <lgl>, marbofloxacin_use <lgl>, methenamine_use <lgl>,
## # neomycin_use <lgl>, norfloxacin_use <lgl>, oxytetracycline_use <lgl>, spectinomycin_use <lgl>, spiramycin_use <lgl>,
## # streptomycin_use <lgl>, sulfachloropyridazine_use <lgl>, sulfadiazine_use <lgl>, sulfadimethoxine_use <lgl>,
## # sulfadimidine_use <lgl>, sulfaguanidin_use <lgl>, sulfamethazine_use <lgl>, sulfamethoxazole_use <lgl>,
## # sulfamethoxypyridazine_use <lgl>, sulphamethoxazole_use <lgl>, sulphathiazole_use <lgl>, tetracycline_use <lgl>,
## # thiamphenicol_use <lgl>, tiamulin_use <lgl>, tilmicosin_use <lgl>, trimethoprim_use <lgl>, tylosin_use <lgl>, unknown_use <lgl>
Seems to work as expected.
The make_outcome()
function is another use of the lagging()
function:
make_outcome <- function(df, outcome, loc) {
require(magrittr)
week_id <- c("farm", "flock", "week")
lapply(-seq_len(loc), lagging, df = df, var = outcome) %>% # this is a list of loc slots
purrr::reduce(dplyr::left_join, week_id) %>% # puts together into a single data frame
dplyr::left_join(df, ., week_id) %>% # merges with original data frame
dplyr::select(dplyr::matches("-\\d+$")) %>%
rowSums() %>%
as.logical()
}
where outcome
is a vector of clinical signs names used to compute the outcome, loc
is the duration, in weeks, of the observation period. Let’s try it:
viparc[356:386, ] %>%
dplyr::mutate(disease = make_outcome(., cs_set, 2)) %>%
dplyr::select(farm, flock, week, cs_set, disease)
## # A tibble: 31 x 6
## farm flock week malaise diarrhoea disease
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-004 7 13 FALSE FALSE FALSE
## 2 75-004 7 14 FALSE FALSE FALSE
## 3 75-004 7 15 FALSE FALSE FALSE
## 4 75-004 7 16 FALSE FALSE NA
## 5 75-004 7 17 FALSE FALSE NA
## 6 75-005 1 1 TRUE TRUE TRUE
## 7 75-005 1 2 TRUE TRUE TRUE
## 8 75-005 1 3 TRUE TRUE FALSE
## 9 75-005 1 4 FALSE FALSE FALSE
## 10 75-005 1 5 FALSE FALSE FALSE
## # … with 21 more rows
Seems to work as expected. Note the missing values at the end of the flocks.
The make_exposure()
function is very simple:
make_exposure <- function(df, antimicrobials) {
require(magrittr)
df %>%
dplyr::select(antimicrobials) %>%
rowSums() %>%
as.logical()
}
Where antimicrobials
is a vector of antimicrobials names that should be considered to compute the exposure. Let’s try it:
viparc %>%
dplyr::mutate(amu = make_exposure(., am_set)) %>%
dplyr::select(farm, flock, week, am_set, amu)
## # A tibble: 5,566 x 6
## farm flock week colistin_use oxytetracycline_use amu
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-001 1 1 TRUE TRUE TRUE
## 2 75-001 1 2 FALSE FALSE FALSE
## 3 75-001 1 3 FALSE FALSE FALSE
## 4 75-001 1 4 FALSE FALSE FALSE
## 5 75-001 1 5 FALSE FALSE FALSE
## 6 75-001 1 6 FALSE FALSE FALSE
## 7 75-001 1 7 FALSE FALSE FALSE
## 8 75-001 1 8 FALSE FALSE FALSE
## 9 75-001 1 9 FALSE TRUE TRUE
## 10 75-001 1 10 FALSE FALSE FALSE
## # … with 5,556 more rows
Seems to work as expected.
The final statistical analysis can be performed correcting for potential confounders. Potential confounders that we may want to consider are the AMU during the first few weeks of the cycle, AMU during the observation period, and the age of the flock. For the latter one, the variable is simply the week number. The two other variables are respectively computed by the first_weeks()
and make_outcome()
functions as detailed below.
The following function computes a boolean variable that specifies whether any of the antimicrobials defined in the antimicrobials
vector has been used during the first wk
weeks of the flock’s life:
first_weeks <- function(df, antimicrobials, wk = 1) {
df %>%
dplyr::filter(week < wk + 1) %>%
dplyr::group_by(farm, flock) %>%
dplyr::group_modify(~ tibble::tibble(out = any(unlist(.x[, antimicrobials])))) %>%
dplyr::ungroup() %>%
dplyr::left_join(df, ., c("farm", "flock")) %>%
dplyr::select(out) %>%
unlist()
}
Let’s try it:
viparc %>%
dplyr::mutate(amu_first_weeks = first_weeks(., am_set)) %>%
dplyr::select(farm, flock, week, am_set, amu_first_weeks)
## # A tibble: 5,566 x 6
## farm flock week colistin_use oxytetracycline_use amu_first_weeks
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-001 1 1 TRUE TRUE TRUE
## 2 75-001 1 2 FALSE FALSE TRUE
## 3 75-001 1 3 FALSE FALSE TRUE
## 4 75-001 1 4 FALSE FALSE TRUE
## 5 75-001 1 5 FALSE FALSE TRUE
## 6 75-001 1 6 FALSE FALSE TRUE
## 7 75-001 1 7 FALSE FALSE TRUE
## 8 75-001 1 8 FALSE FALSE TRUE
## 9 75-001 1 9 FALSE TRUE TRUE
## 10 75-001 1 10 FALSE FALSE TRUE
## # … with 5,556 more rows
And with the first 2 weeks:
viparc %>%
dplyr::mutate(amu_first_weeks = first_weeks(., am_set, 2)) %>%
dplyr::select(farm, flock, week, am_set, amu_first_weeks)
## # A tibble: 5,566 x 6
## farm flock week colistin_use oxytetracycline_use amu_first_weeks
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-001 1 1 TRUE TRUE TRUE
## 2 75-001 1 2 FALSE FALSE TRUE
## 3 75-001 1 3 FALSE FALSE TRUE
## 4 75-001 1 4 FALSE FALSE TRUE
## 5 75-001 1 5 FALSE FALSE TRUE
## 6 75-001 1 6 FALSE FALSE TRUE
## 7 75-001 1 7 FALSE FALSE TRUE
## 8 75-001 1 8 FALSE FALSE TRUE
## 9 75-001 1 9 FALSE TRUE TRUE
## 10 75-001 1 10 FALSE FALSE TRUE
## # … with 5,556 more rows
Looking at another cycle from another farm:
viparc %>%
dplyr::mutate(amu_first_weeks = first_weeks(., am_set)) %>%
dplyr::select(farm, flock, week, am_set, amu_first_weeks) %>%
dplyr::filter(farm == "75-013")
## # A tibble: 54 x 6
## farm flock week colistin_use oxytetracycline_use amu_first_weeks
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-013 1 1 TRUE FALSE TRUE
## 2 75-013 1 2 TRUE FALSE TRUE
## 3 75-013 1 3 FALSE FALSE TRUE
## 4 75-013 1 4 FALSE FALSE TRUE
## 5 75-013 1 5 FALSE FALSE TRUE
## 6 75-013 1 6 FALSE FALSE TRUE
## 7 75-013 1 7 FALSE FALSE TRUE
## 8 75-013 1 8 FALSE FALSE TRUE
## 9 75-013 1 9 FALSE FALSE TRUE
## 10 75-013 1 10 TRUE FALSE TRUE
## # … with 44 more rows
Seems to work as expected.
If one wants to compute a covariable accounting for AMU during the watching period, this can simply be done by reusing the make_outcome()
function defined above:
viparc %>%
dplyr::mutate(amu_observation = make_outcome(., am_set, 2)) %>%
dplyr::select(farm, flock, week, am_set, amu_observation) %>%
head(20)
## # A tibble: 20 x 6
## farm flock week colistin_use oxytetracycline_use amu_observation
## <chr> <int> <int> <lgl> <lgl> <lgl>
## 1 75-001 1 1 TRUE TRUE FALSE
## 2 75-001 1 2 FALSE FALSE FALSE
## 3 75-001 1 3 FALSE FALSE FALSE
## 4 75-001 1 4 FALSE FALSE FALSE
## 5 75-001 1 5 FALSE FALSE FALSE
## 6 75-001 1 6 FALSE FALSE FALSE
## 7 75-001 1 7 FALSE FALSE TRUE
## 8 75-001 1 8 FALSE FALSE TRUE
## 9 75-001 1 9 FALSE TRUE TRUE
## 10 75-001 1 10 FALSE FALSE TRUE
## 11 75-001 1 11 TRUE FALSE FALSE
## 12 75-001 1 12 FALSE FALSE FALSE
## 13 75-001 1 13 FALSE FALSE FALSE
## 14 75-001 1 14 FALSE FALSE TRUE
## 15 75-001 1 15 FALSE FALSE TRUE
## 16 75-001 1 16 TRUE FALSE FALSE
## 17 75-001 1 17 FALSE FALSE FALSE
## 18 75-001 1 18 FALSE FALSE FALSE
## 19 75-001 1 19 FALSE FALSE FALSE
## 20 75-001 1 20 FALSE FALSE FALSE
Here we show how to assemble the functions prophylactic()
, make_exposure()
, make_outcome()
and first_weeks()
into a pipeline to test a hypothesis.
WARNING 1: in these pipelines, the filtering of the weeks should always be the last step of the data preparation, just before the statistical analysis.
WARNING 2: na.exclude()
calls should always be done after the selection of variables.
For the sake of the examples, we will consider the following data:
(data1 <- viparc %>%
dplyr::mutate(amu = make_exposure(., am_set),
disease = make_outcome(., cs_set, 2)) %>%
dplyr::filter(prophylactic(., cs_set, am_set, 2, 2)) %>%
dplyr::select(amu, disease) %>%
na.exclude())
## # A tibble: 2,412 x 2
## amu disease
## <lgl> <lgl>
## 1 TRUE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 FALSE FALSE
## 6 FALSE FALSE
## 7 TRUE FALSE
## 8 FALSE FALSE
## 9 FALSE FALSE
## 10 TRUE FALSE
## # … with 2,402 more rows
Simple use of a Fisher exact test, without any co-variable:
data1 %$%
table(amu, disease) %T>% print() %>%
fisher.test()
## disease
## amu FALSE TRUE
## FALSE 1933 276
## TRUE 149 54
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 3.037e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.777194 3.582548
## sample estimates:
## odds ratio
## 2.536786
A logistic regression can equivalently be used:
data1 %>%
glm(disease ~ amu, binomial, .) %>%
summary()
##
## Call:
## glm(formula = disease ~ amu, family = binomial, data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7865 -0.5167 -0.5167 -0.5167 2.0396
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.94643 0.06435 -30.249 < 2e-16 ***
## amuTRUE 0.93147 0.17138 5.435 5.47e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1925.5 on 2411 degrees of freedom
## Residual deviance: 1899.3 on 2410 degrees of freedom
## AIC: 1903.3
##
## Number of Fisher Scoring iterations: 4
where the estimate of the coefficient (amuTRUE
) is the log odds ratio.
In this section we address the problem of confounders and explore on an example different options to control for them.
Before we start this section, let’s define another utilitary function that adds a vector of suffixes suff
to a character vector x
where we want all possible combinations of x
and suff
:
add_suffix <- function(x, suff) {
require(magrittr)
expand.grid(x, suff) %>%
purrr::reduce(purrr::map2, paste0) %>%
unlist()
}
Let’s try it:
add_suffix(letters[1:5], 1:3)
## [1] "a1" "b1" "c1" "d1" "e1" "a2" "b2" "c2" "d2" "e2" "a3" "b3" "c3" "d3" "e3"
Let’s now consider the following example:
(data2 <- viparc %>%
dplyr::mutate(amu = make_exposure(., am_set),
disease = make_outcome(., cs_set, 2),
amu_first_weeks = first_weeks(., am_set),
amu_observation = make_outcome(., am_set, 2)) %>%
dplyr::filter(prophylactic(., cs_set, am_set, 2, 2)) %>%
dplyr::select(amu, disease, week, amu_first_weeks, amu_observation))
## # A tibble: 2,942 x 5
## amu disease week amu_first_weeks amu_observation
## <lgl> <lgl> <int> <lgl> <lgl>
## 1 TRUE FALSE 1 TRUE FALSE
## 2 FALSE FALSE 4 TRUE FALSE
## 3 FALSE FALSE 5 TRUE FALSE
## 4 FALSE FALSE 6 TRUE FALSE
## 5 FALSE FALSE 7 TRUE TRUE
## 6 FALSE FALSE 8 TRUE TRUE
## 7 TRUE FALSE 9 TRUE TRUE
## 8 FALSE FALSE 14 TRUE TRUE
## 9 FALSE FALSE 15 TRUE TRUE
## 10 TRUE FALSE 16 TRUE FALSE
## # … with 2,932 more rows
As we can see below, both exposure (amu
) and outcome (disease
) depend on week
, amu_first_weeks
and amu_observation
:
purrr::map(add_suffix(c("amu", "disease"), paste(" ~", c("week", "amu_first_weeks", "amu_observation"))),
~ .x %>%
formula() %>%
glm(binomial, na.exclude(data2)) %>%
summary() %>%
coef() %>%
`[`(2, 4) %>%
data.frame(model = .x, p = ., stringsAsFactors = FALSE)) %>%
dplyr::bind_rows()
## model p
## 1 amu ~ week 3.103919e-57
## 2 disease ~ week 6.290267e-42
## 3 amu ~ amu_first_weeks 6.144753e-12
## 4 disease ~ amu_first_weeks 3.284952e-05
## 5 amu ~ amu_observation 9.720589e-32
## 6 disease ~ amu_observation 3.672945e-15
These three variables week
, amu_first_weeks
and amu_observation
are thus potential confounders that need to be corrected for.
If one wants to consider (and correct for) covariables – such as age, AMU on the first week or AMU during the observation period –, the pipeline would look as below:
data2 %>%
na.exclude() %T>%
{dplyr::group_by(., amu, disease, amu_first_weeks, amu_observation) %>%
dplyr::tally() %>%
dplyr::ungroup() %>%
print()} %>%
glm(disease ~ amu_first_weeks + amu_observation + week + amu, binomial, .) %T>%
{summary(.) %>% print()} %>%
anova(test = "LRT")
## # A tibble: 16 x 5
## amu disease amu_first_weeks amu_observation n
## <lgl> <lgl> <lgl> <lgl> <int>
## 1 FALSE FALSE FALSE FALSE 630
## 2 FALSE FALSE FALSE TRUE 23
## 3 FALSE FALSE TRUE FALSE 1219
## 4 FALSE FALSE TRUE TRUE 61
## 5 FALSE TRUE FALSE FALSE 117
## 6 FALSE TRUE FALSE TRUE 20
## 7 FALSE TRUE TRUE FALSE 110
## 8 FALSE TRUE TRUE TRUE 29
## 9 TRUE FALSE FALSE FALSE 3
## 10 TRUE FALSE FALSE TRUE 9
## 11 TRUE FALSE TRUE FALSE 96
## 12 TRUE FALSE TRUE TRUE 41
## 13 TRUE TRUE FALSE FALSE 6
## 14 TRUE TRUE FALSE TRUE 1
## 15 TRUE TRUE TRUE FALSE 31
## 16 TRUE TRUE TRUE TRUE 16
##
## Call:
## glm(formula = disease ~ amu_first_weeks + amu_observation + week +
## amu, family = binomial, data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3183 -0.5354 -0.3893 -0.2808 2.8530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.19480 0.15014 -1.297 0.194486
## amu_first_weeksTRUE -0.33233 0.13632 -2.438 0.014775 *
## amu_observationTRUE 0.68800 0.18153 3.790 0.000151 ***
## week -0.16788 0.01515 -11.078 < 2e-16 ***
## amuTRUE -0.28454 0.21318 -1.335 0.181959
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1925.5 on 2411 degrees of freedom
## Residual deviance: 1693.1 on 2407 degrees of freedom
## AIC: 1703.1
##
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 2411 1925.5
## amu_first_weeks 1 16.900 2410 1908.6 3.939e-05 ***
## amu_observation 1 58.215 2409 1850.3 2.350e-14 ***
## week 1 155.396 2408 1695.0 < 2.2e-16 ***
## amu 1 1.812 2407 1693.1 0.1783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
NOTE: If one wanted to correct for potential farm and / or flock effect, he’d have to consider mixed effect models (not done here). (Not even sure we need to do this here.)
Note that accounting for potential confounders makes the effect of the AMU exposure not significant anymore. Furthermore the estimated odds ratio is now below 1. Let’s try to drop the week
variable:
data2 %>%
dplyr::select(-week) %>%
na.exclude() %>%
glm(disease ~ amu_first_weeks + amu_observation + amu, binomial, .) %T>% {summary(.) %>% print()} %>%
anova(test = "LRT")
##
## Call:
## glm(formula = disease ~ amu_first_weeks + amu_observation + amu,
## family = binomial, data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2952 -0.5907 -0.4334 -0.4334 2.1963
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65752 0.09511 -17.427 < 2e-16 ***
## amu_first_weeksTRUE -0.66054 0.12668 -5.214 1.85e-07 ***
## amu_observationTRUE 1.14139 0.17587 6.490 8.59e-11 ***
## amuTRUE 0.78888 0.19154 4.119 3.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1925.5 on 2411 degrees of freedom
## Residual deviance: 1834.7 on 2408 degrees of freedom
## AIC: 1842.7
##
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 2411 1925.5
## amu_first_weeks 1 16.900 2410 1908.6 3.939e-05 ***
## amu_observation 1 58.215 2409 1850.3 2.350e-14 ***
## amu 1 15.675 2408 1834.7 7.521e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
which suggests that the week
variable (i.e. the age of the flock) is really the variable that causes the spurious relationship between prophylactic use (amu
) and occurence of infection (disease
). Let’s try again with week
but this time also including interaction with amu
:
data2 %>%
na.exclude() %>%
glm(disease ~ amu_first_weeks + amu_observation + week * amu, binomial, .) %>%
anova(test = "LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 2411 1925.5
## amu_first_weeks 1 16.900 2410 1908.6 3.939e-05 ***
## amu_observation 1 58.215 2409 1850.3 2.350e-14 ***
## week 1 155.396 2408 1695.0 < 2.2e-16 ***
## amu 1 1.812 2407 1693.1 0.1783
## week:amu 1 1.182 2406 1692.0 0.2769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It doesn’t really help. Let’s try modeling the age effect in a more flexible way than a linear relationship, using GAM:
summary(mgcv::gam(disease ~ amu_first_weeks + amu_observation + s(week) + amu,
binomial, na.exclude(data2)))
##
## Family: binomial
## Link function: logit
##
## Formula:
## disease ~ amu_first_weeks + amu_observation + s(week) + amu
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95978 0.11121 -17.622 < 2e-16 ***
## amu_first_weeksTRUE -0.41279 0.13906 -2.968 0.002994 **
## amu_observationTRUE 0.68377 0.18023 3.794 0.000148 ***
## amuTRUE 0.07168 0.23712 0.302 0.762441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(week) 4.012 4.929 134 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.107 Deviance explained = 12.9%
## UBRE = -0.29821 Scale est. = 1 n = 2412
Not better. Let’s try to stratify. For example, under the age of 6 weeks there seems to have not much effect of week
, amu_first_weeks
and amu_observation
. Look at the significativity of the confounders for the 1-5-week age strata:
data2 %>%
dplyr::filter(week < 6) %>%
na.exclude() %>%
glm(disease ~ amu_first_weeks + amu_observation + week + amu, binomial, .) %>%
anova(test = "LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 471 594.68
## amu_first_weeks 1 5.2854 470 589.40 0.0215 *
## amu_observation 1 2.3992 469 587.00 0.1214
## week 1 0.1597 468 586.84 0.6894
## amu 1 1.7539 467 585.08 0.1854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and for the 1-6-week age strata:
data2 %>%
dplyr::filter(week < 7) %>%
na.exclude() %>%
glm(disease ~ amu_first_weeks + amu_observation + week + amu, binomial, .) %>%
anova(test = "LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 576 713.04
## amu_first_weeks 1 8.5731 575 704.47 0.003412 **
## amu_observation 1 2.7580 574 701.71 0.096770 .
## week 1 0.2318 573 701.48 0.630219
## amu 1 1.5159 572 699.97 0.218239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s thus consider the 6-weeks threshold and investigate the relationship between disease
and amu
:
data2 %>%
dplyr::filter(week < 6) %>%
dplyr::select(amu, disease) %>%
na.exclude() %>%
glm(disease ~ amu, binomial, .) %>%
anova(test = "LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 471 594.68
## amu 1 0.075703 470 594.61 0.7832
No significant relationship between prophylaxy and disease occurence in the 1-to-6-week age band.
In this section we show how to build pipelines that combine data preparation and data analysis.
Let’s redefine a Fisher exact test function that returns the results of the test together with the input contingency table:
fisher_test <- function(x, ...) {
list(x, fisher.test(x, ...))
}
Let’s define a first pipeline using this new function:
pipeline1 <- function(df, clinical_signs_filtering = NULL, antimicrobials_filtering = NULL, lcs,
lam, antimicrobials_input = NULL, clinical_signs_output = NULL, watching, ...) {
require(magrittr)
var_names <- setdiff(names(df), c("farm", "flock", "week"))
cs_names <- grep("_use$", var_names, value = TRUE, invert = TRUE)
am_names <- grep("_use$", var_names, value = TRUE)
if (is.null(clinical_signs_filtering)) clinical_signs_filtering <- cs_names
if (is.null(clinical_signs_output)) clinical_signs_output <- cs_names
if (is.null(antimicrobials_filtering)) antimicrobials_filtering <- am_names
if (is.null(antimicrobials_input)) antimicrobials_input <- am_names
df %>%
dplyr::mutate(amu = make_exposure(., antimicrobials_input), # creates amu variable
disease = make_outcome(., clinical_signs_output, watching)) %>% # creates disease variable
dplyr::filter(prophylactic(., clinical_signs_filtering, antimicrobials_filtering, lcs, lam)) %>%
dplyr::select(amu, disease) %>%
na.exclude() %$%
table(amu, disease) %>%
fisher_test(., ...)
}
Let’s try it:
pipeline1(viparc,
# 1. FILTERING -----------------------------------------------------------------
clinical_signs_filtering = cs_set,
antimicrobials_filtering = am_set,
lcs = 2, # the duration over which we perform the filtering on clinical signs
lam = 2, # the duration over which we perform the filtering on antimicrobials use
# 2. INPUT ---------------------------------------------------------------------
antimicrobials_input = am_set,
# 3. OUTPUT --------------------------------------------------------------------
clinical_signs_output = cs_set,
watching = 2) # the duration of the observation period
## [[1]]
## disease
## amu FALSE TRUE
## FALSE 1933 276
## TRUE 149 54
##
## [[2]]
##
## Fisher's Exact Test for Count Data
##
## data: x
## p-value = 3.037e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.777194 3.582548
## sample estimates:
## odds ratio
## 2.536786
A simplified version of pipeline1()
could be:
pipeline2 <- function(df, clinical_signs = NULL, antimicrobials = NULL, filtering, watching, ...)
pipeline1(df, clinical_signs, antimicrobials, filtering, filtering, antimicrobials, clinical_signs, watching, ...)
And a call of this function would look like:
a <- pipeline2(viparc,
# 1. DEFINING THE CLINICAL SIGNS WE CONSIDER -----------------------------------
clinical_signs = cs_set,
# 2. DEFINING THE ANTIMICROBIALS USAGES WE CONSIDER ----------------------------
antimicrobials = am_set,
# 3. THE DURATION OF THE PERIOD BEFORE THE FOCAL WEEK OVER WHICH WE DO THE FILTERING
filtering = 2,
# 4. THE DURATION OF THE PERIOD AFTER THE FOCAL WEEK OVER WHICH WE OBSERVE -----
watching = 2)
Let’s explore the structure of the produced object:
str(a)
## List of 2
## $ : 'table' int [1:2, 1:2] 1933 149 276 54
## ..- attr(*, "dimnames")=List of 2
## .. ..$ amu : chr [1:2] "FALSE" "TRUE"
## .. ..$ disease: chr [1:2] "FALSE" "TRUE"
## $ :List of 7
## ..$ p.value : num 3.04e-07
## ..$ conf.int : num [1:2] 1.78 3.58
## .. ..- attr(*, "conf.level")= num 0.95
## ..$ estimate : Named num 2.54
## .. ..- attr(*, "names")= chr "odds ratio"
## ..$ null.value : Named num 1
## .. ..- attr(*, "names")= chr "odds ratio"
## ..$ alternative: chr "two.sided"
## ..$ method : chr "Fisher's Exact Test for Count Data"
## ..$ data.name : chr "x"
## ..- attr(*, "class")= chr "htest"
Thanks to pipeline2()
, it’s very easy to explore different options for the analysis. For example, let’s compare this
pipeline2(viparc,
# 1. DEFINING THE CLINICAL SIGNS WE CONSIDER -----------------------------------
clinical_signs = cs_set,
# 2. DEFINING THE ANTIMICROBIALS USAGES WE CONSIDER ----------------------------
antimicrobials = am_set,
# 3. THE DURATION OF THE PERIOD BEFORE THE FOCAL WEEK OVER WHICH WE DO THE FILTERING
filtering = 2,
# 4. THE DURATION OF THE PERIOD AFTER THE FOCAL WEEK OVER WHICH WE OBSERVE -----
watching = 2)
## [[1]]
## disease
## amu FALSE TRUE
## FALSE 1933 276
## TRUE 149 54
##
## [[2]]
##
## Fisher's Exact Test for Count Data
##
## data: x
## p-value = 3.037e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.777194 3.582548
## sample estimates:
## odds ratio
## 2.536786
with this:
pipeline2(viparc,
# 1. DEFINING THE CLINICAL SIGNS WE CONSIDER -----------------------------------
clinical_signs = c(cs_set, "respiratory"),
# 2. DEFINING THE ANTIMICROBIALS USAGES WE CONSIDER ----------------------------
antimicrobials = am_set,
# 3. THE DURATION OF THE PERIOD BEFORE THE FOCAL WEEK OVER WHICH WE DO THE FILTERING
filtering = 2,
# 4. THE DURATION OF THE PERIOD AFTER THE FOCAL WEEK OVER WHICH WE OBSERVE -----
watching = 2)
## [[1]]
## disease
## amu FALSE TRUE
## FALSE 1830 296
## TRUE 143 54
##
## [[2]]
##
## Fisher's Exact Test for Count Data
##
## data: x
## p-value = 3.191e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.634042 3.296929
## sample estimates:
## odds ratio
## 2.333546
If you want all the clinical signs and all the antimicrobials:
pipeline2(viparc, filtering = 2, watching = 2)
## [[1]]
## disease
## amu FALSE TRUE
## FALSE 1342 244
## TRUE 142 62
##
## [[2]]
##
## Fisher's Exact Test for Count Data
##
## data: x
## p-value = 5.391e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.698328 3.365424
## sample estimates:
## odds ratio
## 2.399929
We can use similar rationale in order to define pipeline functions that perform logistic regressions with covariables:
pipeline3 <- function(df, clinical_signs_filtering = NULL, antimicrobials_filtering = NULL, lcs,
lam, antimicrobials_input = NULL, clinical_signs_output = NULL, watching,
antimicrobials_week1 = NULL, wk, antimicrobials_watching = NULL) {
require(magrittr)
var_names <- setdiff(names(df), c("farm", "flock", "week"))
cs_names <- grep("_use$", var_names, value = TRUE, invert = TRUE)
am_names <- grep("_use$", var_names, value = TRUE)
if (is.null(clinical_signs_filtering)) clinical_signs_filtering <- cs_names
if (is.null(clinical_signs_output)) clinical_signs_output <- cs_names
if (is.null(antimicrobials_filtering)) antimicrobials_filtering <- am_names
if (is.null(antimicrobials_input)) antimicrobials_input <- am_names
if (is.null(antimicrobials_week1)) antimicrobials_week1 <- am_names
if (is.null(antimicrobials_watching)) antimicrobials_watching <- am_names
df %>%
dplyr::mutate(amu = make_exposure(., antimicrobials_input),
disease = make_outcome(., clinical_signs_output, watching),
amu_first_weeks = first_weeks(., antimicrobials_week1, wk),
amu_observation = make_outcome(., antimicrobials_watching, watching)) %>%
dplyr::filter(prophylactic(., clinical_signs_filtering, antimicrobials_filtering, lcs, lam)) %>%
dplyr::select(farm, flock, week, amu_first_weeks, amu_observation, amu, disease) %>%
na.exclude() %>%
glm(disease ~ amu_first_weeks + amu_observation + week + amu, binomial, .)
}
A wrapper with reduced flexibility in the interface:
pipeline4 <- function(df, clinical_signs = NULL, antimicrobials = NULL, filtering, watching, first_wks)
pipeline3(df, clinical_signs, antimicrobials, filtering, filtering, antimicrobials, clinical_signs, watching, antimicrobials, first_wks, antimicrobials)
Let’s try it:
pipeline4(viparc, filtering = 2, watching = 2, first_wks = 1) %T>%
{summary(.) %>% print()} %>%
anova(test = "LRT")
##
## Call:
## glm(formula = disease ~ amu_first_weeks + amu_observation + week +
## amu, family = binomial, data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5733 -0.5509 -0.4180 -0.3150 2.6803
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24475 0.17737 -1.380 0.1676
## amu_first_weeksTRUE -0.24716 0.15040 -1.643 0.1003
## amu_observationTRUE 1.28602 0.16194 7.941 2e-15 ***
## week -0.14630 0.01541 -9.492 <2e-16 ***
## amuTRUE -0.41588 0.21103 -1.971 0.0488 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1637.5 on 1789 degrees of freedom
## Residual deviance: 1396.3 on 1785 degrees of freedom
## AIC: 1406.3
##
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: disease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1789 1637.5
## amu_first_weeks 1 8.602 1788 1628.8 0.003357 **
## amu_observation 1 129.122 1787 1499.7 < 2.2e-16 ***
## week 1 99.418 1786 1400.3 < 2.2e-16 ***
## amu 1 3.993 1785 1396.3 0.045695 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same thing as above but using a GAM instead of a GLM.:
pipeline5 <- function(df, clinical_signs_filtering = NULL, antimicrobials_filtering = NULL, lcs,
lam, antimicrobials_input = NULL, clinical_signs_output = NULL, watching,
antimicrobials_week1 = NULL, wk, antimicrobials_watching = NULL) {
require(magrittr)
var_names <- setdiff(names(df), c("farm", "flock", "week"))
cs_names <- grep("_use$", var_names, value = TRUE, invert = TRUE)
am_names <- grep("_use$", var_names, value = TRUE)
if (is.null(clinical_signs_filtering)) clinical_signs_filtering <- cs_names
if (is.null(clinical_signs_output)) clinical_signs_output <- cs_names
if (is.null(antimicrobials_filtering)) antimicrobials_filtering <- am_names
if (is.null(antimicrobials_input)) antimicrobials_input <- am_names
if (is.null(antimicrobials_week1)) antimicrobials_week1 <- am_names
if (is.null(antimicrobials_watching)) antimicrobials_watching <- am_names
df %>%
dplyr::mutate(amu = make_exposure(., antimicrobials_input),
disease = make_outcome(., clinical_signs_output, watching),
amu_first_weeks = first_weeks(., antimicrobials_week1, wk),
amu_observation = make_outcome(., antimicrobials_watching, watching)) %>%
dplyr::filter(prophylactic(., clinical_signs_filtering, antimicrobials_filtering, lcs, lam)) %>%
dplyr::select(farm, flock, week, amu_first_weeks, amu_observation, amu, disease) %>%
na.exclude() %>%
mgcv::gam(disease ~ amu_first_weeks + amu_observation + s(week) + amu, binomial, ., method = "REML")
}
And the wrapper with the simple interface:
pipeline6 <- function(df, clinical_signs = NULL, antimicrobials = NULL, filtering, watching, first_wks)
pipeline5(df, clinical_signs, antimicrobials, filtering, filtering, antimicrobials, clinical_signs, watching, antimicrobials, first_wks, antimicrobials)
Let’s try it:
pipeline6(viparc, filtering = 2, watching = 2, first_wks = 1) %>%
summary()
##
## Family: binomial
## Link function: logit
##
## Formula:
## disease ~ amu_first_weeks + amu_observation + s(week) + amu
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86077 0.13006 -14.307 < 2e-16 ***
## amu_first_weeksTRUE -0.35715 0.15206 -2.349 0.0188 *
## amu_observationTRUE 1.24355 0.16230 7.662 1.83e-14 ***
## amuTRUE 0.01794 0.22197 0.081 0.9356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(week) 4.097 5.021 111.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.147 Deviance explained = 16.5%
## -REML = 695.09 Scale est. = 1 n = 1790
Let’s define an S3 generic function that extracts all the information that we are interested in from the output of an object of class glm
or mgcv
:
get_outputs <- function (x, ...) {
UseMethod("get_outputs", x)
}
Let’s now define a glm
method for that generic function:
get_outputs.glm <- function(x) {
data <- x$model
m <- with(data, table(amu, disease))
conf_int <- confint(x)
list(n = nrow(data),
ct = addmargins(m),
prob = round(sweep(m, 1, rowSums(m), "/"), 2),
or = exp(tail(coef(x), 1)),
p = tail(anova(x, test = "LRT")$`Pr(>Chi)`, 1),
ci = conf_int[nrow(conf_int), ])
}
Let’s try it:
get_outputs(pipeline4(viparc, filtering = 2, watching = 2, first_wks = 1))
## $n
## [1] 1790
##
## $ct
## disease
## amu FALSE TRUE Sum
## FALSE 1342 244 1586
## TRUE 142 62 204
## Sum 1484 306 1790
##
## $prob
## disease
## amu FALSE TRUE
## FALSE 0.85 0.15
## TRUE 0.70 0.30
##
## $or
## amuTRUE
## 0.65976
##
## $p
## [1] 0.04569518
##
## $ci
## 2.5 % 97.5 %
## -0.835762763 -0.007863199
Let’s now define a batch of durations values:
(batch <- expand.grid(filtering = 1:2, watching = 1:2, first_wks = 1:2))
## filtering watching first_wks
## 1 1 1 1
## 2 2 1 1
## 3 1 2 1
## 4 2 2 1
## 5 1 1 2
## 6 2 1 2
## 7 1 2 2
## 8 2 2 2
And let’s run pipeline4()
on this batch:
batch %>%
purrr::pmap(pipeline4, df = viparc) %>%
lapply(get_outputs) %>%
head(2)
## [[1]]
## [[1]]$n
## [1] 2657
##
## [[1]]$ct
## disease
## amu FALSE TRUE Sum
## FALSE 2126 231 2357
## TRUE 242 58 300
## Sum 2368 289 2657
##
## [[1]]$prob
## disease
## amu FALSE TRUE
## FALSE 0.90 0.10
## TRUE 0.81 0.19
##
## [[1]]$or
## amuTRUE
## 0.8025573
##
## [[1]]$p
## [1] 0.2482423
##
## [[1]]$ci
## 2.5 % 97.5 %
## -0.6039546 0.1509620
##
##
## [[2]]
## [[2]]$n
## [1] 2023
##
## [[2]]$ct
## disease
## amu FALSE TRUE Sum
## FALSE 1671 146 1817
## TRUE 165 41 206
## Sum 1836 187 2023
##
## [[2]]$prob
## disease
## amu FALSE TRUE
## FALSE 0.92 0.08
## TRUE 0.80 0.20
##
## [[2]]$or
## amuTRUE
## 0.7962257
##
## [[2]]$p
## [1] 0.3487159
##
## [[2]]$ci
## 2.5 % 97.5 %
## -0.7178204 0.2446093
Let’s now define a gam
method for the get_outputs()
generic:
get_outputs.gam <- function(x, level = .95) {
data <- x$model
m <- with(data, table(amu, disease))
summ <- summary(x)
conf_int <- mgcv.helper::confint.gam(x, level = level)
list(n = nrow(data),
ct = addmargins(m),
prob = round(sweep(m, 1, rowSums(m), "/"), 2),
or = unname(tail(summ$p.coef, 1)),
p = unname(tail(summ$p.pv, 1)),
ci = unlist(conf_int[nrow(conf_int), 5:6]))
}
Let’s try it:
get_outputs(pipeline6(viparc, filtering = 2, watching = 2, first_wks = 1))
## $n
## [1] 1790
##
## $ct
## disease
## amu FALSE TRUE Sum
## FALSE 1342 244 1586
## TRUE 142 62 204
## Sum 1484 306 1790
##
## $prob
## disease
## amu FALSE TRUE
## FALSE 0.85 0.15
## TRUE 0.70 0.30
##
## $or
## [1] 0.0179444
##
## $p
## [1] 0.935568
##
## $ci
## 2.5% 97.5%
## -0.4174045 0.4532934
Let’s now run it on a batch of durations values, and put the output in shape:
batch %>%
tibble::as_tibble() %>%
assign("plan", ., 1) %>% # saving intermediate value
purrr::pmap(pipeline6, df = viparc) %>%
lapply(get_outputs) %>%
purrr::transpose() %>%
tibble::as_tibble() %>%
dplyr::mutate_at(c("n", "or", "p"), unlist) %>%
dplyr::bind_cols(plan, .) # intermediate value re-used here
## # A tibble: 8 x 9
## filtering watching first_wks n ct prob or p ci
## <int> <int> <int> <int> <list> <list> <dbl> <dbl> <list>
## 1 1 1 1 2657 <table> <table> 0.0357 0.855 <dbl [2]>
## 2 2 1 1 2023 <table> <table> 0.236 0.350 <dbl [2]>
## 3 1 2 1 2398 <table> <table> -0.205 0.232 <dbl [2]>
## 4 2 2 1 1790 <table> <table> 0.0179 0.936 <dbl [2]>
## 5 1 1 2 2657 <table> <table> 0.0692 0.723 <dbl [2]>
## 6 2 1 2 2023 <table> <table> 0.313 0.211 <dbl [2]>
## 7 1 2 2 2398 <table> <table> -0.171 0.318 <dbl [2]>
## 8 2 2 2 1790 <table> <table> 0.0741 0.737 <dbl [2]>
run_plan()
and CoLet’s make a function out of this above code:
run_plan <- function(f, df, clinical_signs = NULL, antimicrobials = NULL, filtering = 1, watching = 1, first_wks = 1, nb_cores = 1) {
require(magrittr)
var_names <- setdiff(names(df), c("farm", "flock", "week"))
all_antimicrobials <- list(grep("_use$", var_names, value = TRUE))
all_clinical_signs <- list(grep("_use$", var_names, value = TRUE, invert = TRUE))
if (is.null(antimicrobials)) antimicrobials <- all_antimicrobials
else if (! is.list(antimicrobials)) antimicrobials <- list(antimicrobials)
if (is.null(clinical_signs)) clinical_signs <- all_clinical_signs
else if (! is.list(clinical_signs)) clinical_signs <- list(clinical_signs)
is_na <- function(x) {
if (length(x) > 1) return(FALSE)
is.na(x)
}
sel <- which(sapply(antimicrobials, is_na))
if (length(sel)) antimicrobials[sel] <- all_antimicrobials
sel <- which(sapply(clinical_signs, is_na))
if (length(sel)) clinical_signs[sel] <- all_clinical_signs
rm(is_na, sel, all_antimicrobials, all_clinical_signs, var_names)
run <- function(x) {
if (nb_cores < 2) return(purrr::pmap(x, f, df = df))
c(FUN = f, x, MoreArgs = list(list(df = df)), SIMPLIFY = FALSE, mc.cores = nb_cores) %>%
do.call(parallel::mcmapply, .)
}
environment() %>%
as.list() %>%
`[<-`(c("f", "df", "run", "nb_cores"), NULL) %>%
expand.grid() %>%
assign("plan", ., 1) %>% # saving intermediate value
run() %>%
lapply(get_outputs) %>%
purrr::transpose() %>%
tibble::as_tibble() %>%
dplyr::mutate_at(c("n", "or", "p"), unlist) %>%
dplyr::bind_cols(tibble::as_tibble(plan), .) %>% # reusing the intermediate value
dplyr::mutate(lower = unlist(lapply(ci, head, 1)),
upper = unlist(lapply(ci, tail, 1))) %>%
dplyr::select(-ci)
}
Let’s also write two functions that will help the exploration of the outputs of the run_plan()
function. The first one develops the list of clinical signs and / or antimicrobials so that they can easily be read in the data frame:
develop <- function(df, var = NULL) {
names_df <- names(df)
var_ref <- c("clinical_signs", "antimicrobials")
if (is.null(var)) var <- intersect(names_df, var_ref)
stopifnot(var %in% names_df)
stopifnot(var %in% var_ref)
dplyr::mutate_at(df, var, ~ sapply(., paste, collapse = ", "))
}
The second one removes the columns that contain no variability:
simplify <- function(df) {
df[, sapply(df, function(x) length(unique(x))) > 1]
}
Let’s now try these three functions. With default values:
run_plan(pipeline6, viparc)
## # A tibble: 1 x 12
## clinical_signs antimicrobials filtering watching first_wks n ct prob or p lower upper
## <list> <list> <dbl> <dbl> <dbl> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 <chr [6]> <chr [45]> 1 1 1 2657 <table> <table> 0.0357 0.855 -0.348 0.420
By default, the set of all clinical signs and the set of all antimicrobials are considered, and the durations of the begining of the flock, the filtering period, and the observation period are fixed to 1 week. This can be changed by assigning values to the arguments of the run_plan()
function. For example, considering 2 different durations of the filtering period:
(tmp <- run_plan(pipeline6, viparc, filtering = 1:2))
## # A tibble: 2 x 12
## clinical_signs antimicrobials filtering watching first_wks n ct prob or p lower upper
## <list> <list> <int> <dbl> <dbl> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 <chr [6]> <chr [45]> 1 1 1 2657 <table> <table> 0.0357 0.855 -0.348 0.420
## 2 <chr [6]> <chr [45]> 2 1 1 2023 <table> <table> 0.236 0.350 -0.259 0.731
Note that you can access the contingency tables this way:
tmp$ct
## [[1]]
## disease
## amu FALSE TRUE Sum
## FALSE 2126 231 2357
## TRUE 242 58 300
## Sum 2368 289 2657
##
## [[2]]
## disease
## amu FALSE TRUE Sum
## FALSE 1671 146 1817
## TRUE 165 41 206
## Sum 1836 187 2023
Access to the probabilities would be done the same way:
tmp$prob
## [[1]]
## disease
## amu FALSE TRUE
## FALSE 0.90 0.10
## TRUE 0.81 0.19
##
## [[2]]
## disease
## amu FALSE TRUE
## FALSE 0.92 0.08
## TRUE 0.80 0.20
and same thing for the list of clinical signs and antimicrobials. If you want to see only the columns for which values change:
simplify(tmp)
## # A tibble: 2 x 8
## filtering n ct prob or p lower upper
## <int> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 1 2657 <table> <table> 0.0357 0.855 -0.348 0.420
## 2 2 2023 <table> <table> 0.236 0.350 -0.259 0.731
Let’s see a more complex example:
(tmp <- run_plan(pipeline6, viparc, filtering = 1:2, watching = 1:2,
antimicrobials = list("colistin_use", c("colistin_use", "cefotaxime_use")),
clinical_signs = "respiratory", first_wks = 1))
## # A tibble: 8 x 12
## clinical_signs antimicrobials filtering watching first_wks n ct prob or p lower upper
## <list> <list> <int> <int> <dbl> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 <chr [1]> <chr [1]> 1 1 1 4370 <table> <table> -0.251 0.537 -1.05 0.545
## 2 <chr [1]> <chr [2]> 1 1 1 4369 <table> <table> -0.253 0.534 -1.05 0.543
## 3 <chr [1]> <chr [1]> 2 1 1 3909 <table> <table> -0.280 0.566 -1.24 0.677
## 4 <chr [1]> <chr [2]> 2 1 1 3907 <table> <table> -0.280 0.566 -1.24 0.677
## 5 <chr [1]> <chr [1]> 1 2 1 4065 <table> <table> -0.127 0.664 -0.700 0.446
## 6 <chr [1]> <chr [2]> 1 2 1 4064 <table> <table> -0.130 0.657 -0.703 0.443
## 7 <chr [1]> <chr [1]> 2 2 1 3609 <table> <table> -0.0904 0.789 -0.752 0.572
## 8 <chr [1]> <chr [2]> 2 2 1 3607 <table> <table> -0.0917 0.786 -0.754 0.570
This output can be visualized with the simplify()
function:
simplify(tmp)
## # A tibble: 8 x 10
## antimicrobials filtering watching n ct prob or p lower upper
## <list> <int> <int> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 <chr [1]> 1 1 4370 <table> <table> -0.251 0.537 -1.05 0.545
## 2 <chr [2]> 1 1 4369 <table> <table> -0.253 0.534 -1.05 0.543
## 3 <chr [1]> 2 1 3909 <table> <table> -0.280 0.566 -1.24 0.677
## 4 <chr [2]> 2 1 3907 <table> <table> -0.280 0.566 -1.24 0.677
## 5 <chr [1]> 1 2 4065 <table> <table> -0.127 0.664 -0.700 0.446
## 6 <chr [2]> 1 2 4064 <table> <table> -0.130 0.657 -0.703 0.443
## 7 <chr [1]> 2 2 3609 <table> <table> -0.0904 0.789 -0.752 0.572
## 8 <chr [2]> 2 2 3607 <table> <table> -0.0917 0.786 -0.754 0.570
the develop()
function:
develop(tmp)
## # A tibble: 8 x 12
## clinical_signs antimicrobials filtering watching first_wks n ct prob or p lower upper
## <chr> <chr> <int> <int> <dbl> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 respiratory colistin_use 1 1 1 4370 <table> <table> -0.251 0.537 -1.05 0.545
## 2 respiratory colistin_use, cefotaxime_use 1 1 1 4369 <table> <table> -0.253 0.534 -1.05 0.543
## 3 respiratory colistin_use 2 1 1 3909 <table> <table> -0.280 0.566 -1.24 0.677
## 4 respiratory colistin_use, cefotaxime_use 2 1 1 3907 <table> <table> -0.280 0.566 -1.24 0.677
## 5 respiratory colistin_use 1 2 1 4065 <table> <table> -0.127 0.664 -0.700 0.446
## 6 respiratory colistin_use, cefotaxime_use 1 2 1 4064 <table> <table> -0.130 0.657 -0.703 0.443
## 7 respiratory colistin_use 2 2 1 3609 <table> <table> -0.0904 0.789 -0.752 0.572
## 8 respiratory colistin_use, cefotaxime_use 2 2 1 3607 <table> <table> -0.0917 0.786 -0.754 0.570
or the two in combination:
simplify(develop(tmp))
## # A tibble: 8 x 10
## antimicrobials filtering watching n ct prob or p lower upper
## <chr> <int> <int> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 colistin_use 1 1 4370 <table> <table> -0.251 0.537 -1.05 0.545
## 2 colistin_use, cefotaxime_use 1 1 4369 <table> <table> -0.253 0.534 -1.05 0.543
## 3 colistin_use 2 1 3909 <table> <table> -0.280 0.566 -1.24 0.677
## 4 colistin_use, cefotaxime_use 2 1 3907 <table> <table> -0.280 0.566 -1.24 0.677
## 5 colistin_use 1 2 4065 <table> <table> -0.127 0.664 -0.700 0.446
## 6 colistin_use, cefotaxime_use 1 2 4064 <table> <table> -0.130 0.657 -0.703 0.443
## 7 colistin_use 2 2 3609 <table> <table> -0.0904 0.789 -0.752 0.572
## 8 colistin_use, cefotaxime_use 2 2 3607 <table> <table> -0.0917 0.786 -0.754 0.570
which is equivalent to:
develop(simplify(tmp))
## # A tibble: 8 x 10
## antimicrobials filtering watching n ct prob or p lower upper
## <chr> <int> <int> <int> <list> <list> <dbl> <dbl> <dbl> <dbl>
## 1 colistin_use 1 1 4370 <table> <table> -0.251 0.537 -1.05 0.545
## 2 colistin_use, cefotaxime_use 1 1 4369 <table> <table> -0.253 0.534 -1.05 0.543
## 3 colistin_use 2 1 3909 <table> <table> -0.280 0.566 -1.24 0.677
## 4 colistin_use, cefotaxime_use 2 1 3907 <table> <table> -0.280 0.566 -1.24 0.677
## 5 colistin_use 1 2 4065 <table> <table> -0.127 0.664 -0.700 0.446
## 6 colistin_use, cefotaxime_use 1 2 4064 <table> <table> -0.130 0.657 -0.703 0.443
## 7 colistin_use 2 2 3609 <table> <table> -0.0904 0.789 -0.752 0.572
## 8 colistin_use, cefotaxime_use 2 2 3607 <table> <table> -0.0917 0.786 -0.754 0.570
Let’s download the data on the antimicrobials classes:
ab_classes <- readr::read_csv("https://raw.githubusercontent.com/viparc/clires_data/master/data/antimicrobial_classes.csv", col_types = "cc")
ab_classes$antimicrobial <- paste0(ab_classes$antimicrobial, "_use")
It looks like this:
ab_classes
## # A tibble: 44 x 2
## antimicrobial class
## <chr> <chr>
## 1 amoxicillin_use penecillins
## 2 ampicillin_use penecillins
## 3 cefadroxil_use cephalosporins
## 4 cefotaxime_use cephalosporins
## 5 cephalexin_use cephalosporins
## 6 ceftiofur_use cephalosporins
## 7 colistin_use polypeptides
## 8 enramycin_use polypeptides
## 9 enrofloxacin_use quinolones
## 10 flumequine_use quinolones
## # … with 34 more rows
We can use the run_plan()
function to compute odds ratios across antimicrobials classes, clinical signs and values 1 to 3 for the durations in weeks of the initial, filtering and watching periods (it takes about 15’ to run on 3 cores):
all_cs <- function(x) {
sel <- sapply(x, length)
x[which(sel > 1)] <- "all"
x
}
replace_empty <- function(x) {
x[which(x == "")] <- "all"
x
}
tmp <- run_plan(pipeline6, viparc, filtering = 1:3, watching = 1:3,
antimicrobials = c(split(ab_classes$antimicrobial, ab_classes$class), NA),
clinical_signs = c(as.list(setdiff(grep("_use$", names(viparc), value = TRUE, invert = TRUE), c("farm", "flock", "week"))), NA),
first_wks = 1:3, nb_cores = 3) %>%
dplyr::mutate(clinical_signs = unlist(all_cs(clinical_signs)),
antimicrobials = replace_empty(names(antimicrobials))) %>%
dplyr::rename(class = antimicrobials)
tmp <- names(tmp$antimicrobials)
cs <- tmp$clinical_signs
sel <- sapply(tmp$clinical_signs, length)
cs[which(sel > 1)] <- "all"
We can the split this output by the durations of the initial, filtering and watching periods, and spread the data to show antimicrobials classes in rows and clinical signs in columns:
tmp %>%
dplyr::mutate(or = exp(or),
lower = exp(lower),
upper = exp(upper)) %>%
dplyr::group_by(filtering, watching, first_wks) %>%
dplyr::group_split() %>%
lapply(. %>%
dplyr::select(clinical_signs, class, or) %>%
tidyr::pivot_wider(names_from = clinical_signs, values_from = or))
The analyses were performed on the system and versions as defined below:
sessionInfo(required)
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] dplyr_0.8.3 magrittr_1.5 mgcv_1.8-31 purrr_0.3.3 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 knitr_1.26 MASS_7.3-51.4 hms_0.5.2 splines_3.6.2
## [6] grDevices_3.6.2 tidyselect_0.2.99.9000 lattice_0.20-38 R6_2.4.1 rlang_0.4.2
## [11] fansi_0.4.0 stringr_1.4.0 mgcv.helper_0.1.8 tools_3.6.2 utils_3.6.2
## [16] grid_3.6.2 nlme_3.1-142 xfun_0.11 utf8_1.1.4 cli_2.0.0
## [21] htmltools_0.4.0 stats_3.6.2 datasets_3.6.2 yaml_2.2.0 assertthat_0.2.1
## [26] digest_0.6.23 tibble_2.1.3 base_3.6.2 crayon_1.3.4 Matrix_1.2-18
## [31] vctrs_0.2.0.9007 graphics_3.6.2 curl_4.3 glue_1.3.1 evaluate_0.14
## [36] rmarkdown_2.0 stringi_1.4.3 compiler_3.6.2 pillar_1.4.2 methods_3.6.2
## [41] pkgconfig_2.0.3
If you want to run the code locally, you can copy and paste code chunks to your commands line. Make sure though that your system and packages versions are the same or more recent than the ones listed above.
You can download the R markdown file used to generate this page as so:
download.file("https://raw.githubusercontent.com/viparc/prophylactic/master/prophylactic.Rmd", "prophylactic.Rmd")
You can transforme this prophylactic.Rmd
R markdown document in an R script named prophylactic.R
as so:
if (!require("kintr")) install.packages("knitr")
knitr::purl("prophylactic.Rmd")
script <- readLines("prophylactic.R")
writeLines(script[-c(seq_len(which(grepl("starts here", script))[1]),
which(grepl("session_info", script))[1]:length(script))],
"prophylactic.R")
To execute the code:
source("prophylactic.R")