Title: | Example Data Sets for Use with Discrete Statistical Tests |
---|---|
Description: | Provides several data sets for use with discrete statistical tests and discrete multiple testing procedures. Some of them are also available as a four-column version, so that each row represents a 2x2 table. |
Authors: | Christina Kihn [aut], Sebastian Döhler [aut], Florian Junge [cre, aut] |
Maintainer: | Florian Junge <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2024-10-25 04:31:43 UTC |
Source: | https://github.com/disohda/discretedatasets |
This package contains example datasets for use with discrete statistical tests and discrete multiple testing procedures. Some of them are also available as a four-column version, so that each row represents a 2x2 table.
Maintainer: Florian Junge [email protected]
Authors:
Christina Kihn
Sebastian Döhler
Useful links:
Report bugs at https://github.com/DISOhda/DiscreteDatasets/issues
Read counts per gene for airway smooth muscle cell lines RNA-Seq experiment
data("airway")
data("airway_treat")
data("airway_four_columns")
airway is a data.frame with 63,677 rows and 8 columns. Each row corresponds to a specific gene and each column to a labeled sample.
airway_treat is a data.frame with 63,677 rows representing genes with the following two columns:
Number of reads for the specific gene in all treated samples.
Number of reads for the specific gene in all untreated samples.
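Thus, each line describes a 2x2 table, e.g.:
ENSG00000000003 | This gene | All other genes |
Treatment | reads for this gene in treated samples | 89,561,179 - reads for this gene in treated samples |
No Treatment | reads for this gene in untreated samples | 85,955,244 - reads for this gene in untreated samples |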
airway_four_columns is a data.frame with 63,677 rows representing genes with the following four columns:
Number of reads for the specific gene in all treated samples.
Number of reads for the specific gene in all untreated samples.
Number of reads for all other genes in all treated samples.
Number of reads for all other genes in all untreated samples.
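Thus, each line describes a 2x2 table, e.g.:
ENSG00000000003 | This gene | All other genes |
Treatment | reads for this gene in treated samples | reads for all other genes in treated samples |
No Treatment | reads for this gene in untreated samples | reads for all other genes in untreated samples |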
The cell lines of the even-numbered samples were treated with dexamethasone, whereas the cell lines of the odd-numbered samples were not. There were 89,561,179 reads for all treated samples and 85,955,244 for the untreated ones.
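The aggregation described above can be sketched in base R. This is only an illustration on made-up counts, not the code used to build the package data; the matrix `counts` and all column names (`treated`, `untreated`, `other_treated`, `other_untreated`) are assumptions chosen for this example.

```r
# Hypothetical toy counts: 3 genes x 8 samples (NOT the real airway data).
counts <- matrix(
  c(10, 12,  9, 15, 11, 14,  8, 13,
     0,  1,  0,  0,  2,  0,  1,  0,
     5,  7,  6,  4,  5,  9,  7,  6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("sample", 1:8))
)

# Even-numbered samples were treated, odd-numbered ones were not.
treated   <- seq(2, ncol(counts), by = 2)
untreated <- seq(1, ncol(counts), by = 2)

# Two-column layout: reads per gene, summed over each group.
two_cols <- data.frame(
  treated   = rowSums(counts[, treated]),
  untreated = rowSums(counts[, untreated])
)

# Four-column layout: add the reads of all other genes per group.
four_cols <- cbind(
  two_cols,
  other_treated   = sum(two_cols$treated) - two_cols$treated,
  other_untreated = sum(two_cols$untreated) - two_cols$untreated
)
four_cols
```

In the real data, the group totals used for the subtraction would be the 89,561,179 and 85,955,244 reads stated above.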
The original airway dataset has been taken from the airway Bioconductor package. Since the original data would require other Bioconductor packages to access it, it has been reformatted to a standard data frame (with assay(airway)) which only contains the raw numeric data.
FASTQ files from SRA, phenotypic data from GEO
Himes, B. E., Jiang, X., Wagner, P., Hu, R., Wang, Q., Klanderman, B., Whitaker, R. M., Duan, Q., Lasky-Su, J., Nikolos, C., Jester, W., Johnson, M., Panettieri, R. Jr., Tantisira, K. G., Weiss, S. T., Lu, Q. (2014). RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells. PLoS One 9(6). doi:10.1371/journal.pone.0099625
This data set lists, for each of 2,446 drugs in the MHRA database (column 1), the number of cases with amnesia as an adverse event (column 2) and the number of cases with other adverse events for this drug (column 3). In total, 684,652 adverse drug reactions were reported, among them 2,044 cases of amnesia.
data("amnesia")
data("amnesia_four_columns")
amnesia is a data.frame with 2,446 rows representing drugs with the following two columns:
Number of the amnesia cases reported for the drug.
Number of other adverse drug reactions reported for the drug.
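Thus, each line describes a 2x2 table, e.g.:
1-ANDROSTENEDIOL | This drug | All other drugs |
Amnesia cases | amnesia cases for this drug | 2,044 - amnesia cases for this drug |
Other adverse cases | other adverse reactions for this drug | 682,648 - other adverse reactions for this drug |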
amnesia_four_columns is a data.frame with 2,446 rows representing drugs with the following four columns:
Number of the amnesia cases reported for the drug.
Number of the amnesia cases reported for all other drugs.
Number of other adverse drug reactions reported for the drug.
Number of other adverse drug reactions reported for all other drugs.
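Thus, each line describes a 2x2 table:
1-ANDROSTENEDIOL | This drug | All other drugs |
Amnesia cases | amnesia cases for this drug | amnesia cases for all other drugs |
Other adverse cases | other adverse reactions for this drug | other adverse reactions for all other drugs |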
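To illustrate how one such 2x2 table feeds into a discrete test, the following base R sketch builds the table for a single drug and applies Fisher's exact test. The counts for the drug are made up for this example; only the two totals come from the text above, and the choice of a one-sided alternative is an assumption.

```r
# Totals as stated in the table description above.
total_amnesia <- 2044
total_other   <- 682648

# Hypothetical counts for one drug (NOT taken from the data set).
drug_amnesia <- 3
drug_other   <- 120

tab <- matrix(
  c(drug_amnesia, total_amnesia - drug_amnesia,
    drug_other,   total_other   - drug_other),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Amnesia cases", "Other adverse cases"),
    c("This drug", "All other drugs")
  )
)

# One-sided test: is amnesia over-represented for this drug?
res <- fisher.test(tab, alternative = "greater")
res$p.value
```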
The data was collected from the Drug Analysis Prints published by the Medicines and Healthcare products Regulatory Agency (MHRA), by Heller & Gur. See references for more details.
The original amnesia dataset has been taken from the discreteMTP package, which is no longer available on CRAN. It has been reformatted such that the names in the first column are now row names; this way, the actual contents of the table are purely numeric.
Drug Analysis Prints on MHRA site
R. Heller and H. Gur (2011). False discovery rate controlling procedures for discrete tests. arXiv pre-print arXiv:1112.4627v2.
For earlier recognition of diseases, multiple variations of the human base sequence are studied. The so-called coverage of each base is calculated to detect duplications, deletions and insertions in the base sequence. To find these variations, a hypothesis test is performed for each base in the tested area. The null hypothesis is that the observed coverage of a base follows a Poisson distribution whose mean is the expected coverage Cb, which can be calculated using a given formula. If the observed coverage is exceptionally high or low, the null hypothesis is rejected. For each type of variation, there is a different formula to calculate the expected coverages. The expected coverages in this data set were calculated using the formula for a local test without GC-correction.
data("disorderdetection")
A data frame with 315 rows representing a base sequence with the following 2 columns:
observed frequencies
Observed coverage of each base
expected frequencies
Expected coverage of each base
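A per-base test of observed against expected coverage can be sketched with base R's Poisson distribution functions. The numbers below are made up, and the two-sided p-value is formed by doubling the smaller tail probability, which is one common convention and an assumption here, not necessarily the definition used in the cited paper.

```r
# Hypothetical observed and expected coverages for four bases.
observed <- c(12, 30, 3, 18)
expected <- c(15.2, 14.8, 15.5, 16.1)

# Tail probabilities under a Poisson distribution with the expected coverage as mean.
p_lower <- ppois(observed, lambda = expected)                          # P(X <= observed)
p_upper <- ppois(observed - 1, lambda = expected, lower.tail = FALSE)  # P(X >= observed)

# Two-sided p-value: doubled smaller tail, capped at 1.
p_two_sided <- pmin(1, 2 * pmin(p_lower, p_upper))
round(p_two_sided, 4)
```

Exceptionally high (second base) or low (third base) coverages yield small p-values, which matches the rejection rule described above.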
The data was taken from the article "Goodness-of-fit tests for disorder detection in NGS experiments" by Jiménez-Otero, de Uña-Álvarez and Pardo-Fernández, published in the Biometrical Journal. See references for more details.
Jiménez-Otero N, de Uña-Álvarez J, Pardo-Fernández JC (2019). Goodness-of-fit tests for disorder detection in NGS experiments. Biometrical Journal, 61(2), pp. 424-441. doi:10.1002/bimj.201700284.
This data set has been analyzed and provided by the listed reference. Two groups with different types of HIV (type B and type C), each consisting of 73 participants, were examined. Within both groups, the number of amino acid mutations at each position was determined.
data("hiv")
data("hiv_four_columns")
hiv is a data.frame with 118 rows and the following two columns:
Number of test subjects with HIV type C and mutated i-th amino acid.
Number of test subjects with HIV type B and mutated i-th amino acid.
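Thus, each row describes a 2x2 table:
Subject 1 | Mutation | No mutation |
Type C | type C subjects with mutation | 73 - type C subjects with mutation |
Type B | type B subjects with mutation | 73 - type B subjects with mutation |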
hiv_four_columns is a data.frame with 118 rows and the following four columns:
Number of test subjects with HIV type C and mutated i-th amino acid.
Number of test subjects with HIV type B and mutated i-th amino acid.
Number of test subjects with HIV type C and non-mutated i-th amino acid.
Number of test subjects with HIV type B and non-mutated i-th amino acid.
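Thus, each row describes a 2x2 table:
Subject 1 | Mutation | No mutation |
Type C | type C subjects with mutation | type C subjects without mutation |
Type B | type B subjects with mutation | type B subjects without mutation |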
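With all four cells available, one test per row can be run directly. The sketch below uses made-up mutation counts and hypothetical column names (`mut_C`, `mut_B`, `nomut_C`, `nomut_B`); only the group size of 73 comes from the description above.

```r
# Hypothetical mutation counts at three positions (NOT the real hiv data).
n_per_group <- 73
two_cols <- data.frame(mut_C = c(14, 2, 40), mut_B = c(3, 1, 38))

# Restore the "no mutation" cells from the known group size.
four_cols <- cbind(
  two_cols,
  nomut_C = n_per_group - two_cols$mut_C,
  nomut_B = n_per_group - two_cols$mut_B
)

# One two-sided Fisher test per position (row).
p_values <- apply(four_cols, 1, function(r) {
  # Column-wise fill gives rows Type C / Type B and columns mutation / no mutation.
  tab <- matrix(r, nrow = 2)
  fisher.test(tab)$p.value
})
p_values
```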
The original hiv dataset has been taken from the fdrDiscreteNull package, where it is named hivdata.
Gilbert, P. B. (2005). A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1), pp. 143-158. doi:10.1111/j.1467-9876.2005.00475.x
This dataset has been analyzed and provided by the listed reference. There are around 22,000 cytosines, each of which is observed under two conditions. For each cytosine under each condition, there is only one replicate. The discrete count for each replicate can be modeled by a binomial distribution, and Fisher's exact test can be applied to assess whether a cytosine is differentially methylated. The filtered data lister contains cytosines whose total counts for both lines are greater than 5 and whose count for each line does not exceed 25.
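The filtering rule can be sketched as follows. The data frame `raw` and its column names are hypothetical, and reading "total counts for both lines" as the sum of the two counts per cytosine is an assumption of this example.

```r
# Hypothetical counts for five cytosines under the two conditions.
raw <- data.frame(
  line1 = c(0, 3, 12, 30, 2),
  line2 = c(4, 4, 20,  1, 3)
)

# Keep rows whose total count exceeds 5 and whose per-line counts are at most 25.
keep <- (raw$line1 + raw$line2 > 5) & (raw$line1 <= 25) & (raw$line2 <= 25)
filtered <- raw[keep, ]
filtered
```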
data("listerdata")
data("listerdata_four_columns")
listerdata is a data.frame with 3,525 rows and the following two columns:
Degree of methylation of the i-th cytosine in reference genome.
Degree of methylation of the i-th cytosine in mutated genome.
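Thus, each row describes a 2x2 table:
AT1G01070.1 | This cytosine | All other cytosines |
Col0 counts | count for this cytosine in Col0 | 34,244 - count for this cytosine in Col0 |
Met13 counts | count for this cytosine in Met13 | 39,342 - count for this cytosine in Met13 |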
listerdata_four_columns is a data.frame with 3,525 rows and the following four columns:
Degree of methylation of the i-th cytosine in reference genome.
Degree of methylation of the i-th cytosine in mutated genome.
Degree of methylation of all other cytosines in reference genome.
Degree of methylation of all other cytosines in mutated genome.
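AT1G01070.1 | This cytosine | All other cytosines |
Col0 counts | count for this cytosine in Col0 | count for all other cytosines in Col0 |
Met13 counts | count for this cytosine in Met13 | count for all other cytosines in Met13 |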
The original listerdata dataset has been taken from the fdrDiscreteNull package.
Lister, R., O'Malley, R., Tonti-Filippini, J., Gregory, B. D., Berry, C. C., Millar, A. H. & Ecker, J. R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133(3), pp. 523-536. doi:10.1016/j.cell.2008.03.029
Sometimes, fourfold tables are reformatted by replacing rows or columns by
marginal totals. This makes it impossible to use them straight away for
statistical tests like Fisher's exact test. But with that knowledge, the
missing values can easily be restored. The reconstruct_four
function uses a set of such reduced tables, stored row-wise in a matrix or a
data frame, and rebuilds the two reformatted cells when they were replaced by
marginal totals.
reconstruct_four(dat, idx_marginals = NULL, colnames_add = NULL)
dat |
integer matrix or data frame with exactly two
columns; each row represents the first column of a
2x2 matrix for which the other two values are to
be computed and appended to |
idx_marginals |
integer vector of exactly two values or
|
colnames_add |
character vector of exactly two unique character
strings or |
An integer data frame with four columns.
X1 <- c(4, 2, 2, 14, 6, 9, 4, 0, 1)
X2 <- c(0, 0, 1, 3, 2, 1, 2, 2, 2)
N1 <- rep(148, 9)
N2 <- rep(132, 9)

df1 <- data.frame(X1, X2, N1, N2)
reconstruct_four(df1, colnames_add = c("Y1", "Y2"))
# same as
reconstruct_four(df1, c(3, 4), c("Y1", "Y2"))

df2 <- data.frame(X1, N1, X2, N2)
reconstruct_four(df2, c(2, 4), c("Y1", "Y2"))
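The arithmetic behind restoring cells from marginal totals can be sketched in base R. The helper `restore_cells` below is hypothetical and is not the package's implementation; it assumes that the first marginal column belongs to the first count column and the second to the second, and it overwrites the marginals in place instead of renaming columns.

```r
# Hypothetical helper, NOT the implementation of reconstruct_four.
restore_cells <- function(dat, idx_marginals = c(3, 4)) {
  dat <- as.data.frame(dat)
  idx_counts <- setdiff(seq_len(4), idx_marginals)
  # Each marginal total minus its matching count gives the replaced cell.
  for (k in 1:2) {
    dat[[idx_marginals[k]]] <- dat[[idx_marginals[k]]] - dat[[idx_counts[k]]]
  }
  dat
}

X1 <- c(4, 2, 2, 14, 6, 9, 4, 0, 1)
X2 <- c(0, 0, 1, 3, 2, 1, 2, 2, 2)
N1 <- rep(148, 9)
N2 <- rep(132, 9)

out1 <- restore_cells(data.frame(X1, X2, N1, N2), c(3, 4))
out2 <- restore_cells(data.frame(X1, N1, X2, N2), c(2, 4))
head(out1)
```

Both column layouts lead to the same restored values, since only the positions of the marginal totals differ.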
In some situations, fourfold tables are reduced to two elements, which makes
it impossible to use them straight away for statistical tests like Fisher's
exact test. But sometimes, when all tables had the same known marginal sums,
the missing values can be restored using that additional information. The
reconstruct_two
function uses a set of such reduced tables, stored
row-wise in a matrix or a data frame, and rebuilds the two missing columns
from automatically computed or given marginal totals.
reconstruct_two(
  dat,
  totals = NULL,
  insert_at = NULL,
  colnames_add = NULL,
  colnames_prepend = NULL,
  colnames_append = NULL,
  colnames_sep = "_"
)
dat |
integer matrix or data frame with exactly two
columns; each row represents the first column of a
2x2 matrix for which the other two values are to
be computed and appended to |
totals |
integer vector of exactly one or two values or
|
insert_at |
integer vector of exactly two values between 1 and
4 or |
colnames_add |
character vector of exactly two unique character
strings or |
colnames_prepend |
character vector of exactly two unique character
strings ( |
colnames_append |
character vector of exactly two unique character
strings ( |
colnames_sep |
a single character or |
An integer data frame with four columns.
data(amnesia)
amnesia_four_columns <- reconstruct_two(
  amnesia, NULL, NULL, NULL, NULL,
  c("ThisDrug", "AllOtherDrugs"), "."
)
head(amnesia_four_columns)

data(hiv)
hiv_four_columns <- reconstruct_two(
  hiv, 73, NULL, NULL, NULL,
  c("Mutation", "NoMutation"), "."
)
head(hiv_four_columns)

data(listerdata)
listerdata_four_columns <- reconstruct_two(
  listerdata, c(34244, 39342), NULL, NULL, NULL,
  c("This_Cyto", "All_Other_Cytos"), "_"
)
head(listerdata_four_columns)
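The underlying idea of rebuilding two missing columns from marginal totals can be sketched without the package. The helper `complete_tables` and the column names `rest_1` and `rest_2` are hypothetical; in particular, using the column sums when no totals are given is an assumption about what "automatically computed" means, and the naming and insertion options of reconstruct_two are not modeled.

```r
# Hypothetical helper, NOT the implementation of reconstruct_two.
complete_tables <- function(dat, totals = NULL) {
  dat <- as.data.frame(dat)
  # Assumption: without given totals, the column sums serve as marginals.
  if (is.null(totals)) totals <- colSums(dat)
  totals <- rep_len(totals, 2)  # a single total applies to both columns
  cbind(
    dat,
    rest_1 = totals[1] - dat[[1]],
    rest_2 = totals[2] - dat[[2]]
  )
}

toy <- data.frame(a = c(1, 4, 0), b = c(3, 2, 7))
out_auto   <- complete_tables(toy)              # totals from column sums: 5 and 12
out_shared <- complete_tables(toy, totals = 9)  # one shared total for both columns
out_auto
out_shared
```

A single shared total corresponds to the hiv example above (73 subjects per group), while two distinct totals correspond to the listerdata example.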