Simulation Internals
More in-depth documentation on specific types and functions.
Contents
Index
Simulation.Barcode
Simulation.CRISPRi
Simulation.CRISPRn
Simulation.Cas9Behavior
Simulation.Delta
Simulation.FacsScreen
Simulation.GrowthScreen
Simulation.KDPhenotypeRelationship
Simulation.Library
Base.Sort.select
Simulation.auprc
Simulation.auroc
Simulation.construct_library
Simulation.counts_to_freqs
Simulation.differences_between_bins
Simulation.linear
Simulation.noise
Simulation.sequencing
Simulation.sigmoid
Simulation.signal
Simulation.transfect
Simulation.venn
Simulation Types
Simulation.FacsScreen
— Type.A type representing the parameters used in a typical FACS screen.
num_genes
Number of genes targeted by the screen
coverage
Number of guides per gene
representation
Number of cells with each guide.
representation=10
means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOImoi
The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see
Simulation.transfect
).NoteWe do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:
\[P(x = 1; Poisson(λ))\]For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (
num_genes
*coverage
*representation
).
σ
The standard deviation expected for cells during FACS sorting. This should be set according to the biological variance experimentally observed, e.g. in the fluorescence intensity of isogenic cells
bin_info
Range of guide phenotypes to collect in each bin
Example
In the following example
p = 0.05 bin_info = Dict(:bin1 => (0.0, p), :bin2 => (1.0-p, 1.0))
The 5th percentile of cells sorted according to their phenotype (fluorescence, size, etc) will be compared to the 95th percentile.
bottleneck_representation
Number of cells sorted expressed as an integer multiple of the number of guides
seq_depth
Sequencing depth as a integer multiple of the number of guides, i.e.
seq_depth=10
is equivalent to10 * num_genes * coverage
reads.
Simulation.GrowthScreen
— Type.A type representing the parameters used in a typical growth-based screen.
num_genes
Number of genes targeted by the screen
coverage
Number of guides per gene
representation
Number of cells with each guide.
representation=10
means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOImoi
The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see
Simulation.transfect
).NoteWe do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:
\[P(x = 1; Poisson(λ))\]For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (
num_genes
*coverage
*representation
).
seq_depth
Sequencing depth as a integer multiple of the number of guides, i.e.
seq_depth=10
is equivalent to10 * num_genes * coverage
reads.
bottleneck_representation
For growth screens, how much of a bottleneck is applied. This is minimum number of cells that is kept when passaging the pool of cells. This is expressed as an integer multiple of the number of guides.
num_bottlenecks
For growth screens, how many bottlenecks are applied. This is the integer number of passages during the growth screen.
noise
Before each passage, the theoretical phenotype of each cell is convolved with a normal noise distribution with a standard deviation, σ, dictated by this parameter. This should be set based on an expectation of noisiness of the subsampling
Key functions
Simulation.construct_library
— Function.construct_library(setup, lib)
Constructs the guide library for N
genes with coverage
number of guides per gene. Returns a tuple of guides and their relative frequencies (assigned randomly).
Simulation.transfect
— Function.transfect(setup, lib, guides, guide_freqs_dist)
Simulates the transfection the of the library of guides (guides
) according to the parameters in Simulation.Library
lib
and the frequencies represented by the Categorical distribution guide_freqs_dist
.
This function creates a population of cells post-transfection where each cell has single guide RNA present according to the frequencies of the guides in the present in the library. The cells are then grown up so that there are setup.bottleneck_representation * num_guides
of them. For FACS screens we assume a simple linear expansion of the cells, while for Growth screens the cells grow according as a function of their phenotype and degree of knockdown.
Base.Sort.select
— Function.select(setup, cells, cell_phenotypes, guides; debug)
Given cells
, a vector of integers, and guides
, a vector of barcodes, performs simulated FACS sorting of the cells into bins
with the given cutoffs. σ
refers to the spread of the phenotype during the FACS screen.
select(setup, initial_cells, initial_cell_phenotypes, guides; debug)
Growth Screen selection
Simulation.sequencing
— Function.sequencing(depths, guides, samples)
Simulates next-gen sequencing by generating a Categorical distribution based on frequencies of each guide in each bin and then randomly samples from this distributions up to the depth provided in depths
for each bin. Returns a dict of DataFrames with the bin names as the keys. This object can then be passed through Simulation.counts_to_freqs
followed by Simulation.differences_between_bins
.
Simulation.counts_to_freqs
— Function.counts_to_freqs(bin_cells, guide_count)
Converts raw counts to frequencies by dividing by the total number of reads for each sample.
Simulation.differences_between_bins
— Function.differences_between_bins(raw_data)
Given the raw data from Simulation.sequencing
returns two DataFrames
guide_data
: This DataFrame contains the per-guide level data including the log2 fold change in the normalized frequencies of each guide between each pairwise combination of bins. Thus, if there aren
bins, then it computes the log2 fold changes for the $\frac{n!}{2(n-2)!}$ combinationsgene_data
: This DataFrame contains the same information but grouped by gene. The log2 fold change data from the first DataFrame is used to calculate the average log2 fold change per gene and a pvalue computed using a Mann-Whitney U-test as measure of how consistently shifted the guides are of this gene versus the population of negative control guides. (see below for more info)
A typical 2 bin guide_data
DataFrame contains the following columns:
Column Name | Meaning |
---|---|
gene | the gene ID of that this guide targets |
knockdown | activity of the guide on 0 to 1 scale, where 1 is complete knockout |
barcodeid | the ID of this specific guide |
theo_phenotype | expected phenotype of this guide, generally a -1 to 1 scale |
behavior | whether the target gene displays a linear or sigmoidal response to incomplete knockdown (see Simulation.Library for more details) |
class | which phenotype distribution the target gene was drawn from (see Simulation.Library for more details). Serves as the "ground truth" label against which screen performance is evaluated, e.g. with Simulation.auprc |
initial_freq | frequency of guide post-transfection (see Simulation.transfect ) |
counts_bin1 | the raw number of reads for each guide in the first bin |
freqs_bin1 | the number of reads for each guide divided by the total number of reads in this bin |
rel_freqs_bin1 | the frequency of each guide divided by the median frequency of negative control guides |
counts_bin2 | the raw number of reads for each guide in the second bin |
freqs_bin2 | the number of reads for each guide divided by the total number of reads in this bin |
rel_freqs_bin2 | the frequency of each guide divided by the median frequency of negative control guides for this bin |
log2fc_bin2_div_bin1 | the log2 fold change in relative guide frequencies between bin2 and bin1 , equivalent to log2(rel_freqs_bin2/rel_freqs_bin1) |
A typical gene_data
DataFrame contains the following data:
Column Name | Meaning |
---|---|
gene | this gene's ID |
class | see above |
behavior | see above |
mean_bin2_div_bin1 | the mean log 2 fold change in relative frequencies between from bin1 to bin2 for all the guides targeting this gene. Calculated as $\frac{1}{k}\sum_k \text{log2fc_bin2_div_bin1}_k$ for the $k$ guides targeting each gene |
pvalue_bin2_div_bin1 | the -log10 pvalue of the log2 fold changes of all guides targeting this gene as computed by the non-parametric Mann-Whitney U-test. A measure of the consistency of the log 2 fold changes[1] |
absmean_bin2_div_bin1 | absolute value of mean_bin2_div_bin1 per-gene |
pvalmeanprod_bin2_div_bin1 | mean_bin2_div_bin1 multiplied with the pvalue_bin2_div_bin1 per-gene |
Further reading
Kampmann M, Bassik MC, Weissman JS. Integrated platform for genome-wide screening and construction of high-density genetic interaction maps in mammalian cells. Proc Natl Acad Sci U S A. 2013;110:E2317–26.
Miscellaneous Types
Simulation.Library
— Type.Wrapper containing all library construction parameters
knockdown_dist
Distribution of guide knockdown efficiencies
knockdown_probs
max_phenotype_dists
Maximum phenotype categories mapped to their probability of being selected and the distribution to draw from if they are selected.
Example
The basic layout is a
Dict
mapping a class name to a tuple of the probability of selecting this class and then theDistributions.Sampleable
from which to draw a random phenotype from this class. The probabilities across all the classes should add up to 1.max_phenotype_dists = Dict{Symbol, Tuple{Float64, Sampleable}}( :inactive => (0.60, Delta(0.0)), :negcontrol => (0.1, Delta(0.0)), :increasing => (0.3, TruncatedNormal(0.1, 0.1, 0.025, 1)), ); Library(max_phenotype_dists, CRISPRi());
For example, here we are making three different classes of "genes": the first group are :inactive, i.e. they have no phenotype, so we'll set their phenotypes to 0.0 using a
Simulation.Delta
. We'll also make them 60% of all the genes. The second group are the negative controls :negcontrol (the only required group) which make up 10% of the population of genes and also have no effect. The final group is :increasing which makes up 30% of all genes and which are represented by a Normal(μ=0.1, σ=0.1) distribution clamped between 0.025 and 1.
phenotype_probs
kd_phenotype_relationships
Knockdown-phenotype relationships mapped to their probability of being selected and their respective
Simulation.KDPhenotypeRelationship
relationship_probs
cas9_behavior
Whether this library is
Simulation.CRISPRi
orSimulation.CRISPRn
Simulation.Barcode
— Type.Any entity that is tracked through the pooled experiment. For CRISPR screens, this is equivalent to the sgRNA. This object stores properties relating to the performance of this entity in the screen.
gene
The target gene id
knockdown
The knockdown efficiency
theo_phenotype
The theoretical phenotype exerted
obs_phenotype
The observed phenotype (only relevant for FACS screens)
behavior
The gene behavior, either sigmoidal or linear
class
The gene activity class, either inactive, up or down
initial_freq
Initial frequency after transfection
A type representing a relationship between degree of knockdown and effect on phenotype
Simulation.Cas9Behavior
— Type.A type representing the behavior of different Cas9s
Simulation.CRISPRn
— Type.type CRISPRn <: Simulation.Cas9Behavior
CRISPR KO behavior is more complex since sgRNA-directed DNA damage repair is stochastic. We assume that 2/3 of repair events at a given locus lead to a frameshift, and that the screen is carried out in diploid cells. The assumption that only bi-allelic frame-shift mutations lead to a phenotype in CRISPRn screens for most sgRNAs is supported by the empirical finding that in-frame deletions mostly do not show strong phenotypes, unless they occur in regions encoding conserved residues or domains[2]
Horlbeck MA, Gilbert LA, Villalta JE, Adamson B, Pak RA, Chen Y, Fields AP, Park CY, Corn JE, Kampmann M, Weissman JS: Compact and highly active next- generation libraries for CRISPR-mediated gene repression and activation. Elife 2016, 5.
Simulation.CRISPRi
— Type.type CRISPRi <: Simulation.Cas9Behavior
CRISPRi behavior is simply determined by the activity of the guide
Simulation.Delta
— Type.type Delta <: Distributions.Sampleable{Distributions.Univariate,Distributions.Discrete}
Constructs a delta function at a given δ value. This distribution always emits the same value.
Miscellaneous functions
Simulation.signal
— Function.signal(guide_data; log2fc_col)
Computes the signal in an experiment, where the experimental signal is defined to be the average of signal of true hit genes. That value, the true hit gene signal, is the average ratio of the log2 fold change for a guide targeting a specific gene to the guide's theoretical phenotype.
where $N_{true}$ is the number of true hit genes and $k$ is the number of genes.
Simulation.noise
— Function.noise(guide_data; log2fc_col)
Computes the noise in an experiment, where noise is defined to be the standard deviation of log2 fold change in the negative controls. If more than 2 bins are used then the name of log2 fold change information can be provided to differentiate between multiple log 2 fold changes.
Simulation.linear
— Function.linear(x, l)
Given value(s) x
apply a simple linear function with maximum value l
Simulation.sigmoid
— Function.sigmoid(x, l, k, p)
Given value(s) x
apply the sigmoidal function with maximum value l
, a steepness k
, and an inflection point p
.
Simulation.auprc
— Function.auprc(scores, classes, pos_labels; rev)
Computes the area under the Precision-Recall curve using a lower trapezoidal estimator, which is more accurate for skewed datasets.
K. Boyd, K. H. Eng, and C. D. Page, “Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals,” in Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, Eds. Springer Berlin Heidelberg, 2013, pp. 451–466.
Simulation.auroc
— Function.auroc(scores, classes, pos_labels; rev)
Optimized function for computing the area under the receiver operator characteristic curve.
Simulation.venn
— Function.venn(scores, classes, pos_labels; rev)
Given N
positive examples, computes the percentage of the top N/2
hits that are correct