Simulation Internals

More in-depth documentation on specific types and functions.

Simulation Internals

Simulation Types

Simulation.FacsScreen — Type.

A type representing the parameters used in a typical FACS screen.

num_genes
Number of genes targeted by the screen
coverage
Number of guides per gene
representation
Number of cells with each guide. representation=10 means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI
moi
The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see Simulation.transfect).
Note
We do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:
\[P(x = 1; Poisson(λ))\]
For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (num_genes * coverage * representation).

σ
The standard deviation expected for cells during FACS sorting. This should be set according to the biological variance experimentally observed, e.g. in the fluorescence intensity of isogenic cells
bin_info
Range of guide phenotypes to collect in each bin
Example
In the following example
```
p = 0.05
bin_info = Dict(:bin1 => (0.0, p), :bin2 => (1.0-p, 1.0))
```
The 5th percentile of cells sorted according to their phenotype (fluorescence, size, etc) will be compared to the 95th percentile.

bottleneck_representation
Number of cells sorted expressed as an integer multiple of the number of guides
seq_depth
Sequencing depth as a integer multiple of the number of guides, i.e. seq_depth=10 is equivalent to 10 * num_genes * coverage reads.

source

Simulation.GrowthScreen — Type.

A type representing the parameters used in a typical growth-based screen.

num_genes
Number of genes targeted by the screen
coverage
Number of guides per gene
representation
Number of cells with each guide. representation=10 means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI
moi
The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see Simulation.transfect).
Note
We do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:
\[P(x = 1; Poisson(λ))\]
For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (num_genes * coverage * representation).

seq_depth
Sequencing depth as a integer multiple of the number of guides, i.e. seq_depth=10 is equivalent to 10 * num_genes * coverage reads.

bottleneck_representation
For growth screens, how much of a bottleneck is applied. This is minimum number of cells that is kept when passaging the pool of cells. This is expressed as an integer multiple of the number of guides.
num_bottlenecks
For growth screens, how many bottlenecks are applied. This is the integer number of passages during the growth screen.
noise
Before each passage, the theoretical phenotype of each cell is convolved with a normal noise distribution with a standard deviation, σ, dictated by this parameter. This should be set based on an expectation of noisiness of the subsampling

source

Key functions

Simulation.construct_library — Function.

construct_library(setup, lib)

Constructs the guide library for N genes with coverage number of guides per gene. Returns a tuple of guides and their relative frequencies (assigned randomly).

source

Simulation.transfect — Function.

transfect(setup, lib, guides, guide_freqs_dist)

Simulates the transfection the of the library of guides (guides) according to the parameters in Simulation.Library lib and the frequencies represented by the Categorical distribution guide_freqs_dist.

This function creates a population of cells post-transfection where each cell has single guide RNA present according to the frequencies of the guides in the present in the library. The cells are then grown up so that there are setup.bottleneck_representation * num_guides of them. For FACS screens we assume a simple linear expansion of the cells, while for Growth screens the cells grow according as a function of their phenotype and degree of knockdown.

source

Base.Sort.select — Function.

select(setup, cells, cell_phenotypes, guides; debug)

Given cells, a vector of integers, and guides, a vector of barcodes, performs simulated FACS sorting of the cells into bins with the given cutoffs. σ refers to the spread of the phenotype during the FACS screen.

source

select(setup, initial_cells, initial_cell_phenotypes, guides; debug)

Growth Screen selection

source

Simulation.sequencing — Function.

sequencing(depths, guides, samples)

Simulates next-gen sequencing by generating a Categorical distribution based on frequencies of each guide in each bin and then randomly samples from this distributions up to the depth provided in depths for each bin. Returns a dict of DataFrames with the bin names as the keys. This object can then be passed through Simulation.counts_to_freqs followed by Simulation.differences_between_bins.

source

Simulation.counts_to_freqs — Function.

counts_to_freqs(bin_cells, guide_count)

Converts raw counts to frequencies by dividing by the total number of reads for each sample.

source

Simulation.differences_between_bins — Function.

differences_between_bins(raw_data)

Given the raw data from Simulation.sequencing returns two DataFrames

guide_data: This DataFrame contains the per-guide level data including the log2 fold change in the normalized frequencies of each guide between each pairwise combination of bins. Thus, if there are n bins, then it computes the log2 fold changes for the $\frac{n!}{2(n-2)!}$ combinations
gene_data: This DataFrame contains the same information but grouped by gene. The log2 fold change data from the first DataFrame is used to calculate the average log2 fold change per gene and a pvalue computed using a Mann-Whitney U-test as measure of how consistently shifted the guides are of this gene versus the population of negative control guides. (see below for more info)

A typical 2 bin guide_data DataFrame contains the following columns:

Column Name	Meaning
`gene`	the gene ID of that this guide targets
`knockdown`	activity of the guide on 0 to 1 scale, where 1 is complete knockout
`barcodeid`	the ID of this specific guide
`theo_phenotype`	expected phenotype of this guide, generally a -1 to 1 scale
`behavior`	whether the target gene displays a linear or sigmoidal response to incomplete knockdown (see `Simulation.Library` for more details)
`class`	which phenotype distribution the target gene was drawn from (see `Simulation.Library` for more details). Serves as the "ground truth" label against which screen performance is evaluated, e.g. with `Simulation.auprc`
`initial_freq`	frequency of guide post-transfection (see `Simulation.transfect`)
`counts_bin1`	the raw number of reads for each guide in the first bin
`freqs_bin1`	the number of reads for each guide divided by the total number of reads in this bin
`rel_freqs_bin1`	the frequency of each guide divided by the median frequency of negative control guides
`counts_bin2`	the raw number of reads for each guide in the second bin
`freqs_bin2`	the number of reads for each guide divided by the total number of reads in this bin
`rel_freqs_bin2`	the frequency of each guide divided by the median frequency of negative control guides for this bin
`log2fc_bin2_div_bin1`	the log2 fold change in relative guide frequencies between `bin2` and `bin1`, equivalent to `log2(rel_freqs_bin2/rel_freqs_bin1)`

A typical gene_data DataFrame contains the following data:

Column Name	Meaning
`gene`	this gene's ID
`class`	see above
`behavior`	see above
`mean_bin2_div_bin1`	the mean log 2 fold change in relative frequencies between from `bin1` to `bin2` for all the guides targeting this gene. Calculated as $\frac{1}{k}\sum_k \text{log2fc_bin2_div_bin1}_k$ for the $k$ guides targeting each gene
`pvalue_bin2_div_bin1`	the -log10 pvalue of the log2 fold changes of all guides targeting this gene as computed by the non-parametric Mann-Whitney U-test. A measure of the consistency of the log 2 fold changes[1]
`absmean_bin2_div_bin1`	absolute value of `mean_bin2_div_bin1` per-gene
`pvalmeanprod_bin2_div_bin1`	`mean_bin2_div_bin1` multiplied with the `pvalue_bin2_div_bin1` per-gene

Further reading

[1]

Kampmann M, Bassik MC, Weissman JS. Integrated platform for genome-wide screening and construction of high-density genetic interaction maps in mammalian cells. Proc Natl Acad Sci U S A. 2013;110:E2317–26.

source

Miscellaneous Types

Simulation.Library — Type.

Wrapper containing all library construction parameters

knockdown_dist
Distribution of guide knockdown efficiencies
knockdown_probs
max_phenotype_dists
Maximum phenotype categories mapped to their probability of being selected and the distribution to draw from if they are selected.
Example
The basic layout is a Dict mapping a class name to a tuple of the probability of selecting this class and then the Distributions.Sampleable from which to draw a random phenotype from this class. The probabilities across all the classes should add up to 1.
```
max_phenotype_dists = Dict{Symbol, Tuple{Float64, Sampleable}}(
    :inactive => (0.60, Delta(0.0)),
    :negcontrol => (0.1, Delta(0.0)),
    :increasing => (0.3, TruncatedNormal(0.1, 0.1, 0.025, 1)),
);
Library(max_phenotype_dists, CRISPRi());
```
For example, here we are making three different classes of "genes": the first group are :inactive, i.e. they have no phenotype, so we'll set their phenotypes to 0.0 using a Simulation.Delta. We'll also make them 60% of all the genes. The second group are the negative controls :negcontrol (the only required group) which make up 10% of the population of genes and also have no effect. The final group is :increasing which makes up 30% of all genes and which are represented by a Normal(μ=0.1, σ=0.1) distribution clamped between 0.025 and 1.

phenotype_probs
kd_phenotype_relationships
Knockdown-phenotype relationships mapped to their probability of being selected and their respective Simulation.KDPhenotypeRelationship

relationship_probs
cas9_behavior
Whether this library is Simulation.CRISPRi or Simulation.CRISPRn

source

Simulation.Barcode — Type.

Any entity that is tracked through the pooled experiment. For CRISPR screens, this is equivalent to the sgRNA. This object stores properties relating to the performance of this entity in the screen.

gene
The target gene id
knockdown
The knockdown efficiency
theo_phenotype
The theoretical phenotype exerted
obs_phenotype
The observed phenotype (only relevant for FACS screens)
behavior
The gene behavior, either sigmoidal or linear
class
The gene activity class, either inactive, up or down
initial_freq
Initial frequency after transfection

source

Simulation.KDPhenotypeRelationship — Type.

A type representing a relationship between degree of knockdown and effect on phenotype

source

Simulation.Cas9Behavior — Type.

A type representing the behavior of different Cas9s

source

Simulation.CRISPRn — Type.

type CRISPRn <: Simulation.Cas9Behavior

CRISPR KO behavior is more complex since sgRNA-directed DNA damage repair is stochastic. We assume that 2/3 of repair events at a given locus lead to a frameshift, and that the screen is carried out in diploid cells. The assumption that only bi-allelic frame-shift mutations lead to a phenotype in CRISPRn screens for most sgRNAs is supported by the empirical finding that in-frame deletions mostly do not show strong phenotypes, unless they occur in regions encoding conserved residues or domains[2]

[2]

Horlbeck MA, Gilbert LA, Villalta JE, Adamson B, Pak RA, Chen Y, Fields AP, Park CY, Corn JE, Kampmann M, Weissman JS: Compact and highly active next- generation libraries for CRISPR-mediated gene repression and activation. Elife 2016, 5.

source

Simulation.CRISPRi — Type.

type CRISPRi <: Simulation.Cas9Behavior

CRISPRi behavior is simply determined by the activity of the guide

source

Simulation.Delta — Type.

type Delta <: Distributions.Sampleable{Distributions.Univariate,Distributions.Discrete}

Constructs a delta function at a given δ value. This distribution always emits the same value.

source

Miscellaneous functions

Simulation.signal — Function.

signal(guide_data; log2fc_col)

Computes the signal in an experiment, where the experimental signal is defined to be the average of signal of true hit genes. That value, the true hit gene signal, is the average ratio of the log2 fold change for a guide targeting a specific gene to the guide's theoretical phenotype.

\[\frac{1}{N_{true} \times k} \sum_{i=1}^{N_{true}} \sum_{j=1}^k \frac{\log_2 fc_{ij}}{\text{theo phenotype}_{ij}}\]

where $N_{true}$ is the number of true hit genes and $k$ is the number of genes.

source

Simulation.noise — Function.

noise(guide_data; log2fc_col)

Computes the noise in an experiment, where noise is defined to be the standard deviation of log2 fold change in the negative controls. If more than 2 bins are used then the name of log2 fold change information can be provided to differentiate between multiple log 2 fold changes.

source

Simulation.linear — Function.

linear(x, l)

Given value(s) x apply a simple linear function with maximum value l

source

Simulation.sigmoid — Function.

sigmoid(x, l, k, p)

Given value(s) x apply the sigmoidal function with maximum value l, a steepness k, and an inflection point p.

source

Simulation.auprc — Function.

auprc(scores, classes, pos_labels; rev)

Computes the area under the Precision-Recall curve using a lower trapezoidal estimator, which is more accurate for skewed datasets.

K. Boyd, K. H. Eng, and C. D. Page, “Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals,” in Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, Eds. Springer Berlin Heidelberg, 2013, pp. 451–466.

source

Simulation.auroc — Function.

auroc(scores, classes, pos_labels; rev)

Optimized function for computing the area under the receiver operator characteristic curve.

source

Simulation.venn — Function.

venn(scores, classes, pos_labels; rev)

Given N positive examples, computes the percentage of the top N/2 hits that are correct

source

Simulation Internals

Contents

Index

Simulation Types

Example

Key functions

Miscellaneous Types

Example

Miscellaneous functions