Simulation Internals

# Simulation Internals

More in-depth documentation on specific types and functions.

## Simulation Types

A type representing the parameters used in a typical FACS screen.

• num_genes

Number of genes targeted by the screen

• coverage

Number of guides per gene

• representation

Number of cells with each guide. representation=10 means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI

• moi

The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see Simulation.transfect).

Note

We do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:

$P(x = 1; Poisson(λ))$

For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (num_genes * coverage * representation).

• σ

The standard deviation expected for cells during FACS sorting. This should be set according to the biological variance experimentally observed, e.g. in the fluorescence intensity of isogenic cells

• bin_info

Range of guide phenotypes to collect in each bin

# Example

In the following example

p = 0.05
bin_info = Dict(:bin1 => (0.0, p), :bin2 => (1.0-p, 1.0))

The 5th percentile of cells sorted according to their phenotype (fluorescence, size, etc) will be compared to the 95th percentile.

• bottleneck_representation

Number of cells sorted expressed as an integer multiple of the number of guides

• seq_depth

Sequencing depth as a integer multiple of the number of guides, i.e. seq_depth=10 is equivalent to 10 * num_genes * coverage reads.

source

A type representing the parameters used in a typical growth-based screen.

• num_genes

Number of genes targeted by the screen

• coverage

Number of guides per gene

• representation

Number of cells with each guide. representation=10 means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI

• moi

The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see Simulation.transfect).

Note

We do not model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:

$P(x = 1; Poisson(λ))$

For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (num_genes * coverage * representation).

• seq_depth

Sequencing depth as a integer multiple of the number of guides, i.e. seq_depth=10 is equivalent to 10 * num_genes * coverage reads.

• bottleneck_representation

For growth screens, how much of a bottleneck is applied. This is minimum number of cells that is kept when passaging the pool of cells. This is expressed as an integer multiple of the number of guides.

• num_bottlenecks

For growth screens, how many bottlenecks are applied. This is the integer number of passages during the growth screen.

• noise

Before each passage, the theoretical phenotype of each cell is convolved with a normal noise distribution with a standard deviation, σ, dictated by this parameter. This should be set based on an expectation of noisiness of the subsampling

source

## Key functions

construct_library(setup, lib)


Constructs the guide library for N genes with coverage number of guides per gene. Returns a tuple of guides and their relative frequencies (assigned randomly).

source
transfect(setup, lib, guides, guide_freqs_dist)


Simulates the transfection the of the library of guides (guides) according to the parameters in Simulation.Library lib and the frequencies represented by the Categorical distribution guide_freqs_dist.

This function creates a population of cells post-transfection where each cell has single guide RNA present according to the frequencies of the guides in the present in the library. The cells are then grown up so that there are setup.bottleneck_representation * num_guides of them. For FACS screens we assume a simple linear expansion of the cells, while for Growth screens the cells grow according as a function of their phenotype and degree of knockdown.

source
select(setup, cells, cell_phenotypes, guides; debug)


Given cells, a vector of integers, and guides, a vector of barcodes, performs simulated FACS sorting of the cells into bins with the given cutoffs. σ refers to the spread of the phenotype during the FACS screen.

source
select(setup, initial_cells, initial_cell_phenotypes, guides; debug)


Growth Screen selection

source
sequencing(depths, guides, samples)


Simulates next-gen sequencing by generating a Categorical distribution based on frequencies of each guide in each bin and then randomly samples from this distributions up to the depth provided in depths for each bin. Returns a dict of DataFrames with the bin names as the keys. This object can then be passed through Simulation.counts_to_freqs followed by Simulation.differences_between_bins.

source
counts_to_freqs(bin_cells, guide_count)


Converts raw counts to frequencies by dividing by the total number of reads for each sample.

source
differences_between_bins(raw_data)


Given the raw data from Simulation.sequencing returns two DataFrames

1. guide_data: This DataFrame contains the per-guide level data including the log2 fold change in the normalized frequencies of each guide between each pairwise combination of bins. Thus, if there are n bins, then it computes the log2 fold changes for the $\frac{n!}{2(n-2)!}$ combinations

2. gene_data: This DataFrame contains the same information but grouped by gene. The log2 fold change data from the first DataFrame is used to calculate the average log2 fold change per gene and a pvalue computed using a Mann-Whitney U-test as measure of how consistently shifted the guides are of this gene versus the population of negative control guides. (see below for more info)

A typical 2 bin guide_data DataFrame contains the following columns:

Column NameMeaning
genethe gene ID of that this guide targets
knockdownactivity of the guide on 0 to 1 scale, where 1 is complete knockout
barcodeidthe ID of this specific guide
theo_phenotypeexpected phenotype of this guide, generally a -1 to 1 scale
behaviorwhether the target gene displays a linear or sigmoidal response to incomplete knockdown (see Simulation.Library for more details)
classwhich phenotype distribution the target gene was drawn from (see Simulation.Library for more details). Serves as the "ground truth" label against which screen performance is evaluated, e.g. with Simulation.auprc
initial_freqfrequency of guide post-transfection (see Simulation.transfect)
counts_bin1the raw number of reads for each guide in the first bin
freqs_bin1the number of reads for each guide divided by the total number of reads in this bin
rel_freqs_bin1the frequency of each guide divided by the median frequency of negative control guides
counts_bin2the raw number of reads for each guide in the second bin
freqs_bin2the number of reads for each guide divided by the total number of reads in this bin
rel_freqs_bin2the frequency of each guide divided by the median frequency of negative control guides for this bin
log2fc_bin2_div_bin1the log2 fold change in relative guide frequencies between bin2 and bin1, equivalent to log2(rel_freqs_bin2/rel_freqs_bin1)

A typical gene_data DataFrame contains the following data:

Column NameMeaning
genethis gene's ID
classsee above
behaviorsee above
mean_bin2_div_bin1the mean log 2 fold change in relative frequencies between from bin1 to bin2 for all the guides targeting this gene. Calculated as $\frac{1}{k}\sum_k \text{log2fc_bin2_div_bin1}_k$ for the $k$ guides targeting each gene
pvalue_bin2_div_bin1the -log10 pvalue of the log2 fold changes of all guides targeting this gene as computed by the non-parametric Mann-Whitney U-test. A measure of the consistency of the log 2 fold changes[1]
absmean_bin2_div_bin1absolute value of mean_bin2_div_bin1 per-gene
pvalmeanprod_bin2_div_bin1mean_bin2_div_bin1 multiplied with the pvalue_bin2_div_bin1 per-gene

[1]

Kampmann M, Bassik MC, Weissman JS. Integrated platform for genome-wide screening and construction of high-density genetic interaction maps in mammalian cells. Proc Natl Acad Sci U S A. 2013;110:E2317–26.

source

## Miscellaneous Types

Wrapper containing all library construction parameters

• knockdown_dist

Distribution of guide knockdown efficiencies

• knockdown_probs

• max_phenotype_dists

Maximum phenotype categories mapped to their probability of being selected and the distribution to draw from if they are selected.

## Example

The basic layout is a Dict mapping a class name to a tuple of the probability of selecting this class and then the Distributions.Sampleable from which to draw a random phenotype from this class. The probabilities across all the classes should add up to 1.

max_phenotype_dists = Dict{Symbol, Tuple{Float64, Sampleable}}(
:inactive => (0.60, Delta(0.0)),
:negcontrol => (0.1, Delta(0.0)),
:increasing => (0.3, TruncatedNormal(0.1, 0.1, 0.025, 1)),
);
Library(max_phenotype_dists, CRISPRi());

For example, here we are making three different classes of "genes": the first group are :inactive, i.e. they have no phenotype, so we'll set their phenotypes to 0.0 using a Simulation.Delta. We'll also make them 60% of all the genes. The second group are the negative controls :negcontrol (the only required group) which make up 10% of the population of genes and also have no effect. The final group is :increasing which makes up 30% of all genes and which are represented by a Normal(μ=0.1, σ=0.1) distribution clamped between 0.025 and 1.

source

Any entity that is tracked through the pooled experiment. For CRISPR screens, this is equivalent to the sgRNA. This object stores properties relating to the performance of this entity in the screen.

• gene

The target gene id

• knockdown

The knockdown efficiency

• theo_phenotype

The theoretical phenotype exerted

• obs_phenotype

The observed phenotype (only relevant for FACS screens)

• behavior

The gene behavior, either sigmoidal or linear

• class

The gene activity class, either inactive, up or down

• initial_freq

Initial frequency after transfection

source

A type representing a relationship between degree of knockdown and effect on phenotype

source

A type representing the behavior of different Cas9s

source
type CRISPRn <: Simulation.Cas9Behavior

CRISPR KO behavior is more complex since sgRNA-directed DNA damage repair is stochastic. We assume that 2/3 of repair events at a given locus lead to a frameshift, and that the screen is carried out in diploid cells. The assumption that only bi-allelic frame-shift mutations lead to a phenotype in CRISPRn screens for most sgRNAs is supported by the empirical finding that in-frame deletions mostly do not show strong phenotypes, unless they occur in regions encoding conserved residues or domains[2]

[2]

Horlbeck MA, Gilbert LA, Villalta JE, Adamson B, Pak RA, Chen Y, Fields AP, Park CY, Corn JE, Kampmann M, Weissman JS: Compact and highly active next- generation libraries for CRISPR-mediated gene repression and activation. Elife 2016, 5.

source
type CRISPRi <: Simulation.Cas9Behavior

CRISPRi behavior is simply determined by the activity of the guide

source
type Delta <: Distributions.Sampleable{Distributions.Univariate,Distributions.Discrete}

Constructs a delta function at a given δ value. This distribution always emits the same value.

source

## Miscellaneous functions

signal(guide_data; log2fc_col)


Computes the signal in an experiment, where the experimental signal is defined to be the average of signal of true hit genes. That value, the true hit gene signal, is the average ratio of the log2 fold change for a guide targeting a specific gene to the guide's theoretical phenotype.

$\frac{1}{N_{true} \times k} \sum_{i=1}^{N_{true}} \sum_{j=1}^k \frac{\log_2 fc_{ij}}{\text{theo phenotype}_{ij}}$

where $N_{true}$ is the number of true hit genes and $k$ is the number of genes.

source
noise(guide_data; log2fc_col)


Computes the noise in an experiment, where noise is defined to be the standard deviation of log2 fold change in the negative controls. If more than 2 bins are used then the name of log2 fold change information can be provided to differentiate between multiple log 2 fold changes.

source
linear(x, l)


Given value(s) x apply a simple linear function with maximum value l

source
sigmoid(x, l, k, p)


Given value(s) x apply the sigmoidal function with maximum value l, a steepness k, and an inflection point p.

source
auprc(scores, classes, pos_labels; rev)


Computes the area under the Precision-Recall curve using a lower trapezoidal estimator, which is more accurate for skewed datasets.

K. Boyd, K. H. Eng, and C. D. Page, “Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals,” in Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, Eds. Springer Berlin Heidelberg, 2013, pp. 451–466.

source
auroc(scores, classes, pos_labels; rev)


Optimized function for computing the area under the receiver operator characteristic curve.

source
venn(scores, classes, pos_labels; rev)


Given N positive examples, computes the percentage of the top N/2 hits that are correct

source