# Simulation Internals

More in-depth documentation on specific types and functions.

## Contents

## Index

`Simulation.Barcode`

`Simulation.CRISPRi`

`Simulation.CRISPRn`

`Simulation.Cas9Behavior`

`Simulation.Delta`

`Simulation.FacsScreen`

`Simulation.GrowthScreen`

`Simulation.KDPhenotypeRelationship`

`Simulation.Library`

`Base.Sort.select`

`Simulation.auprc`

`Simulation.auroc`

`Simulation.construct_library`

`Simulation.counts_to_freqs`

`Simulation.differences_between_bins`

`Simulation.linear`

`Simulation.noise`

`Simulation.sequencing`

`Simulation.sigmoid`

`Simulation.signal`

`Simulation.transfect`

`Simulation.venn`

## Simulation Types

`Simulation.FacsScreen`

— Type.A type representing the parameters used in a typical FACS screen.

`num_genes`

Number of genes targeted by the screen

`coverage`

Number of guides per gene

`representation`

Number of cells with each guide.

`representation=10`

means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI`moi`

The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see

`Simulation.transfect`

).NoteWe

**do not**model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:\[P(x = 1; Poisson(λ))\]For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (

`num_genes`

*`coverage`

*`representation`

).

`σ`

The standard deviation expected for cells during FACS sorting. This should be set according to the biological variance experimentally observed, e.g. in the fluorescence intensity of isogenic cells

`bin_info`

Range of guide phenotypes to collect in each bin

# Example

In the following example

`p = 0.05 bin_info = Dict(:bin1 => (0.0, p), :bin2 => (1.0-p, 1.0))`

The 5th percentile of cells sorted according to their phenotype (fluorescence, size, etc) will be compared to the 95th percentile.

`bottleneck_representation`

Number of cells sorted expressed as an integer multiple of the number of guides

`seq_depth`

Sequencing depth as a integer multiple of the number of guides, i.e.

`seq_depth=10`

is equivalent to`10 * num_genes * coverage`

reads.

`Simulation.GrowthScreen`

— Type.A type representing the parameters used in a typical growth-based screen.

`num_genes`

Number of genes targeted by the screen

`coverage`

Number of guides per gene

`representation`

Number of cells with each guide.

`representation=10`

means that there are 10 times as many cells as guides during transfection. The actual number of cells per guide post-transfection will be less depending on the MOI`moi`

The multiplicity of infection, $\lambda$, of the screen. We model this as a Poisson process during transfection (see

`Simulation.transfect`

).NoteWe

**do not**model multiple infections. We assume that the MOI is properly selected and less than half the cells are transfected by any virus, i.e. $\lambda \lt 0.5$ and then select only the cells that have a single transfection occurrence:\[P(x = 1; Poisson(λ))\]For $\lambda = 0.25$ this works out to being $\approx 19.5\%$ of the number of cells (

`num_genes`

*`coverage`

*`representation`

).

`seq_depth`

Sequencing depth as a integer multiple of the number of guides, i.e.

`seq_depth=10`

is equivalent to`10 * num_genes * coverage`

reads.

`bottleneck_representation`

For growth screens, how much of a bottleneck is applied. This is minimum number of cells that is kept when passaging the pool of cells. This is expressed as an integer multiple of the number of guides.

`num_bottlenecks`

For growth screens, how many bottlenecks are applied. This is the integer number of passages during the growth screen.

`noise`

Before each passage, the theoretical phenotype of each cell is convolved with a normal noise distribution with a standard deviation, σ, dictated by this parameter. This should be set based on an expectation of noisiness of the subsampling

## Key functions

`Simulation.construct_library`

— Function.```
construct_library(setup, lib)
```

Constructs the guide library for `N`

genes with `coverage`

number of guides per gene. Returns a tuple of guides and their relative frequencies (assigned randomly).

`Simulation.transfect`

— Function.```
transfect(setup, lib, guides, guide_freqs_dist)
```

Simulates the transfection the of the library of guides (`guides`

) according to the parameters in `Simulation.Library`

`lib`

and the frequencies represented by the Categorical distribution `guide_freqs_dist`

.

This function creates a population of cells post-transfection where each cell has single guide RNA present according to the frequencies of the guides in the present in the library. The cells are then grown up so that there are `setup.bottleneck_representation * num_guides`

of them. For FACS screens we assume a simple linear expansion of the cells, while for Growth screens the cells grow according as a function of their phenotype and degree of knockdown.

`Base.Sort.select`

— Function.```
select(setup, cells, cell_phenotypes, guides; debug)
```

Given `cells`

, a vector of integers, and `guides`

, a vector of barcodes, performs simulated FACS sorting of the cells into `bins`

with the given cutoffs. `σ`

refers to the spread of the phenotype during the FACS screen.

```
select(setup, initial_cells, initial_cell_phenotypes, guides; debug)
```

Growth Screen selection

`Simulation.sequencing`

— Function.```
sequencing(depths, guides, samples)
```

Simulates next-gen sequencing by generating a Categorical distribution based on frequencies of each guide in each bin and then randomly samples from this distributions up to the depth provided in `depths`

for each bin. Returns a dict of DataFrames with the bin names as the keys. This object can then be passed through `Simulation.counts_to_freqs`

followed by `Simulation.differences_between_bins`

.

`Simulation.counts_to_freqs`

— Function.```
counts_to_freqs(bin_cells, guide_count)
```

Converts raw counts to frequencies by dividing by the total number of reads for each sample.

`Simulation.differences_between_bins`

— Function.```
differences_between_bins(raw_data)
```

Given the raw data from `Simulation.sequencing`

returns two DataFrames

`guide_data`

: This DataFrame contains the per-guide level data including the log2 fold change in the normalized frequencies of each guide between each pairwise combination of bins. Thus, if there are`n`

bins, then it computes the log2 fold changes for the $\frac{n!}{2(n-2)!}$ combinations`gene_data`

: This DataFrame contains the same information but grouped by gene. The log2 fold change data from the first DataFrame is used to calculate the average log2 fold change per gene and a pvalue computed using a Mann-Whitney U-test as measure of how consistently shifted the guides are of this gene versus the population of negative control guides. (see below for more info)

A typical 2 bin `guide_data`

DataFrame contains the following columns:

Column Name | Meaning |
---|---|

`gene` | the gene ID of that this guide targets |

`knockdown` | activity of the guide on 0 to 1 scale, where 1 is complete knockout |

`barcodeid` | the ID of this specific guide |

`theo_phenotype` | expected phenotype of this guide, generally a -1 to 1 scale |

`behavior` | whether the target gene displays a linear or sigmoidal response to incomplete knockdown (see `Simulation.Library` for more details) |

`class` | which phenotype distribution the target gene was drawn from (see `Simulation.Library` for more details). Serves as the "ground truth" label against which screen performance is evaluated, e.g. with `Simulation.auprc` |

`initial_freq` | frequency of guide post-transfection (see `Simulation.transfect` ) |

`counts_bin1` | the raw number of reads for each guide in the first bin |

`freqs_bin1` | the number of reads for each guide divided by the total number of reads in this bin |

`rel_freqs_bin1` | the frequency of each guide divided by the median frequency of negative control guides |

`counts_bin2` | the raw number of reads for each guide in the second bin |

`freqs_bin2` | the number of reads for each guide divided by the total number of reads in this bin |

`rel_freqs_bin2` | the frequency of each guide divided by the median frequency of negative control guides for this bin |

`log2fc_bin2_div_bin1` | the log2 fold change in relative guide frequencies between `bin2` and `bin1` , equivalent to `log2(rel_freqs_bin2/rel_freqs_bin1)` |

A typical `gene_data`

DataFrame contains the following data:

Column Name | Meaning |
---|---|

`gene` | this gene's ID |

`class` | see above |

`behavior` | see above |

`mean_bin2_div_bin1` | the mean log 2 fold change in relative frequencies between from `bin1` to `bin2` for all the guides targeting this gene. Calculated as $\frac{1}{k}\sum_k \text{log2fc_bin2_div_bin1}_k$ for the $k$ guides targeting each gene |

`pvalue_bin2_div_bin1` | the -log10 pvalue of the log2 fold changes of all guides targeting this gene as computed by the non-parametric Mann-Whitney U-test. A measure of the consistency of the log 2 fold changes[1] |

`absmean_bin2_div_bin1` | absolute value of `mean_bin2_div_bin1` per-gene |

`pvalmeanprod_bin2_div_bin1` | `mean_bin2_div_bin1` multiplied with the `pvalue_bin2_div_bin1` per-gene |

**Further reading**

**[1]**

Kampmann M, Bassik MC, Weissman JS. Integrated platform for genome-wide screening and construction of high-density genetic interaction maps in mammalian cells. *Proc Natl Acad Sci U S A*. 2013;110:E2317–26.

## Miscellaneous Types

`Simulation.Library`

— Type.Wrapper containing all library construction parameters

`knockdown_dist`

Distribution of guide knockdown efficiencies

`knockdown_probs`

`max_phenotype_dists`

Maximum phenotype categories mapped to their probability of being selected and the distribution to draw from if they are selected.

## Example

The basic layout is a

`Dict`

mapping a class name to a tuple of the probability of selecting this class and then the`Distributions.Sampleable`

from which to draw a random phenotype from this class. The probabilities across all the classes should add up to 1.`max_phenotype_dists = Dict{Symbol, Tuple{Float64, Sampleable}}( :inactive => (0.60, Delta(0.0)), :negcontrol => (0.1, Delta(0.0)), :increasing => (0.3, TruncatedNormal(0.1, 0.1, 0.025, 1)), ); Library(max_phenotype_dists, CRISPRi());`

For example, here we are making three different classes of "genes": the first group are :inactive, i.e. they have no phenotype, so we'll set their phenotypes to 0.0 using a

`Simulation.Delta`

. We'll also make them 60% of all the genes. The second group are the negative controls :negcontrol (the only required group) which make up 10% of the population of genes and also have no effect. The final group is :increasing which makes up 30% of all genes and which are represented by a Normal(μ=0.1, σ=0.1) distribution clamped between 0.025 and 1.

`phenotype_probs`

`kd_phenotype_relationships`

Knockdown-phenotype relationships mapped to their probability of being selected and their respective

`Simulation.KDPhenotypeRelationship`

`relationship_probs`

`cas9_behavior`

Whether this library is

`Simulation.CRISPRi`

or`Simulation.CRISPRn`

`Simulation.Barcode`

— Type.Any entity that is tracked through the pooled experiment. For CRISPR screens, this is equivalent to the sgRNA. This object stores properties relating to the performance of this entity in the screen.

`gene`

The target gene id

`knockdown`

The knockdown efficiency

`theo_phenotype`

The theoretical phenotype exerted

`obs_phenotype`

The observed phenotype (only relevant for FACS screens)

`behavior`

The gene behavior, either sigmoidal or linear

`class`

The gene activity class, either inactive, up or down

`initial_freq`

Initial frequency after transfection

A type representing a relationship between degree of knockdown and effect on phenotype

`Simulation.Cas9Behavior`

— Type.A type representing the behavior of different Cas9s

`Simulation.CRISPRn`

— Type.`type CRISPRn <: Simulation.Cas9Behavior`

CRISPR KO behavior is more complex since sgRNA-directed DNA damage repair is stochastic. We assume that 2/3 of repair events at a given locus lead to a frameshift, and that the screen is carried out in diploid cells. The assumption that only bi-allelic frame-shift mutations lead to a phenotype in CRISPRn screens for most sgRNAs is supported by the empirical finding that in-frame deletions mostly do not show strong phenotypes, unless they occur in regions encoding conserved residues or domains[2]

**[2]**

Horlbeck MA, Gilbert LA, Villalta JE, Adamson B, Pak RA, Chen Y, Fields AP, Park CY, Corn JE, Kampmann M, Weissman JS: Compact and highly active next- generation libraries for CRISPR-mediated gene repression and activation. *Elife* 2016, 5.

`Simulation.CRISPRi`

— Type.`type CRISPRi <: Simulation.Cas9Behavior`

CRISPRi behavior is simply determined by the activity of the guide

`Simulation.Delta`

— Type.`type Delta <: Distributions.Sampleable{Distributions.Univariate,Distributions.Discrete}`

Constructs a delta function at a given δ value. This distribution always emits the same value.

## Miscellaneous functions

`Simulation.signal`

— Function.```
signal(guide_data; log2fc_col)
```

Computes the signal in an experiment, where the experimental signal is defined to be the average of signal of true hit genes. That value, the true hit gene signal, is the average ratio of the log2 fold change for a guide targeting a specific gene to the guide's theoretical phenotype.

where $N_{true}$ is the number of true hit genes and $k$ is the number of genes.

`Simulation.noise`

— Function.```
noise(guide_data; log2fc_col)
```

Computes the noise in an experiment, where noise is defined to be the standard deviation of log2 fold change in the negative controls. If more than 2 bins are used then the name of log2 fold change information can be provided to differentiate between multiple log 2 fold changes.

`Simulation.linear`

— Function.```
linear(x, l)
```

Given value(s) `x`

apply a simple linear function with maximum value `l`

`Simulation.sigmoid`

— Function.```
sigmoid(x, l, k, p)
```

Given value(s) `x`

apply the sigmoidal function with maximum value `l`

, a steepness `k`

, and an inflection point `p`

.

`Simulation.auprc`

— Function.```
auprc(scores, classes, pos_labels; rev)
```

Computes the area under the Precision-Recall curve using a lower trapezoidal estimator, which is more accurate for skewed datasets.

K. Boyd, K. H. Eng, and C. D. Page, “Area under the Precision-Recall Curve: Point Estimates and Confidence Intervals,” in Machine Learning and Knowledge Discovery in Databases, H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, Eds. Springer Berlin Heidelberg, 2013, pp. 451–466.

`Simulation.auroc`

— Function.```
auroc(scores, classes, pos_labels; rev)
```

Optimized function for computing the area under the receiver operator characteristic curve.

`Simulation.venn`

— Function.```
venn(scores, classes, pos_labels; rev)
```

Given `N`

positive examples, computes the percentage of the top `N/2`

hits that are correct