Custom Simulations
This is an example of building a custom simulation of a FACS-based screen.
This section is complementary to the Implementation section of the paper
Setup
Lets first start the Julia REPL or a Julia session inside of a Jupyter Notebook and load the packages we'll need:
using Crispulator
using Gadfly
Basic screen parameters
First lets design a simple Crispulator.FacsScreen
with 250 genes with 5 guides per gene. Lets say that we make sure we have 1000x as many cells as guides during transfection (representation
) and sorting (bottleneck_representation
) and 1000x as many reads as guides during sequencing (seq_depth
). We'll leave the rest of the values as the defaults and print the object
s = FacsScreen()
s.num_genes = 250
s.coverage = 5
s.representation = 1000
s.bottleneck_representation = 1000
s.seq_depth = 1000
println(s)
FacsScreen <: ScreenSetup
Number of genes: 250
Number of guides per gene: 5 (1250 guides total)
Cells per guide at transfection: 1000 (1250000 cells total)
Multiplicity of infection (M.O.I.): 0.25
Number of cells per guide at sequencing: 1000 (1250000 cells total)
Number of cells per guide sorted: 1000 (1250000 cells total)
Gaussian standard deviation of sorting: 1.0 (phenotype units)
Bins:
bin1: 0-33 percentile of cells
bin2: 67-100 percentile of cells
Construction of true phenotype distribution
Next, lets make our distribution of true phenotypes. The basic layout is a Dict
mapping a class name to a tuple of the probability of selecting this class and then the Distributions.Sampleable
from which to draw a random phenotype from this class. The probabilities across all the classes should add up to 1.
For example, here we are making three different classes of "genes": the first group are :inactive
, i.e. they have no phenotype, so we'll set their phenotypes to 0.0 using a Crispulator.Delta
. We'll also make them 60% of all the genes. The second group are the negative controls :negcontrol
(the only required group) which make up 10% of the population of genes and also have no effect. The final group is :increasing
which makes up 30% of all genes and which are represented by a Normal(μ=0.1, σ=0.1) distribution clamped between 0.025 and 1.
max_phenotype_dists = Dict{Symbol, Tuple{Float64, Sampleable}}(
:inactive => (0.60, Delta(0.0)),
:negcontrol => (0.1, Delta(0.0)),
:increasing => (0.3, truncated(Normal(0.1, 0.1), 0.025, 1)),
);
Dict{Symbol, Tuple{Float64, Distributions.Sampleable}} with 3 entries:
:negcontrol => (0.1, Delta(0.0))
:inactive => (0.6, Delta(0.0))
:increasing => (0.3, Truncated(Distributions.Normal{Float64}(μ=0.1, σ=0.1); l…
The :negcontrol
class needs to be present because Crispulator normalizes the frequencies of all other guides against the median frequency of the negative control guides. Also the distribution of :negcontrol
guides serve as the null distribution against which the log2 fold changes of guides targeting a specific gene are assayed to calculate a statistical significance of the shift for each gene. See Crispulator.differences_between_bins
for more details.
Library construction
Now, we actually build the library. Here we're making a Crispulator.CRISPRi
library and then getting the guides that were built from the true phenotype distribution that we constructed above and we also get the frequency of each guide in the library.
lib = Library(max_phenotype_dists, CRISPRi())
guides, guide_freqs_dist = construct_library(s, lib);
(Crispulator.Barcode[Crispulator.Barcode(1, 0.878120833563954, 0.02601178650831795, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(1, 0.965258105212688, 0.028592975816678773, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(1, 0.6715851400901364, 0.019893764751353144, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(1, 0.8544841624248554, 0.025311618581607694, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(1, 0.9852213704566297, 0.029184329732547937, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(2, 0.8380539512951146, 0.0, NaN, :linear, :inactive, -Inf), Crispulator.Barcode(2, 0.8245079147510144, 0.0, NaN, :linear, :inactive, -Inf), Crispulator.Barcode(2, 0.7575091121895536, 0.0, NaN, :linear, :inactive, -Inf), Crispulator.Barcode(2, 0.0515177239881524, 0.0, NaN, :linear, :inactive, -Inf), Crispulator.Barcode(2, 0.7280375459151591, 0.0, NaN, :linear, :inactive, -Inf) … Crispulator.Barcode(249, 0.8727877400094837, 0.0, NaN, :sigmoidal, :negcontrol, -Inf), Crispulator.Barcode(249, 0.9048341862531871, 0.0, NaN, :sigmoidal, :negcontrol, -Inf), Crispulator.Barcode(249, 0.9942189015695218, 0.0, NaN, :sigmoidal, :negcontrol, -Inf), Crispulator.Barcode(249, 0.9724247798455464, 0.0, NaN, :sigmoidal, :negcontrol, -Inf), Crispulator.Barcode(249, 0.9407497886099611, 0.0, NaN, :sigmoidal, :negcontrol, -Inf), Crispulator.Barcode(250, 0.7458915279599929, 0.22431119015876855, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(250, 0.9839303489047019, 0.295896359353721, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(250, 0.8343090367030173, 0.2509008964009943, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(250, 0.9771582168942925, 0.29385978307657534, NaN, :linear, :increasing, -Inf), Crispulator.Barcode(250, 0.6666785376725564, 0.20048954926145715, NaN, :linear, :increasing, -Inf)], Distributions.Categorical{Float64, Vector{Float64}}(
support: Base.OneTo(1250)
p: [0.0007827106886745094, 0.0005121104490352278, 0.0003537343914889569, 0.0005651772283505308, 0.00015159386514736474, 0.0007283282074873836, 0.0028838202139263775, 0.0006046840669220868, 0.0003547646012889437, 0.00013371127911394548 … 0.0005789135617149528, 0.0011189923130810423, 0.0011109256391882178, 0.0002562241704448103, 0.00045358134346869657, 0.00031573307674935845, 0.00041388113433962574, 0.0002437956549870102, 0.0005827199496525896, 0.0006209619491373765]
)
)
Lets first look at what the true phenotype distribution of our different classes of guides looks like
df = DataFrame(Dict(
:phenotype=>map(x->x.theo_phenotype, guides),
:class=>map(x->x.class, guides),
:freq=>pdf.(guide_freqs_dist, 1:(length(guides)))
))
plot(df, x=:phenotype, color=:class, Geom.histogram, Guide.ylabel("Number of guides"),
Guide.title("Guide phenotype distribution"))
As you can see, most guides should have a phenotype of 0. In FACS Screens this is equivalent to having no preference to being in either the left (bin1
) or right (bin2
) bins. The :increasing
genes have a small preference to be in the right bin.
We can also look at the frequency of each guide in the library, which follows a Log-Normal distribution.
plot(df, x=:freq, color=:class, Geom.histogram(position=:stack),
Guide.xlabel("Frequency"), Guide.ylabel("Number of guides"),
Guide.title("Frequencies of guides in simulated library"))
Performing the screen
Now, we'll actually perform the screen. We'll first perform the transection via Crispulator.transfect
, followed by the selection process via Crispulator.select
:
cells, cell_phenotypes = transfect(s, lib, guides, guide_freqs_dist)
bin_cells = Crispulator.select(s, cells, cell_phenotypes, guides)
freqs = counts_to_freqs(bin_cells, length(guides));
Dict{Symbol, Vector{Float64}} with 2 entries:
:bin1 => [0.000836158, 0.000423215, 0.000285568, 0.000585516, 0.00014792, 0.0…
:bin2 => [0.000780687, 0.000433487, 0.000256805, 0.000540317, 0.000156137, 0.…
Lets look at what the observed phenotype distribution looks like when the selection was performed:
df = DataFrame(Dict(
:phenotype=>map(x->x.theo_phenotype, guides),
:class=>map(x->x.class, guides),
:obs_freq=>map(x->x.obs_phenotype, guides)
))
plot(df, x=:obs_freq, Geom.density, Guide.xlabel("Observed phenotype on FACS machine"),
Guide.title("Kernel density estimate of guide observed phenotypes"), Guide.ylabel("ρ"))
As you can see, this looks like many FACS plots, e.g. when looking at density along the fluorescence channel. A quick sanity check is that we should see a slight enrichment of the frequency of :increasing
genes on the right side
plot(df, x=:obs_freq, color=:class, Geom.density, Guide.xlabel("Observed phenotype on FACS machine"),
Guide.title("Kernel density estimate of guide observed phenotypes"), Guide.ylabel("ρ"))
And that is what we see. The change is really small (this is pretty usual), but the later analysis will be able to pull out the increasing
genes.
Sequencing and Analysis
Now we'll use Crispulator.sequencing
to simulate sequencing by transforming the guide frequencies into a Categorical distribution and drawing a random sample of reads from this distribution. Finally, we'll use the Crispulator.differences_between_bins
function to compute the differences between bins on a per-guide level (guide_data
) and per-gene level (gene_data
).
raw_data = sequencing(Dict(:bin1=>s.seq_depth, :bin2=>s.seq_depth), guides, freqs)
guide_data, gene_data = differences_between_bins(raw_data);
(1250×17 DataFrame
Row │ gene knockdown theo_phenotype behavior class initial_freq ⋯
│ Int64 Float64 Float64 Symbol Symbol Float64 ⋯
──────┼─────────────────────────────────────────────────────────────────────────
1 │ 1 0.878121 0.0260118 linear increasing 0.000801233 ⋯
2 │ 1 0.965258 0.028593 linear increasing 0.000431433
3 │ 1 0.671585 0.0198938 linear increasing 0.000275295
4 │ 1 0.854484 0.0253116 linear increasing 0.000567026
5 │ 1 0.985221 0.0291843 linear increasing 0.000139702 ⋯
6 │ 2 0.838054 0.0 linear inactive 0.000813559
7 │ 2 0.824508 0.0 linear inactive 0.00292142
8 │ 2 0.757509 0.0 linear inactive 0.000624551
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
1244 │ 249 0.972425 0.0 sigmoidal negcontrol 0.000287622 ⋯
1245 │ 249 0.94075 0.0 sigmoidal negcontrol 0.000419106
1246 │ 250 0.745892 0.224311 linear increasing 0.000320493
1247 │ 250 0.98393 0.295896 linear increasing 0.000439651
1248 │ 250 0.834309 0.250901 linear increasing 0.000242424 ⋯
1249 │ 250 0.977158 0.29386 linear increasing 0.000677966
1250 │ 250 0.666679 0.20049 linear increasing 0.000579353
11 columns and 1235 rows omitted, 214×7 DataFrame
Row │ gene behavior class pvalue_bin2_div_bin1 mean_bin2_div_bin1 ⋯
│ Int64 Symbol Symbol Float64 Float64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 1 linear increasing 0.0164375 0.0498722 ⋯
2 │ 2 linear inactive 0.51767 -0.0181567
3 │ 3 linear increasing 3.61572 0.620516
4 │ 4 linear increasing 1.23321 0.235206
5 │ 5 sigmoidal increasing 3.09029 0.364646 ⋯
6 │ 6 linear inactive 0.044653 0.107356
7 │ 7 linear increasing 3.52993 0.49624
8 │ 8 sigmoidal inactive 0.124439 0.103894
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
208 │ 242 linear inactive 0.0544275 0.108626 ⋯
209 │ 243 sigmoidal inactive 0.611977 0.172729
210 │ 244 sigmoidal increasing 0.0918785 0.0978016
211 │ 245 linear inactive 1.19167 -0.030791
212 │ 247 linear increasing 2.41616 0.655583 ⋯
213 │ 248 linear increasing 3.57269 0.544639
214 │ 250 linear increasing 3.82001 0.796165
2 columns and 199 rows omitted)
Here's what the per-guide data looks like:
Row | gene | knockdown | theo_phenotype | behavior | class | initial_freq | counts | barcodeid | freqs | rel_freqs | counts_bin1 | freqs_bin1 | rel_freqs_bin1 | counts_bin2 | freqs_bin2 | rel_freqs_bin2 | log2fc_bin2_div_bin1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Symbol | Symbol | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1 | 0.878121 | 0.0260118 | linear | increasing | 0.000801233 | 1032.5 | 1 | 0.000825587 | 1.18542 | 1032.5 | 0.000825587 | 1.18542 | 1009.5 | 0.000807196 | 1.30595 | 0.139703 |
2 | 1 | 0.965258 | 0.028593 | linear | increasing | 0.000431433 | 528.5 | 2 | 0.000422589 | 0.606774 | 528.5 | 0.000422589 | 0.606774 | 539.5 | 0.000431384 | 0.69793 | 0.201924 |
3 | 1 | 0.671585 | 0.0198938 | linear | increasing | 0.000275295 | 371.5 | 3 | 0.000297051 | 0.426521 | 371.5 | 0.000297051 | 0.426521 | 304.5 | 0.000243478 | 0.39392 | -0.114716 |
4 | 1 | 0.854484 | 0.0253116 | linear | increasing | 0.000567026 | 832.5 | 4 | 0.000665667 | 0.955798 | 832.5 | 0.000665667 | 0.955798 | 642.5 | 0.000513743 | 0.831177 | -0.20155 |
5 | 1 | 0.985221 | 0.0291843 | linear | increasing | 0.000139702 | 191.5 | 5 | 0.000153123 | 0.219862 | 191.5 | 0.000153123 | 0.219862 | 198.5 | 0.000158721 | 0.256792 | 0.223999 |
6 | 2 | 0.838054 | 0.0 | linear | inactive | 0.000813559 | 1011.5 | 6 | 0.000808796 | 1.16131 | 1011.5 | 0.000808796 | 1.16131 | 964.5 | 0.000771214 | 1.24774 | 0.103561 |
7 | 2 | 0.824508 | 0.0 | linear | inactive | 0.00292142 | 3849.5 | 7 | 0.00307806 | 4.41963 | 3849.5 | 0.00307806 | 4.41963 | 3557.5 | 0.00284458 | 4.6022 | 0.058397 |
8 | 2 | 0.757509 | 0.0 | linear | inactive | 0.000624551 | 758.5 | 8 | 0.000606497 | 0.870838 | 758.5 | 0.000606497 | 0.870838 | 732.5 | 0.000585707 | 0.947607 | 0.121884 |
9 | 2 | 0.0515177 | 0.0 | linear | inactive | 0.000299949 | 373.5 | 9 | 0.000298651 | 0.428817 | 373.5 | 0.000298651 | 0.428817 | 327.5 | 0.000261869 | 0.423674 | -0.017409 |
10 | 2 | 0.728038 | 0.0 | linear | inactive | 0.000115049 | 146.5 | 10 | 0.000117141 | 0.168197 | 146.5 | 0.000117141 | 0.168197 | 101.5 | 8.11594e-5 | 0.131307 | -0.357217 |
See Crispulator.differences_between_bins
for details on what each column means.
And the gene level data
Row | gene | behavior | class | pvalue_bin2_div_bin1 | mean_bin2_div_bin1 | absmean_bin2_div_bin1 | pvalmeanprod_bin2_div_bin1 |
---|---|---|---|---|---|---|---|
Int64 | Symbol | Symbol | Float64 | Float64 | Float64 | Float64 | |
1 | 1 | linear | increasing | 0.0164375 | 0.0498722 | 0.0498722 | 0.000819775 |
2 | 2 | linear | inactive | 0.51767 | -0.0181567 | 0.0181567 | -0.00939919 |
3 | 3 | linear | increasing | 3.61572 | 0.620516 | 0.620516 | 2.24361 |
4 | 4 | linear | increasing | 1.23321 | 0.235206 | 0.235206 | 0.290059 |
5 | 5 | sigmoidal | increasing | 3.09029 | 0.364646 | 0.364646 | 1.12686 |
6 | 6 | linear | inactive | 0.044653 | 0.107356 | 0.107356 | 0.00479377 |
7 | 7 | linear | increasing | 3.52993 | 0.49624 | 0.49624 | 1.75169 |
8 | 8 | sigmoidal | inactive | 0.124439 | 0.103894 | 0.103894 | 0.0129284 |
9 | 9 | linear | increasing | 3.23803 | 0.381213 | 0.381213 | 1.23438 |
10 | 10 | sigmoidal | inactive | 0.398435 | 0.137711 | 0.137711 | 0.0548688 |
We can generate standard pooled screen plots from this dataset. Like a count scatterplot:
nopseudo = guide_data[(guide_data[!, :counts_bin1] .> 0.5) .& (guide_data[!, :counts_bin2] .> 0.5), :]
plot(nopseudo, x=:counts_bin1, y=:counts_bin2, color=:class, Scale.x_log10,
Scale.y_log10, Theme(highlight_width=0pt), Coord.cartesian(fixed=true),
Guide.xlabel("log counts bin1"), Guide.ylabel("log counts bin2"))
And a volcano plot:
plot(gene_data, x=:mean_bin2_div_bin1, y=:pvalue_bin2_div_bin1, color=:class, Theme(highlight_width=0pt),
Guide.xlabel("mean log2 fold change"), Guide.ylabel("-log10 pvalue"))
And finally we can see how well we can differentiate between the different classes using Area Under the Precision-Recall Curve (Crispulator.auprc
)
auprc(gene_data[!, :pvalmeanprod_bin2_div_bin1], gene_data[!, :class], Set([:increasing]))[1]
0.9206869819545944
Crispulator.auroc
and Crispulator.venn
are also good summary statistics.