At Counsyl, we test for a variety of autosomal recessive diseases. It turns out that not all genetic diseases are as simple as a single base mutation; some diseases occur as a result of a gene with varying copy number. When a person’s chromosomes have enough copies of the gene, they are healthy, but if there are no copies of the gene, complications occur. In order to accurately detect carriers of these types of diseases, we need more than just biological assays — we need reliable algorithms to process the data we get from the biochemical reactions we run and determine the likelihood of carrying the genetic disorder.

## Determining Gene Copy Numbers

Simplifying a bit for the sake of purity and ignoring some gruesome real-world details of biology, what we want to do is detect how many copies of the gene in question someone has. If they have one copy, they’re definitely a carrier – at least one of their chromosomes is completely missing the gene. If they have three copies or more, they’re almost certainly

*not*a carrier — the probability of having all three copies on the same chromosome is vanishingly small. If they have two copies, they’re probabably not a carrier, although there is some ancestry-dependent probability that both copies are on the same chromosome. (This is ancestry-dependent because the concentration of 2-copy chromosomes is different in different human populations, so the probability of having one depends on where your ancestors came from.)

In our lab, we use a reaction called

**quantitative polymerase chain reaction**(qPCR for short) to test for these copy-number variations. The

**polymerase chain reaction**is a method to amplify DNA. Starting with the sample, free nucleotides, short DNA sequences called “primers”, and an enzyme called DNA polymerase, PCR amplifies samples exponentially in a number of cycles. In each cycle of PCR, the polymerase will use the spare nucleotides floating around in the solution to make copies of the DNA from the sample, approximately doubling the amount of DNA (the primers are used to get the reaction started). The “quantitative” in qPCR refers to the ability to measure the concentration of DNA in the reaction after every cycle of PCR, rather than just at the end. At the beginning, the concentration grows exponentially (because each cycle approximately results in a doubling of DNA), but after a while the polymerase starts to run out of spare nucleotides, and the concentration growth slows down and eventually reaches a maximum.

And this, finally, is what gets us our answer. Starting with a known concentration of DNA, we can use PCR to amplify

*only*the relevant gene. Let me repeat — the only thing that gets amplified is the piece we’re interested in. If the amplification proceeds very slowly, then we can infer that the original amount of the gene was relatively low, and there’s probably only one copy of the gene present in the sample. If the amplification proceeds very quickly, then we can infer that there was a large number of gene copies present, so there must be three or more copies. And if the amplification proceeds “just right”, then we can infer that there were probably two copies of the gene.

Of course, this is all a ton of handwaving. What is “very quickly”? What is “very slowly”, or “just right”? What are we comparing to?

In order to have a point of comparison, we include some

*known*(or “control”) samples in our plate of reactions. A useful set of controls will cover the reference normal genotype (2 copies), as well as the possible likely mutations flanking the normal copy number: zero, one, and three copies. These control samples give us a good comparison point, so we can use the amplification data to reliably determine the copy number of the relevant gene in each of the patient samples in the batch.

## From Theory to Practice

Now that we’ve covered the biological theory behind what we’re doing, let’s work through how we can really use this data to detect disease carriers.

```
import json
def load_json(filename):
with open(filename, "r") as data_file:
data = json.load(data_file)
return data
# Load our sample data.
data = load_json("Amplification-Data.json")
# What's in this file, anyways?
print data.keys()
```

`cycles`

, a list of cycle indices, `sample`

, a list of amplification values of the sample we’re examining at each cycle, and `reference`

, a list of amplification values for the reference gene at each cycle. We need to include the reference normal genotype in our qPCR in order to have something to compare to: without it, we wouldn’t be able to normalize out the effects of varying qPCR efficiency and different initial concentrations of DNA. By comparing to a normal reference, though, we can differentiate between unusual qPCR efficiency and unusual copynumber, since the copynumber would apply only to the sample while the efficiency would affect the reference genes as well.
```
cycles = data['cycles']
sample_amplification = data['sample']
reference_amplification = data['reference']
print 'First five...'
print 'Cycles:', cycles[0:5]
print 'Sample Amplifications:', sample_amplification[0:5]
print 'Reference Amplifications:', reference_amplification[0:5]
```

(Note that in this case, we have data for every cycle, so

`cycles`

is just equal to `range(n)`

, but in general this may not be true.)Let’s graph these, and see what we have.

```
import matplotlib.pyplot as plt
# Display data in three separate plots.
fig, (top, middle, bottom) = plt.subplots(3, 1, sharex=True)
bottom.set_xlabel("Cycle Number", fontsize=13)
ylimits = (-0.5, 3)
for axis in (top, middle, bottom):
axis.set_ylim(ylimits)
# The top plot is the gene amplification data.
top.plot(cycles, sample_amplification, 'ro')
top.set_title("Gene Amplification Data")
# The bottom plot is the reference gene amplification data.
bottom.set_title("Reference Gene Amplification Data")
bottom.plot(cycles, reference_amplification, 'b.', figure=fig)
# The middle plot is both overlaid on top of each other.
middle.set_title("Amplification Data Comparison")
middle.plot(cycles, sample_amplification, 'ro')
middle.plot(cycles, reference_amplification, 'b.')
fig.set_size_inches(5, 8)
```

Let’s fit this data to logistic functions, so we have a function to work with instead of individual data points.

```
from scipy.optimize import curve_fit
import numpy as np
def logistic(x, vshift, hshift, width, scale):
"""A general logistic curve."""
x = np.asarray(x)
return vshift + scale / (1 + np.exp(-width * (x + hshift)))
def fit_logistic(x, y):
"""Fit a logistic curve to the data."""
# Estimate some initial parameters.
init_vshift = np.min(y)
init_hshift = 30
init_width = 0.01 / np.std(y)
init_scale = np.max(y)
# Fit to our data using a least squares error metric, starting at the initial guesses.
init_params = (init_vshift, init_hshift, init_width, init_scale)
param_opt, param_cov = curve_fit(logistic, np.asarray(x), np.asarray(y), init_params)
return param_opt
# Fit gene in question and reference gene amplification data.
gene_logistic_params = fit_logistic(cycles, sample_amplification)
gene_fit_values = logistic(cycles, *gene_logistic_params)
ref_logistic_params = fit_logistic(cycles, reference_amplification)
ref_fit_values = logistic(cycles, *ref_logistic_params)
# Add the fits to the plots.
top.plot(cycles, gene_fit_values, 'k-')
bottom.plot(cycles, ref_fit_values, 'k-')
middle.plot(cycles, gene_fit_values, 'k-')
middle.plot(cycles, ref_fit_values, 'k-')
display(fig)
```

First, we’re going to reduce every sample and reference gene pairing to a single number. We have four reference genes total, and 86 samples, so we’re going have 4 * 86 = 344 pairings total, each of which will serve as a data point. Each data point will be represented by just one value, which is going to be the difference in time (measured in cycles) between how long it takes the sample to reach a certain amplification point and how long it takes the reference gene to reach that point.

For clarity, let’s go over that again. We’re going to choose some value on the y-axis of the graphs above to be our amplification threshold. Then, we’re going to use our fitted curves to determine at which cycle the sample reaches that threshold and at which cycle the reference gene reaches that threshold. Finally, each sample-reference pair will be represented by a single value, which is the different of those cycle times. Let’s compute this for the example well we’ve been using.

If our amplification threshold is some value

*y*, we’d like to know for which value of

*x*our logistic

*f(x)*is equal to

*y*. We’ve fit values of our parameters

*v*,

*s*,

*w*, and

*h*using

`curve_fit`

, so we know the analytical expression for our logistic.Therefore, we can pretty easily compute the inverse function, which will get us the

*x*value from the

*y*value.

```
amplification_threshold = 1.75
def compute_cycle(y_value, vshift, hshift, width, scale):
"""Compute the cycle at which this logistic reaches the ampliciation threshold."""
ct = - (1/width) * log(scale / (y_value - vshift) - 1) - hshift
return ct
gene_ct = compute_cycle(amplification_threshold, *gene_logistic_params)
ref_ct = compute_cycle(amplification_threshold, *ref_logistic_params)
# Draw the threshold in green on the plots.
for axis in (top, middle, bottom):
axis.hlines(amplification_threshold, 0, 40, color='g')
# Draw where the fits hit the threshold.
top.vlines(gene_ct, *ylimits, color='y')
bottom.vlines(ref_ct, *ylimits, color='y')
middle.vlines(gene_ct, *ylimits, color='y')
middle.vlines(ref_ct, *ylimits, color='y')
display(fig)
```

```
# Compute the delta_ct value for this sample / reference gene pair.
delta_ct = gene_ct - ref_ct
print 'Delta_ct:', delta_ct
```

`delta_ct`

for each sample and reference pair. Next, we’d like to use these `delta_ct`

values to classify each sample as a 1-copy, 2-copy, or 3-copy sample.
## Sample Classification with Mixture Models

Let’s start by loading the `delta_ct`

values for an entire batch.

```
# Load our delta_ct values for an entire batch.
data = load_json("Delta-Ct-Data.json")
# What's in this file, anyways?
print data.keys()
```

`delta_ct`

s for two different replicas actually mean two different things.We can take a look at what the

`delta_ct`

distribution is like for each replica:
```
def choose_axis(axes, replicate):
"""Choose which subplot to draw on based on the replicate name."""
row = int(replicate[-1:]) - 1
if "TERT" in replicate:
axis = axes[row][0]
else:
axis = axes[row][1]
return axis
fig, axes = plt.subplots(2, 2)
for replicate in data:
# Display a histogram of delta_ct values.
axis = choose_axis(axes, replicate)
axis.set_title(replicate + " delta_cts")
axis.hist(data[replicate], bins=30, color='b')
fig.suptitle("delta_ct distributions", fontsize=16);
fig.set_size_inches(10, 10)
```

`delta_ct`

values to samples with two copies. On either side of the two copy cluster we see the one and three copy clusters, both of which have a much lower representation because having a deleted or extra gene is a rare mutation.In order to know precisely where we expect the 1-, 2-, and 3-copy clusters to be, we can use the controls we included. We have two controls of each type, and we can take the mean

`delta_ct`

value of those two controls (for each reference gene), and guess that those are approximately at the center of their respective clusters.Let’s take a look at these controls.

```
# Load the data for the controls.
controls = load_json("Controls-Data.json")
# What's in this file, anyways?
controls
```

`delta_ct`

value for each copy number. Let’s see where these lie in the histograms.
```
def choose_means(controls, replicate):
"""Get the control sample delta_ct means for this reference gene."""
if "TERT" in replicate:
means = controls['TERT']
else:
means = controls['RNaseP']
return {int(copynum_str): val for copynum_str, val in means.items()}
all_lines = []
for replicate in data:
axis = choose_axis(axes, replicate)
means = choose_means(controls, replicate)
# Display the means on the histograms in different colors.
lines = []
for copynumber, color in [(1, 'r'), (2, 'y'), (3, 'g')]:
# Have the vertical lines take up the entire vertical space.
limits = axis.get_ylim()
line = axis.vlines(means[copynumber], *limits, linewidth=4, color=color)
lines.append(line)
# Make sure the vertical lines don't cause rescaling.
axis.set_ylim(*limits)
all_lines.extend(lines)
fig.legend(lines, ("1-copy", "2-copy", "3-copy"), loc=(0.825, 0.75))
display(fig)
```

We’re almost done — the last step is to assign each sample to one of the three clusters we see, and use those to classify each sample as a 1-copy, 2-copy, or 3-copy sample. We’ve found that an incredibly robust method of doing that is to fit a Gaussian Mixture Model to these data.

Essentially, we’re assuming that our data is coming from a distribution that is composed of three different normal (Gaussian) distributions, each with its own mean and standard deviation. In addition, each normal distribution has a

*weight*, representing how much of the total data is coming from that distribution. Since we have three clusters, that gives us a total of eight parameters: for the means of our three clusters, for their standard deviations, and finally for their respective weights. (That looks like nine parameters, but it’s actually only eight, because we constrain the weights such that that , since each data point must come from one of the three distributions.)

If we knew beforehand which distribution each sample came from, then fitting these parameters would be trivial. But we don’t! So in addition to these eight distribution parameters, we need to use our data to figure out which normal component of the mixture model each of the samples came from.

The reason this problem is difficult is that in addition to fitting our label variables (the label of a sample indicates which distribution generated it), we have several

*unobserved*variables that are responsible for generating our labels. (The unobserved variables are the mus, sigmas, and

*w*s, since we have no way of sampling them directly.) If we just had one set of variables, this would be much simpler, but the fact that we have to distinct ones makes this tricky. But we’re in luck — there is a standard statistical method known as expectation maximization for solving these sorts of problems.

Expectation Maximization (EM) is an algorithm which starts out with initial guesses for the latent (unobserved) parameters of the distribution, and then iteratively improves them. During the expectation step, we can use our current Gaussian parameters estimates to classify each sample into one of the distributions. During the maximization step, we adjust our Gaussian parameter values based on the current sample labels. We can repeat these two steps several dozen times, and eventually, we’re guaranteed that the algorithm will converge to a (local) maximum of the likelihood. That is, it’ll find a set of parameters and labels that are self-consistent.

Although the final results of our EM algorithm can depend pretty heavily on our initial parameter guesses, our control samples give us a very reasonable starting point, since their copy numbers and

`delta_ct`

are known and are likely to be near their respective clusters. With that in mind, let’s go ahead and implement our mixture model. (Where by “implement” I of course mean “use the awesome scikit-learn library”!)
```
from sklearn import mixture
from scipy.stats import norm
models = {}
for replicate in data:
# Create a initial locations and a guess at the weights.
means = choose_means(controls, replicate)
init_means = np.asarray(means.values()).reshape((3, 1))
init_weights = (0.05, 0.8, 0.15)
# Fit a mixture model using these initial locations.
# We use standard initialization procedures for estimating the cluster variances.
gmm = mixture.GMM(n_components=3, init_params='c')
gmm.means_ = init_means
gmm.weights_ = init_weights
gmm.fit(data[replicate])
models[replicate] = gmm
```

```
# Gather all predictions.
predictions = {}
for replicate, values in data.items():
predictions[replicate] = models[replicate].predict(values)
# Convert this into one tuple per sample.
sample_predictions = zip(*predictions.values())
def assign_prediction(replicate_preds):
"""Only assign a prediction if all replicates agree."""
first_prediction = replicate_preds[0]
all_identical = all(pred == first_prediction for pred in replicate_preds)
if all_identical:
return first_prediction
else:
return None
classifications = map(assign_prediction, sample_predictions)
```

```
# Recall that the 0-th component is the 1-copy samples.
print '1-copy samples:', classifications.count(0)
print '2-copy samples:', classifications.count(1)
print '3-copy samples:', classifications.count(2)
print
print 'Failed to classify:', classifications.count(None)
```

```
def recolor(counts, bins, batches, values, classifications):
"""Recolor a histogram according to the sample classifications."""
for patch, right, left in zip(patches, bins[1:], bins[:-1]):
# Get sample classifications, and count how many are in each class.
classes = np.extract(np.logical_and(values >= left, values <= right), np.asarray(classifications))
class_counts = [np.count_nonzero(classes == cls) for cls in [0, 1, 2, None]]
# Find out which class is the plurality, and choose the color based on that.
plurality = class_counts.index(max(class_counts))
color = ['r', 'y', 'g'][plurality] if plurality is not None else 'k'
patch.set_facecolor(color)
fig, axes = plt.subplots(2, 2)
for replicate in data:
# Display a histogram of delta_ct values.
axis = choose_axis(axes, replicate)
axis.set_title(replicate + " delta_cts")
counts, bins, patches = axis.hist(data[replicate], bins=30)
recolor(counts, bins, patches, np.asarray(data[replicate]), classifications)
fig.suptitle("labeled delta_ct distributions", fontsize=16);
fig.legend(lines, ("1-copy", "2-copy", "3-copy"), loc=(0.825, 0.75))
fig.set_size_inches(10, 10)
```

*Andrew Gibiansky is a software engineering intern at Counsyl. He’s*

*currently attending Harvey Mudd College and majoring in*

*Mathematics. He keeps his own sparsely-updated blog and enjoys machine learning, data analysis, mathematical modeling and simulat*

*ion, and anything that gives him a reason to play with Haskell.*