13 Pages

Universal DNA Tag Systems: A Combinatorial Design Scheme


Gain access to the library to view online
Learn more


Niveau: Supérieur, Doctorat, Bac+8
Universal DNA Tag Systems: A Combinatorial Design Scheme Amir Ben-Dor? Richard Karp† Benno Schwikowski‡ Zohar Yakhini Abstract Custom-designed DNA arrays o?er the possibil- ity of simultaneously monitoring thousands of hy- bridization reactions. These arrays show great potential for many medical and scientific applica- tions such as polymorphism analysis and genotyp- ing. Relatively high costs are associated with the need to specifically design and synthesize problem- specific arrays. Recently, an alternative approach was suggested that utilizes fixed, universal ar- rays. This approach presents an interesting de- sign problem—the arrays should contain as many probes as possible, while minimizing experimen- tal errors caused by cross-hybridization. We use a simple thermodynamic model to cast this de- sign problem in a formal mathematical framework. Employing new combinatorial ideas, we derive an e?cient construction for the design problem, and prove that our construction is near-optimal. 1 Introduction Oligonucleotides are short single-stranded pieces of DNA (typically 15-50 nucleotides) made by chem- ?Department of Computer Science & Engineering, Univer- sity of Washington. Supported by the Program in Mathematics and Molecular Biology (). †International Computer Science Institute and Mathemat- ical Sciences Research Institute, University of California at Berkeley. Supported by NSF grant at the University of Wash- ington (karp@icsi.

  • tat system

  • such greedy

  • design problem

  • solution-phase hybridization

  • specific part ligated

  • dna

  • facilitate sort- ing

  • sort into

  • pairs such



Published by
Reads 22
Language English

Presented at RECOMB 2000, extended version appeared in Journal of Computational Biology
Universal DNA Tag Systems: A Combinatorial Design Scheme
∗ † ‡ §Amir Ben-Dor Richard Karp Benno Schwikowski Zohar Yakhini
Abstract ical synthesis. In solution, oligonucleotides tend to
specificallyhybridize with their Watson-Crick com-
Custom-designed DNA arrays offer the possibil- plements ([21]), and form a stable DNA duplex.
ity of simultaneously monitoring thousands of hy- This specificity is exploited in molecular hybridiza-
bridization reactions. These arrays show great tion assays, in which oligonucleotides are used as
potentialfor many medicaland scientific appilca- probes to identify any complementary (or near-
tions such as polymorphism analysis and genotyp- complementary) DNA from a complex mixture of
ing. Relatively high costs are associated with the target DNA.
need to specifically design and synthesize problem- Array-based hybridization assays, introduced in
specific arrays. Recently, an alternative approach the late 1980s [6, 13, 15, 3, 5, 8], offer the pos-
was suggested that utilizes fixed, universal ar- sibility of simultaneously monitoring a multitude
rays. This approach presents an interesting de- with(currently up to tens of thousands) of hy-
sign problem—the arrays should contain as many bridization reactions. In such an assay, a target-
probes as possible, while minimizing experimen- specific set of oligonucleotides is synthesized on a
talerrors caused by cross-hybridization. We use solid support surface (e.g., silicon or glass). A fluo-
a simple thermodynamic model to cast this de- rescently labeled target sample mixture of DNA or
sign problem in a formal mathematical framework. RNA fragments is then brought in contact with the
Employing new combinatorial ideas, we derive an treated surface, and allowed to hybridize with the
efficient construction for the design problem, and synthesized oligonucleotides. Scanning the fluores-
prove that our construction is near-optimal. cent labels of the fragments attached to the array
reveals information about the content of the sam-
ple mixture. Theoretically, the assay conditions1 Introduction
are such that hybridization only occurs in sites on
Oligonucleotides are short single-stranded pieces of the surface that are Watson-Crick complements to
DNA (typically 15-50 nucleotides) made by chem- some substring in the target. In practice, cross-
∗ hybridizationis a main source of cross-signalcon-Department of Computer Science & Engineering, Univer-
sity of Washington. Supported by the Program in Mathematics tamination in any array-based hybridization assay.
and Molecular Biology (amirbd@cs.washington.edu). Array-based hybridization assays show great†InternationalComputer Science Institute and Mathemat-
icalSciences Research Institute, University of Caifol rnia at potentialfor many different appiclations such as
Berkeley. Supported by NSF grant at the University of Wash- SNP genotyping [12], gene expression profiling
ington (karp@icsi.berkeley.edu).
‡ [4], and resequencing DNA [14, 12]. Recently,Department of Computer Science & Engineering, Univer-
sity of Washington. Supported by the German Academic Ex- S. Brenner and others [1, 2] suggested an alterna-
change Service (DAAD) (benno@cs.washington.edu).
§ tive approach based on universal arrays contain-Chemicaland BiologicalSystems Department, Agilent Lab-
oratories, a Hewlett-Packard Subsidiary (zohary@hpl.hp.com). ing oligonucleotides called antitags.TheWatson-
Crick complement of each antitag is called a tag.
The tag–antitag pairs are designed so that each
tag hybridizes strongly to its complementary an-
titag, but not to any other antitag. In this ap-
proach, the analysis of a DNA sample consists oftwo steps: solution-phase hybridization followed by 2. When an individualis to be genotyped, a
solid-phase hybridization. In the first step, hy- sample is prepared that contains the se-
bridization takes place between the target DNA quences flanking each of the SNP loci.
in solution and a set of oligonucleotide precursors The sample is mixed with the reporter
called reporter molecules. Each reporter molecule molecules. Solution-phase hybridization then
consists of a target-specific part ligated to a unique takes place. Assuming that specificity is per-
tag. Reporter/target hybridization events are reg- fect, this results in the flanking sequences of
istered (e.g by an enzymatic reaction). In the sec- the SNPs paired only with the appropriate
ond step the modified precursors are introduced reporter molecule.
to the array. Tags form duplexes with the corre-
3. Single nucleotides, A,C,T,G, fluorescently la-sponding antitags. Thus, the reporter molecules
beled with four distinct colors, are added toare sorted into different locations on the array and
the mixture. These labeled nucleotides hy-hybridization events can be determined. This ap-
bridize to the polymorphic site of each SNPproach has severaladvantages:
and are ligated to the corresponding reporter
• Complicated array manufacturing processes molecule. That is, each reporter molecule is
are required only for the fixed, universal com- extended by exactly one labeled nucleotide.
ponent of the assay. These universalcompo-
4. The extended reporter molecules are sepa-nents can therefore be mass-produced, signif-
rated from the sample fragments, and broughticantly reducing manufacturing costs.
intocontact with theuniversalarray. Assum-
• The assay components that need to be de- ing that specificity is perfect, the tag part of
signed for a specific target are involved in so- each reporter molecule will only hybridize to
lution phase processes. The underlying nu- its complementary antitag on the array. Thus
cleic acid chemistry and thermodynamics are the extended reporter molecules sort into the
better understood than the same aspects of array sites where the corresponding antitag is
surface-based processes. Therefore a more ef- present.
ficient and effective design process is facili-
5. For each site of the array, the fluorescent col-tated.
ors present at that site are detected. The col-
As an example, we describe a multiplexed SNP ors indicate which bases were used for the ex-
genotyping assay. SNPs (single nucleotide poly- tension at the corresponding SNP site, and
morphisms) are differences, across the population, thus revealthe SNPvariationspresent inthe
in a single base, within an otherwise conserved ge- individual.
nomic sequence [9]. Genotyping is a process that
The design problem for a DNA TAT systemdetermines the variants present in a given sample,
presents a tradeoff. Clearly, it is desirable to haveover a set of SNPs. This assay uses off-the-shelf
as many tags as possible, in order to maximize theuniversalcomponents: a universalset of oilgonu-
number of SNPs that can be genotyped in parallel.cleotide tags and a universal array of antitags. The
On the other hand, if too many tags are used, sim-antitags, immobilized on the array, are Watson-
ilar tags will necessarily entail cross-hybridizationCrick complements of the tags in the mixture. The
events (where tags hybridize to foreign antitags),whole system will be called a DNA Tag/AntiTag
reducing the accuracy of the assay.system and in short a DNA TAT system. Con-
This design problem was identified in previoussider a set of SNPs to be genotyped. The assay is
work and several formulations and solutions wereperformed as follows (see Fig. 1):
proposed [10, 1, 2, 18, 11]. These papers differ
1. A set of reporter molecules (one for each both in the way hybridization is modeled, and in
SNP) is synthesized in solution. Each re- the algorithmic approach employed to find a good
porter molecule consists of two parts that are DNA TAT system. In [10] a TAT system is de-
ligated (in string language: concatenated) to- scribed as a part of a strategy for surface-based
gether. The first part is the Watson-Crick DNA computing. The authors take a coding theory
complement of the upstream sequence that approach and choose to modelcross-hybridization
immediately precedes the polymorphic site of constraints as generalHamming distance condi-
the SNP. The second part of each reporter tions. A set of 108 8-mers, with a 50%G/C content,
molecule is a unique tag from the universal which differ in at least 4 bases from each other, is
set of tags.Fragments spanning the
polymorphism sites for all
the SNPs in the set are
extracted. The different
shapes denote different
Oligonucleotides complementary to the sequences
immediately preceding the polymorphism sites are
tagged by DNA tags, designed to specifically
hybridize to their complements on the array.
Extension reactions take place in solution phase,
in the presence of a mixture of all four dydeoxy
nucleotides (differentially fluorescently labeled)
and an appropriate enzyme. For each SNP the
extending base is the one complementary to the
one corresponding to the base present in the
sample sequence. After separation (the whole
process can be performed in high temperature) a
mixture of reporter molecules is formed. This
mixture is brought in contact with the array.
Tags hybridize to their complements and a
fluorescence pattern is obtained from which the
identity of all variants in the original mixture can
be deduced.
Figure 1: Schematic for SNP genotyping using a DNA TAT systemconstructed, and experimentally tested for cross- this modelto deriveaformalTATdesign problem.
hybridization. DNA duplexes are held together by weak hydro-
In [1] the method of using a DNA TAT system gen bonds formed between Watson-Crick comple-
to sort target DNA is presented, together with sev- mentary nucleotides. Denaturation (or melting)
eral examples of applications. The model assump- of a DNA duplex is generally achieved by heat-
tion is that two oligonucleotides of length n need to ing the solution to a temperature which disrupts
have perfectly complementary substrings of length the hydrogen bonds. The energy required to sep-
more than λ in order to form a reasonably stable arate complementary DNA strands is dependent
duplex. A set of n-mersissaid to bea λ-freecode on a number of factors, notably: strand length—
if no two elements of the set have a common sub- longer duplexes contain a large number of hydrogen
string of length more than λ.Given n, the design bonds and require more energy to separate them—
problem implied in [1] is to construct the largest and base composition, because C–G pairs have one
possible λ-free code. more hydrogen bond than A–T pairs, strands with
A De Bruijn sequence of order λ is a cyclic se- higher C–G content are more difficult to separate
quence in which each possible λ-mer occurs exactly than those with low C–G content [20].
once [7]. In [2] the authors observe that by pars- As the assay temperature increases, the prob-
ing a De Bruijn sequence of order λ,an optimal ability that a duplex will be separated increases.
λλ-free code, of size 4 /(n−λ+1) can be obtained. A usefulmeasure of the hybridization behavior of
However, the authors also recognize the shortcom- two oligonucleotides U and V is theirmeltingtem-
ing of their highly simplified mathematical model. perature t (U,V ) (in degrees Celsius). This is theM
Tocapture cross-hybridization,amodelhastoem- temperature at which, under stated experimental
body thermodynamic properties of DNA duplexes. conditions, half of the U and V oligonucleotides
The work we present here improves on this aspect will be in a single-stranded form, and half will oc-
by using De Bruijn sequences in a different way. cur in duplexes.
Additionally, the authors of [2] suggest a Let f denote the fraction of duplexes formed
greedy approach to designing DNA TAT systems— at a specific array site between an immobilized
starting with an empty set and iteratively adding oligonucleotide and a fluorescently labeled tag. We
a new tag to it, provided it does not hybridize assume that we are given two design parameters
with any of the complements of the tags already in- 0≤ α≤ β ≤ 1 such that:
cluded. This approach clearly allows the use of ar-
• Iff>β, the resulting fluorescent level is in-bitrarily complex thermodynamical models. How-
terpreted as a positive result.ever, there is no analysis for the performance of
such greedy heuristics. • Iff<α, the resulting fluorescent level is re-
Recently, a small prototype universal array was ported as a negative result.
used to detect K-ras mutations in tumor and cell
Moreover, we assume that the experiment is per-line DNA [11]. The results reported suggest that
formed at temperature t. The parameters α anduniversal arrays may be used to rapidly detect low-
β can be translated into two temperature param-abundance mutations in any gene of interest.
eters, C and H (C<t<H) with the followingIn the context of hybridization arrays the DNA
property: If a duplex has a melting temperatureTAT system is used in order to facilitate sort-
of at most C, in the experiment (done at temper-ing molecules into defined physical locations. The
ature t), at most a fraction of the duplexes willsame idea is also applicable for other addressing
form. Similarly, if a duplex has a melting temper-systems such as coded beads.
ature of at least H, then at temperature t,at leastIn this paper we use a simple thermodynamic
afraction δ of the duplexes will form.modelof hybridization to give a precise formual-
The designgoalisto construct a largeTATsys-tion of the TAT system design problem. We em-
tem such thatploy new combinatorial ideas to provide an efficient
construction and prove that our solution is near- ¯1. for each tag U, t (U,U)≥ H,andM
¯2. for any two distinct tagsU andV , t (U,V ) <M
C.1.1 Thermodynamicmodel
Such a system, if used with temperature t,wouldIn this section we describe a simple thermody-
allow each tag–antitag hybridization to be de-namic modelof DNA duplex formation, and use
tected, without any cross-hybridization errors.Notethat, ina practicalapplication, C can be low- complementary to each other, the above require-
ered and H can be increased to provide a buffer ment can be stated in terms of solely the tag se-
against parameter inaccuracies and limitations in quences.
the melting temperature model used, imperfect
temperature control, and further experimental bi- 2 TATsystemdesignproblem
For estimating the melting temperature of a du- We can now formally define the TAT system design
plex between a tag and its antitag we employ the problem. Our goal is to construct a TAT system
2–4 rule that is commonly used for short oligonu- with a maximum number of tag–antitag pairs such
cleotides [20]: The melting temperature of a se- that the following properties are satisfied:
quence and its complement is approximately twice
¯(1) For each tag–antitag pair (U,U)the melt-the number of A–T base pairs plus four times the
ing temperature (using the 2–4 rule) satisfiesnumber of C–G base pairs. This model also enables ¯t (U,U)≥ HMus to state a condition that is sufficient to exclude
cross-hybridization errors. The sufficient condition (2) For any two distinct tags U and V,and for
is derived from a characterization of the duplex for- each oligonucleotide x that occurs as a sub-
mation process by Southern et al. [19] string in both U and V , t (x,x¯)<C.M
“...the process begins by the formation We now present a conservative formalization of
of a transient nucleation complex from the above TAT design problem. Specifically, we
the interaction of very few base pairs. not only forbid that a string x with t (x,x¯) ≥ CM
Duplex formation proceeds, one base occurs in any two distinct tags, but we also forbid
pair at a time, through a zippering pro- that it occurs twice in the same tag. This mild re-
cess. At any point the reaction may striction has also been imposed in previous work [2]
go in one of two directions, pairing or and allows a rigorous analysis and a near-optimal
separation: if bases are complementary constructive solution.
and freely available for pairing, duplex As usual, we model oligonucleotides as strings
formation is more likely to proceed; if over the alphabet Σ ={A,C,T,G}. We assume that
bases are non-complementary or a sta- the parameters C and H are fixed. To keep our ex-
ble structure inhibits base pair forma- position simple, we assign to each string the num-
tion, the block to the zippering process ber that corresponds to half of its melting temper-
may drive the nucleation complex to fall ature in degrees Celsius.
apart. Duplex formation, and hence du-
The weight w(s) of a string s =plex yield, will be determined by the sta- kbility of the nucleation complex and of a a ···a is w(a ),wherew(A)= w(T)= 11 2 k ii=1
intermediates up to the point in the zip- and w(C)= w(G)=2. Giventwoparameters cand
pering process where the likelihood of h,wecallasetT ofstringsor“tags”a validc−h
strand separation is negligible.” code if the followingtwoconditions aresatisfied:
Based on the above characterization and the 2– Eachtaghasaweightof hor more.
4 rule, we make the following assumptions.
Any substring of weight c or more
1. Stable hybridization is always initiated by the occurs at most once.
formation of a nucleation complex.
Note that a valid c–h code corresponds to a
2. A nucleation complex is a region of perfect solution of the TAT design problem in which the
base pairs between the tag and the foreign lower melting temperature C is 2c and the upper
antitag. melting temperature H is 2h.We call the problem
of finding a maximum valid c–h code the
bi-3. The melting temperature of the nucleation
. See Fig. 2 forcomplex can be computed according to the
an example of a valid code.2–4 rule.
In the next section we derive an upper bound
Following these assumptions, we aim at avoid- that, in particular, implies that the code in Fig. 2
ing the situation where a tag contains a substring is optimal. That is: there exists no valid 4–10 code
x and a non-complementary antitag contains x¯, with more than 12 tags.
where t (x,x¯) ≥ C. Since tags and antitags areM
DesignGACCAAT CAGCTAT GTCGATA CTGGTTA weight to the tailweight of T. The first two char-
CATTATCA GAAATTCT CTTAATGA GTATTTGT acters do not terminate a token because they do
ATATAGTG TAAAACTC AATAAGAG TTTTACAC not terminate a suffix of weight ≥ c. In the gen-
eralcaseofatag T in a validc–h code, the maximal
Figure 2: A valid 4–10 code with 12 tags prefix that does not contain a suffix of weight≥ c
hasa totalweightofatmost c−1. Since the weight
of a tag in a valid code is at least h (Condition 1),
3 Upperbound we have the following lower bound.
Any tag in a valid c–h code has a tailIn this section we derive a tight upper bound for
weight of at least h−c+1.the number of tags in a valid c–h code. The idea
is to associate a numericalresource with each tag. Based on Conditions 1 and 2’, we now derive
We show that any tag has to use a certain minimum the upper bound on the totaltailweight of a valid
amount of resource, and the global resource usage c–h code. We use <n> to denote the set of strings
over all tags cannot exceed a certain maximum. with weight n∈N,and G to denote the numbern
The upper bound on the number of possible tags of such strings. It is straightforward to derive the
then follows from dividing the global upper bound recurrence G =2, G =6, and G =2· G +1 2 n n−2
by the minimum amount of resource used by each 2· G for n ≥ 3, and for the sake of simplicityn−1
tag. we define G := 1. Using standard techniques for0
Roughly speaking, the limited resource we con- solving recurrences it can be shown that, for all
sider consists of those substrings in a tag with a n∈N, G is the nearest integer ton
weight of c or more that can occur only once in a √ √ n3+ 3valid c–h code. The following definition captures · 1+ 3 .
6the minimalsuffixes that can occur only once in a
valid c–h code.
tokens in a valid code, we partition tokens into four
n2 We call a string t a c-token if
classes. In our symbolic representation of thesew(t) ≥ c,but t does not properly contain a suffix
classes, we denote any character with a weight ofof weight ≥ c.
1(A or T)by W (“weak”) and a character with a
It is straightforward to see that a substring of weight of 2 (C or G)by S (“strong”). To see how
weight≥ c occurs twice ifand only ifsomec-token the set of tokens is partitioned, observe that any
occurs twice. We can therefore replace Condition 2 token is terminated by either a strong or a weak
in our problem by the following equivalent condi- character, and it has a weight of either c or c+1.
tion. Tokens of weight c + 1 begin with a strong charac-
ter, since a string of weight c+1 that begins with aAny c-token occurs at most once.
weak character properly contains a suffix of weight
c. Figure 1 lists the corresponding four classes ofHereafter, we refer to a c-token simply as a token.
tokens, their maximalcardinaitl ies, and the maxi-With each token, we associate its tail weight,the
maltotaltailweight they can contribute ina validweight of its terminalcharacter. The tailweight
code.of a tag T is the sum of the tailweights of althel
tokens it contains as substrings. Figure 3 gives an Max. occurrences Max.
example for T =GACCAAT and c=4. Token class in valid code tail weight
<c− 2>S 2·G 4·Gc−2 c−2Tag T: GACCAAT Token’s tailweight
S<c− 3>S 4·G 8·Gc−3 c−3GAC 2
CC 2 <c− 1>W 2·G 2·Gc−1 c−1
Tokens: CCA 1 S<c− 2>W 2·G 2·Gc−2 c−2CAA 1
Tailweight of T:7
Table 1: Bounds on the number of tokens and their
tailweight in a vaidl c–h codeFigure 3: Tokens and tailweight
Notice that all characters of T except the first The totaltailweight of each casl s is computed
two terminate a token and thus contribute their by multiplying its size by the weight of the termi-
.nal character of its members. Only the last row In the second stage of our construction, the tags
requires additionalexpanl ation: Observe that any of our design are extracted as substrings from the
token in the class S<c− 2>W contains a token of circular strings, as illustrated in Fig. 4. To satisfy
the form S<c− 2> as a substring, and thus only Condition 1, each of the extracted substrings has
2· G tokens of this form can exist in a valid aweightof h or more. To satisfy Condition 2’, thec−2
code. Therefore the number of tokens of the form overlap between two tags has a weight of at most
S<c− 2>W cannot exceed 2· G . Summing up c− 1.c−2
the rightmost column proves the following lemma.

The total tail weight of all tags con-
tained in a valid c–h code is at most

2·G +6·G +8·G .c−1 c−2 c−3

Combining this with Lemma 1 yields the follow-
ing upper bound.

Anyvalid c–h code contains atmost

2·G +6·G +8·Gc−1 c−2 c−3
For h =10 and c = 4, the upper bound is
2·16+6·6+8·2 Figure 4: Second stage—Extracting tags from cir-= 12, which proves:
cular strings
The 4–10 code in Fig. 2 is optimal.
Tags can be extracted from a circular string by a4 Ourconstructionusingcircularstrings
straightforward greedy algorithm that iterates the
following operation. Starting at some position, theIn this section we describe a method of construct-
algorithm collects characters until their cumulativeing a nearly optimal c–h code for arbitrary values
weight reaches or exceeds h, forming one tag, andof c and h. Specifically, our code comprises at least
then tracks back over as many characters as possi-
2·G +6·G +4·Gc−1 c−2 c−3 ble without collecting a weight of c or more. This− 1 tags.
h−c+3 operation is repeated untilsome overlap of weight
≥ c with the first extracted tag occurs, and the lastComparing our code with the upper bound, and
retrieved tag is discarded. Given the best start po-using the recurrence for G , one finds that ourn
sition, this algorithm produces the largest numbermethod at least achieves a factor of approximately
of tags that are substrings of a given circular string0.89· (h− c+1)/(h− c + 3) relative to the up-
and can be included in a valid code. Observe that,per bound. For the values c =12 and h =30
since each character has a weight of 1 or 2, each tagthat can be seen as relevant in practice, our con-
extracted in this manner has a weight of at moststruction yields 12119 tags, which corresponds to
h + 1, and the overlap between two tags is at least87.6% of the upper bound of 13840 one gets from
c− 2. Therefore, each circular string C leads toTheorem 1. w(C)at least − 1 tags. This lower bound for theThroughout our exposition we will assume that h−c+3
number of tags extracted from each cycle motivatesthe parameters c and h ≥ c are fixed. In addition
us to consider the following formal problem.we will assume that c is even. The case of an odd
value of c requires only small modifications.
Giventhe parametersc> 0 andh>c,constructa
4.1 Constructionoverview set C of circularstrings that contain any substring
of weight ≥ c at most once, and maximize
Our construction proceeds in two stages. In the
first stage we construct a set of circular strings in w(C)
− 1 (1)which each token occurs at most once. The char- h−c+3
C∈Cacters of a circular string are arranged in a cyclic
order, and when convenient, we will assume that a The construction we will describe optimally
specific character is designated as its origin. solves this problem. Specifically, our construction