
A Sequence of Letters

Object Lessons from the Human Genome Project
By Tim Fessenden

I

What kind of object is a DNA sequence, separate from the elements that constitute it?

Starting in the mid-1980s, biology researchers with an affection for codes and sequences began a long endeavor to define the entire sequence of the human genome. The Human Genome Project was a decades-long series of meetings, research publications, and funding decisions that pushed and prodded genetic research techniques, computer algorithms, technologies for storing data, bacteria, yeast, humans, insects, and mice. These same lively technologies and critters tugged, beguiled, and shocked their human collaborators. Human and nonhuman actions together produced a new object: the complete sequence of the human genome.

The birth of this genetic sequence can be dated precisely to April 14, 2003, the date on which an “essentially complete” genome was announced[1] by the Human Genome Project. Yet this achievement had been preempted on June 26, 2000, when President Bill Clinton and Prime Minister Tony Blair announced the complete “rough draft” of the genetic code in a joint press conference[2]. The rough draft had some spelling errors and was missing some sections, as though one had left an introductory or concluding sentence to be filled in later. But they were too excited by the rough draft to let it pass uncelebrated.

This object is not the same as your DNA. It is a wobbly linchpin in a messy and crowded scientific definition of “human-ness.” The human genome is an ongoing collaboration between the very capable goo inside the cells throughout our bodies and the human scientific regime that chemically separated its elements, called nucleotides. It is also quite simply a sequence of 2.9 billion of these nucleotides (A, T, G, and C only, please). The uses and meanings piling up on this sequence originate from the genetic code, a dramatic human interpretation of DNA sequences that was already around 50 years old at the birth of the human genome.

II

The US president’s remarks in June of 2000 referred to both “the language with which God created life” and “one of the most wondrous maps ever produced.” Here at the glorious moment of its public debut, the human genome was a language (thanks to its code-ness) but also a map. Its distinct map-ness, the basis of its interpretability and an essential ingredient in this new object, originated from an earlier map that existed separately from DNA sequences.

The Human Genome Project used several well-worn tricks to render the long slimy polymers wadded inside of living stuff legible to our human minds. Biologists had become very used to a view of large genome neighborhoods where genes resided. This larger scale did not consider the individual DNA nucleotides (much too small). Instead, genetics researchers had drawn maps of these neighborhoods, called linkage maps, and developed methods to study them. Linkage maps are linear, ordered assemblies of genes or genetic elements, like the stops on a train line.

Linkage maps are wholly different objects from genetic sequences. Starting roughly with Thomas Morgan in the 1910s, linkage maps have treated the genome as a series of genes that are more or less likely to be inherited together (linked). They have their own unit of distance (the centimorgan), which is measured by heredity. A gene’s placement on this map only somewhat corresponds to the number of letters that separate it from neighboring genes. For this reason linkage maps diagrammed genomes at a much lower, fuzzier resolution than the letter-by-letter sequence. The embodied, material nature of genomes as sequences of chemically linked nucleotides only came after the spree of sequencing efforts of which the Human Genome Project was the start. A complete human linkage map was a necessary step, a scaffolding, along the way to producing the full sequence of the human genome. It was completed six years before the rough draft of the genome was announced, at which time Francis Collins, leader of the Human Genome Project, intoned simply[3], “We have a good map.”
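To make "distance measured by heredity" concrete, here is a minimal sketch of how a centimorgan figure can be estimated from a breeding experiment. It assumes the common approximation that one centimorgan corresponds to a 1% chance that two markers are separated by recombination in a generation, which only holds for markers that sit fairly close together. The function name and the offspring counts are hypothetical, chosen for illustration.

```python
def map_distance_cm(recombinant_offspring: int, total_offspring: int) -> float:
    """Estimate the genetic distance between two markers in centimorgans.

    Assumption: 1 cM ~ 1% recombination frequency. This approximation
    only holds for short distances; longer distances need a correction
    for multiple crossovers, which is not modeled here.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must lie between 0 and total_offspring")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical cross: 17 of 100 offspring carry a recombined pair of markers.
distance = map_distance_cm(17, 100)
print(f"Estimated distance: {distance:.1f} cM")
```

Nothing in this calculation counts nucleotides. The distance comes entirely from how often two traits travel together between generations, which is why a linkage map and a letter-by-letter sequence only loosely agree.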

The work of generating this new thing called the sequence of the human genome was just as much a reconciliation of maps at different scales as a journey of pure discovery. The labor of sequencing itself, not to be discounted, would still have been meaningless without this scaffolding.

III

The reconciliation at the core of this object had to align two systems of measurement: one based on inheritance (the linkage map) and the other based on DNA sequences. The latter were generated in very small bits at a furious rate over the course of the project. The process of unifying them into one object, the human genome, required another essential ingredient: alignment algorithms.

For at least a decade before the Human Genome Project started, researchers had studied the similarity of DNA sequences plucked from different species, or from different genetic bits in the same critter. The practice of matching sequences based on similarity drew on alignment algorithms from math and from constantly improving computing power. Alignment algorithms were not just needed to detect perfect matches. Often DNA sequences are more or less similar but not identical, with a smear of similar-ness that might reflect differences between organisms, differences between bits of DNA within one genome, or (frankly less exciting for most biologists) differences between individual humans. Alignment is analog, qualitative, and elastic. Those who loved DNA sequences built algorithms to accelerate its calculation but also to express its shades and magnify its degrees.

Alignment scores conveyed a distance between more or less similar sequences. Here distance is in informational space and corresponds to how many changes would be required to convert one sequence into the other. Speed of computation was also essential, and algorithms were developed to match computing architecture using parallel processing. A landmark was the release of the Basic Local Alignment Search Tool (BLAST) in 1990. The importance of this tool, still with us today (https://blast.ncbi.nlm.nih.gov/Blast.cgi), is shown by its use in language as a verb: “to BLAST” a DNA sequence.
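The idea of counting "how many changes would be required to convert one sequence into the other" can be sketched with the classic edit distance (Levenshtein distance). This is an illustration of informational distance only, not how BLAST works: BLAST is a faster heuristic that searches for high-scoring local matches. The function name and the example sequences are assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Return the minimum number of single-letter substitutions,
    insertions, and deletions needed to turn sequence a into sequence b."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        current = [i]  # turning i letters into an empty string costs i deletions
        for j, letter_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (letter_a != letter_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Two hypothetical sequences that differ by a single letter.
print(edit_distance("GATTACA", "GACTACA"))
```

Identical sequences sit at distance zero, and every extra change pushes them further apart, which gives similarity the "shades and degrees" that a simple match-or-no-match test cannot express.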

Producing the human genome demanded a huge effort of sequencing, and the nature of this object appears to us as a sequence of letters. Alignment harnessed the many small bits of new sequences churned out by the project and moved them to their final assignment along the linkage map from which scientists were working. Alignment was the logic, the central cognitive labor reiterated millions of times. Alignment thinks in terms of similarity scores and focuses us on a different dimension for measuring distances: informational distance. The work of the Human Genome Project was reconciliation, conversion, and translation. It united three distances, a cubing of systems of measurement: physical distance (the sequence itself), genetic distance (the linkage map), and informational distance (alignment).
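As a toy picture of moving a small bit of sequence to its place along a longer one, the sketch below slides a fragment along a reference and keeps the position with the fewest mismatched letters. It is a deliberate simplification and not the project's actual pipeline, which is not described here: it ignores insertions, deletions, and repeated regions. The function name and both sequences are hypothetical.

```python
from typing import Tuple


def best_placement(reference: str, fragment: str) -> Tuple[int, int]:
    """Return (offset, mismatches) for the position in reference where
    fragment fits with the fewest mismatched letters.

    Simplification: only substitutions are considered (no gaps), and
    ties are resolved in favor of the leftmost position.
    """
    if not fragment or len(fragment) > len(reference):
        raise ValueError("fragment must be non-empty and no longer than reference")
    best_offset, best_mismatches = 0, len(fragment) + 1
    for offset in range(len(reference) - len(fragment) + 1):
        window = reference[offset:offset + len(fragment)]
        mismatches = sum(x != y for x, y in zip(window, fragment))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Hypothetical reference and a fragment carrying one "spelling error".
reference = "ACGTTGCAAGCT"
offset, mismatches = best_placement(reference, "GCTAG")
print(f"Fragment placed at position {offset} with {mismatches} mismatch(es)")
```

The offset is a physical distance along the sequence, while the mismatch count is an informational one. Even in this toy version, two of the three systems of measurement have to be reconciled to decide where a fragment belongs.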

Today, the human genome is easily accessed by a number of web resources, all of which dazzle with informational baubles that relate human biology to the raw sequence. The thing itself, this particular sequence, lives quietly in the background behind layers of code, interpretation, conversion, and comparison. These websites, contemporary portals to the human genome, reverberate with the convergence of those three dimensions of distance: physical, genetic, and informational.

Tim Fessenden is a scientific editor at the Journal of Cell Biology. He lives in Queens.

Notes

  1. https://www.nytimes.com/2003/04/15/science/once-again-scientists-say-human-genome-is-complete.html

  2. https://archive.nytimes.com/www.nytimes.com/library/national/science/062600sci-human-genome.html?amp;sq=francis%252520collins&st=cse&scp=23

  3. https://www.genome.gov/sites/default/files/media/files/2021-02/1994_First_Project_Goal_Met.pdf

