7.344 Genomics and bioinformatics of gene expression

MIT Advanced Undergraduate Seminar

Fall 2003


Gabriel Kreiman, PhD (kreiman@mit.edu) [From the laboratory of Tommaso Poggio] E25-201 – [3-0547]

Uwe Ohler, PhD (ohler@mit.edu) [From the laboratory of Chris Burge] 68-223 – [3-7039]

Time: Wednesdays, 1-3 pm.

Place: Room 68-151

Units: 2-0-4 (P/D/F)

Abstract: A large number of both normal and disease biological processes depend on specific spatial and temporal patterns of the expression of particular genes or groups of genes. The recent availability of DNA sequence information (from humans and other organisms) as well as high-throughput methods for the analysis of gene expression data (e.g., from microarrays) allow us to use computational algorithms to study gene expression (transcription). In this seminar we will focus on transcriptional initiation, regulation, and networks and on how expression levels are measured using high-throughput techniques. We will discuss recent advances in the methods of genomics and bioinformatics available to a biologist interested in this area of research. Many of these tools also have applications in other areas of biological research.

Format: This course seeks to familiarize you with high-throughput techniques and algorithms used to investigate transcription in gene regulation and development. This will be achieved through the reading and discussion of primary research papers, and related assignments. At the end of each lecture, we will give a brief introduction to the papers assigned for the following class. While most of the papers will be recent, we will also examine a few landmark papers. We will also take a field trip in the second half of the course to the Whitehead Institute Genome Center, to see the sequencing and DNA microarray facilities where many of today’s important large-scale datasets come from.

Grading: This is a pass/fail seminar. Your grade will be determined by your attendance, participation in discussion, and your completion of the assignments.

Course work:

1) Active participation in class discussion and presentation of papers. Attendance is required for every class. If a student must absolutely miss a class, she or he shall write and hand in 1 page (12 pt, double spaced), indicating (a) a short summary comparing both papers, and (b) a critique of one aspect of one paper.

2) Two short reports (2-3 pages, 12 pt, double spaced) with a comparison of material covered in 4 previous sessions and an outline of a follow-up experiment to one paper. (1st assignment: due 10-08, 2nd assignment: due 11-12)

Classes and papers:

1 09-03 Introduction to Transcription

Basic aspects of the biology of transcription. Basic concepts and tools in bioinformatics.

No paper discussions this week

2 09-10 Introductory papers, historical perspective

A historical overview of the initial steps in understanding transcription and its regulation. We introduce here the lac operon and the paradigm-shift discoveries of Jacob and Monod in the late 50s and early 60s. We also discuss the discovery of the TATA box signal for transcription initiation.

2.1 Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318-356, 1961.

2.2 Lee DC, Roeder RG, Wold WS. DNA sequences affecting specific initiation of transcription in vitro from the EIII promoter of adenovirus 2. Proc Natl Acad Sci USA. 79:41-45, 1982.

3 09-17 Transcription factors and binding sites

An introduction to the notion of transcription factors, their binding sites and their standard representation when using computers. We will look at the first large-scale computational analysis of core promoter regions and general transcription factors, and at a recent example of interactions between core promoters and distal regulatory sites.

3.1 Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 212:563-578, 1990.

3.2 Conkright MD, Guzman E, Flechner L, Su AI, Hogenesch JB, Montminy M. Genome-wide analysis of CREB target genes reveals a core promoter requirement for cAMP responsiveness. Mol Cell 11:1101-1108, 2003.
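
As a rough illustration of the weight matrix representation introduced in this session (a minimal sketch, not the procedure of either assigned paper), the Python code below builds a log-odds matrix from a few aligned sites and uses it to score sequence windows. The example sites are invented, and the pseudocount of 1 and the uniform background frequency of 0.25 are assumptions made only for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per aligned column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum the column scores for a sequence as long as the matrix is wide."""
    return sum(col[base] for col, base in zip(pwm, seq))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    hits = [(score(pwm, seq[i:i + width]), i)
            for i in range(len(seq) - width + 1)]
    return max(hits)

# Hypothetical TATA-like sites, invented for illustration only.
EXAMPLE_SITES = ["TATAAA", "TATAAA", "TATATA", "TATAAG", "CATAAA"]

if __name__ == "__main__":
    pwm = build_pwm(EXAMPLE_SITES)
    print(best_hit(pwm, "GGGCTATAAAGGC"))
```

Scanning a longer sequence with `best_hit` reports the window that best matches the matrix; in practice a score threshold would be needed to decide which hits to report.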

4 09-24 Transcription initiation

How can we use computers to model core promoter regions and predict transcription start sites of protein coding genes? Two recent examples show that this problem is much trickier than originally thought and demonstrate the wide spectrum of models for promoter regions.

4.1 Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet 29:412-417, 2001.

4.2 Bajic VB, Chong A, Seah SH, Brusic V. Intelligent System for Vertebrate Promoter Recognition, IEEE Intelligent Systems, 17:64-70, 2002.

5 10-01 Transcription initiation, experimental validation

Turning from computational prediction to experimental verification, we look at two examples describing the large-scale sequencing of 5’ full-length cDNAs, as well as verification of the activity of cDNA-derived putative promoter regions.

5.1 Suzuki Y, Tsunoda T, Sese J, Taira H, Mizushima-Sugano J, Hata H, Ota T, Isogai T, Tanaka T, Nakamura Y, Suyama A, Sakaki Y, Morishita S, Okubo K, Sugano S. Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 11:677-684, 2001.

5.2 Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM. Identification and functional analysis of human transcriptional promoters. Genome Res. 13:308-12, 2003.

6 10-08 Cis-regulatory codes. Clusters of known elements

Single transcription factors may bind to large numbers of sites in a genome and therefore are unlikely to account for complex regulatory networks. Here we introduce the concept of combinatorial regulation and show some examples in different species and systems of clustering of transcription factors. We will also discuss computational algorithms to search for clusters of known transcription factors.

6.1 Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278:167-181, 1998.

6.2 Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 99:757-762, 2002.
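
To make the idea of searching for clusters of known binding sites concrete, here is a minimal Python sketch, assuming the positions of predicted sites along a sequence are already available. It is not the method of either assigned paper; the function name, the 700 bp window, and the threshold of 3 sites are arbitrary choices for illustration.

```python
def find_site_clusters(positions, window=700, min_sites=3):
    """Return (start, end, n_sites) for regions where at least min_sites
    predicted sites fall within `window` bp; overlapping regions are merged."""
    pos = sorted(positions)
    regions = []
    j = 0
    for i, start in enumerate(pos):
        j = max(j, i)
        # Extend j to the last site that is still within the window.
        while j + 1 < len(pos) and pos[j + 1] - start <= window:
            j += 1
        if j - i + 1 >= min_sites:
            if regions and start <= regions[-1][1]:
                # Overlaps the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], pos[j]))
            else:
                regions.append((start, pos[j]))
    # Count the sites inside each merged region.
    return [(s, e, sum(s <= p <= e for p in pos)) for s, e in regions]

# Hypothetical site positions (in bp), invented for illustration only.
EXAMPLE_POSITIONS = [100, 250, 400, 5000, 9000, 9100, 9300, 9650]

if __name__ == "__main__":
    print(find_site_clusters(EXAMPLE_POSITIONS))
```

An isolated site (such as the one at position 5000 in the made-up data) is ignored, which reflects the point that single sites are common while dense clusters are comparatively rare.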

7 10-15 Microarrays

High-throughput techniques to study gene expression currently provide large amounts of quantitative data about the transcriptional activity of thousands of genes. Here we discuss the basic principles behind microarrays and how they can be used to study transcription.

7.1 DeRisi J, Iyer V, Brown P. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680-686, 1997.

7.2 Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 95:14863-14868, 1998.
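
Cluster analysis of expression data starts from a measure of how similar two expression profiles are. The Python sketch below computes the Pearson correlation between profiles and finds the most similar pair of genes, which is the first step a simple agglomerative clustering would take. It is only an illustration under stated assumptions: the gene names and values are invented, and the exact similarity measure used in the assigned papers may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)

def most_similar_pair(profiles):
    """Return (correlation, gene_a, gene_b) for the best-correlated pair."""
    names = sorted(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if best is None or r > best[0]:
                best = (r, a, b)
    return best

# Hypothetical log-ratio profiles over four time points (invented data).
EXAMPLE_PROFILES = {
    "geneA": [0.1, 0.5, 1.2, 2.0],
    "geneB": [0.2, 0.6, 1.1, 2.1],
    "geneC": [2.0, 1.0, 0.3, -0.5],
}

if __name__ == "__main__":
    print(most_similar_pair(EXAMPLE_PROFILES))
```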

8 10-22 Field Trip

Visit to MIT’s Whitehead Institute Genome Center, which played a major role in the Human Genome Project, to see DNA sequencing and microarray facilities in action. We will observe up-close how some of the important data for the studies described in the class are acquired.

PAPERS: No papers for this week.

9 10-22 Motif finding

Given a set of putatively co-regulated genes, how can we infer what the possible regulatory sequences are? This problem involves the extraction of common sequence patterns from a given set of sequences and is also related to the question of multiple sequence alignments. Here we discuss two different algorithms to attempt to discover common nucleotide signals in a set of sequences.

9.1 Hughes JD, Estep PW, Tavazoie S, Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296:1205-14, 2000.

9.2 Bussemaker HJ, Li H, Siggia ED. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci USA. 97:10096-10100, 2000.
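
The simplest version of the motif finding problem can be sketched by enumerating short words (k-mers) and asking which ones occur in most of the sequences in a set. The Python code below does exactly that. It is a deliberately naive illustration, not the algorithm of either assigned paper: it ignores mismatches and background frequencies, and the sequences and function names are invented for this example.

```python
from collections import Counter

def kmer_sequence_counts(seqs, k):
    """Count, for every k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        # A set ensures each k-mer is counted at most once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def shared_kmers(seqs, k, min_fraction=1.0):
    """Return the sorted k-mers found in at least min_fraction of the sequences."""
    counts = kmer_sequence_counts(seqs, k)
    needed = min_fraction * len(seqs)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)

# Hypothetical upstream sequences, invented for illustration only.
EXAMPLE_SEQS = ["TTGACGCGTAA", "CCACGCGTGGA", "GACGCGTTTCC"]

if __name__ == "__main__":
    print(shared_kmers(EXAMPLE_SEQS, 6))
```

A realistic analysis would compare these counts against what is expected by chance in the genome, since short words can be shared by unrelated sequences.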

10 10-29 Comparative genomics

A common principle in biology is that functionally important sequences are conserved through evolution, while nonfunctional sequences are free to drift randomly. This principle has been used to study the conservation of coding sequences and can also be applied to study and discover regulatory signals.

10.1 Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE. Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26:225-228, 2000.

10.2 Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241-254, 2003.

11 11-12 Transcription and development

Large-scale sequencing and microarray projects allow us to study complex biological processes. We look at two specific examples from Drosophila development: In the first, researchers gathered expression data of thousands of genes together with images of the localization of the mRNAs in the developing early embryo; the second deals with expression during the whole life cycle and tissue development.

11.1 Tomancak P, Beaton A, Weiszmann R, Kwan E, Shu S, Lewis SE, Richards S, Ashburner M, Hartenstein V, Celniker SE, Rubin GM. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 3:RESEARCH0088, 2002.

11.2 Reinke V, Smith HE, Nance J, Wang J, Van Doren C, Begley R, Jones SJ, Davis EB, Scherer S, Ward S, Kim SK. A global profile of germline gene expression in C. elegans. Mol Cell 6:605-616, 2000.

12 11-19 Large scale measurements of binding sites

A large number of proteins are known or predicted to be transcription factors, but their specific targets and preferred binding sequences remain unknown. This session focuses on experimental approaches to determine the preferred DNA sequences interacting with a single factor, or the coarse-grained location of many factors throughout a whole genome.

12.1 Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P. High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol. 20:831-5, 2002.

12.2 Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA. Genome-wide location and function of DNA binding proteins. Science 290:2306-2309, 2000.

13 11-26 Networks

Given the vast amounts of expression, sequence and functional information currently available, it is possible to start making inferences about modules of genes that work together and how they are regulated. These modules can in turn interact and regulate the expression and function of other modules. Here we study some recent examples of predictions about regulatory networks.

13.1 Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 34:166-176, 2003.

13.2 Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799-804, 2002.

14 12-03 Networks and models

A model summarizes information in a succinct, usually mathematical, formulation that allows one to distill the underlying concepts and general principles and to make new predictions. Here we discuss some of the computational models of genetic circuits, the difficulties involved in proposing and evaluating models, and future directions in the field.

14.1 Yuh CH, Bolouri H, Davidson EH. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279:1896-902, 1998.

14.2 Kerszberg M, Changeux JP. A Model For Reading Morphogenetic Gradients - Autocatalysis and Competition At the Gene Level. Proc Natl Acad Sci USA. 91:5823-5827, 1994.

15 12-10 Wrap-up. Everything we did not tell you so far. Pizza etc