- Kimura K, et al. (2006) "Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes." Genome Res. 16(1):55-65. PMID:16344560 By analyzing 1,780,295 5'-end sequences of human full-length cDNAs derived from 164 kinds of oligo-cap cDNA libraries, we identified 269,774 independent positions of transcriptional start sites (TSSs) for 14,628 human RefSeq genes. These TSSs were clustered into 30,964 clusters that were separated from each other by more than 500 bp and thus are very likely to constitute mutually distinct alternative promoters. To our surprise, at least 7,674 (52%) human RefSeq genes were subject to regulation by putative alternative promoters (PAPs). On average, there were 3.1 PAPs per gene, with the composition of one CpG-island-containing promoter per 2.6 CpG-less promoters. In 17% of the PAP-containing loci, tissue-specific use of the PAPs was observed. The richest tissue sources of the tissue-specific PAPs were testis and brain. It was also intriguing that the PAP-containing promoters were enriched in the genes encoding signal transduction-related proteins and were rarer in the genes encoding extracellular proteins, possibly reflecting the varied functional requirement for and the restricted expression of those categories of genes, respectively. The patterns of the first exons were highly diverse as well. On average, there were 7.7 different splicing types of first exons per locus partly produced by the PAPs, suggesting that a wide variety of transcripts can be achieved by this mechanism. Our findings suggest that use of alternate promoters and consequent alternative use of first exons should play a pivotal role in generating the complexity required for the highly elaborated molecular systems in humans.
- Gerhard DS, et al. (2004) "The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC)." Genome Res. 14(10B):2121-2127. PMID:15489334 The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
- Edgar AJ, et al. (2002) "Cloning and tissue distribution of three murine alpha/beta hydrolase fold protein cDNAs." Biochem Biophys Res Commun. 292(3):617-625. PMID:11922611 We have cloned 3 novel murine cDNAs encoding proteins containing an alpha/beta hydrolase fold, a catalytic domain found in a very wide range of enzymes. These proteins belong to the PROSITE UPF0017 uncharacterized protein family and we have named them lung alpha/beta hydrolase 1, 2, and 3 (LABH) since they were cloned from lung cDNA. All have 9 coding exons, encoding 412, 425, and 411 residue proteins respectively (46-48 kDa); LABH1 is closely related to LABH3, with 45% identity. All 3 proteins have a single predicted amino-terminus transmembrane domain. An alignment of family members from different phyla enabled the identification of the LABH1 catalytic triad as Ser211, Asp337, and His366. mRNA expression levels of LABH1 and 3 were highest in liver and LABH2 highest in testis. These findings suggest that the LABH proteins consist of a novel family of membrane bound enzymes whose function has yet to be determined.
- Strausberg RL, et al. (2002) "Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences." Proc Natl Acad Sci U S A. 99(26):16899-16903. PMID:12477932 The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).
- Yu W, et al. (1997) "Large-scale concatenation cDNA sequencing." Genome Res. 7(4):353-358. PMID:9110174 A total of 100 kb of DNA derived from 69 individual human brain cDNA clones of 0.7-2.0 kb were sequenced by concatenated cDNA sequencing (CCS), whereby multiple individual DNA fragments are sequenced simultaneously in a single shotgun library. The method yielded accurate sequences and a similar efficiency compared with other shotgun libraries constructed from single DNA fragments (>20 kb). Computer analyses were carried out on 65 cDNA clone sequences and their corresponding end sequences to examine both nucleic acid and amino acid sequence similarities in the databases. Thirty-seven clones revealed no DNA database matches, 12 clones generated exact matches (≥98% identity), and 16 clones generated nonexact matches (57%-97% identity) to either known human or other species genes. Of those 28 matched clones, 8 had corresponding end sequences that failed to identify similarities. In a protein similarity search, 27 clone sequences displayed significant matches, whereas only 20 of the end sequences had matches to known protein sequences. Our data indicate that full-length cDNA insert sequences provide significantly more nucleic acid and protein sequence similarity matches than expressed sequence tags (ESTs) for database searching.
- Andersson B, et al. (1996) "A 'double adaptor' method for improved shotgun library construction." Anal Biochem. 236(1):107-113. PMID:8619474 The efficiency of shotgun DNA sequencing depends to a great extent on the quality of the random-subclone libraries used. We here describe a novel "double adaptor" strategy for efficient construction of high-quality shotgun libraries. In this method, randomly sheared and end-repaired fragments are ligated to oligonucleotide adaptors creating 12-base overhangs. Nonphosphorylated oligonucleotides are used, which prevents formation of adaptor dimers and ensures efficient ligation of insert to adaptor. The vector is prepared from a modified M13 vector, by KpnI/PstI digestion followed by ligation to oligonucleotides with ends complementary to the overhangs created in the digest. These adaptors create 5'-overhangs complementary to those on the inserts. Following annealing of insert to vector, the DNA is directly used for transformation without a ligation step. This protocol is robust and shows three- to fivefold higher yield of clones compared to previous protocols. No chimeric clones can be detected and the background of clones without an insert is <1%. The procedure is rapid and shows potential for automation.