- Ota T, et al. (2004) "Complete sequencing and characterization of 21,243 full-length human cDNAs." Nat Genet. 36(1):40-45. PMID:14702039 As a base for human transcriptome and functional genomics, we created the "full-length long Japan" (FLJ) collection of sequenced human cDNAs. We determined the entire sequence of 21,243 selected clones and found that 14,490 cDNAs (10,897 clusters) were unique to the FLJ collection. About half of them (5,416) seemed to be protein-coding. Of those, 1,999 clusters had not been predicted by computational methods. The distribution of GC content of nonpredicted cDNAs had a peak at approximately 58% compared with a peak at approximately 42% for predicted cDNAs. Thus, there seems to be a slight bias against GC-rich transcripts in current gene prediction procedures. The rest of the cDNAs unique to the FLJ collection (5,481) contained no obvious open reading frames (ORFs) and thus are candidate noncoding RNAs. About one-fourth of them (1,378) showed a clear pattern of splicing. The distribution of GC content of noncoding cDNAs was narrow and had a peak at approximately 42%, relatively low compared with that of protein-coding cDNAs.
- Suzuki Y, et al. (2004) "Sequence comparison of human and mouse genes reveals a homologous block structure in the promoter regions." Genome Res. 14(9):1711-1718. PMID:15342556 Comparative sequence analysis was carried out for the regions adjacent to experimentally validated transcriptional start sites (TSSs), using 3324 pairs of human and mouse genes. We aligned the upstream putative promoter sequences over the 1-kb proximal regions and found that the sequence conservation could not be further extended at, on average, 510 bp upstream positions of the TSSs. This discontinuous manner of the sequence conservation revealed a "block" structure in about one-third of the putative promoter regions. Consistently, we also observed that G+C content and CpG frequency were significantly different inside and outside the blocks. Within the blocks, the sequence identity was uniformly 65% regardless of their length. About 90% of the previously characterized transcription factor binding sites were located within those blocks. In 46% of the blocks, the 5' ends were bounded by interspersed repetitive elements, some of which may have nucleated the genomic rearrangements. The length of the blocks was shortest in the promoters of genes encoding transcription factors and of genes whose expression patterns are brain specific, which suggests that the evolutional diversifications in the transcriptional modulations should be the most marked in these populations of genes.
- Ohara O, et al. (2002) "Characterization of size-fractionated cDNA libraries generated by the in vitro recombination-assisted method." DNA Res. 9(2):47-57. PMID:12056414 We here modified a previously reported method for the construction of cDNA libraries by employing an in vitro recombination reaction to make it more suitable for comprehensive cDNA analysis. For the evaluation of the modified method, sets of size-selected cDNA libraries of four different mouse tissues and human brain were constructed and characterized. Clustering analysis of the 3' end sequence data of the mouse cDNA libraries indicated that each of the size-fractionated libraries was complex enough for comprehensive cDNA analysis and that the occurrence rates of unidentified cDNAs varied considerably depending on their size and on the tissue source. In addition, the end sequence data of human brain cDNAs thus generated showed that this method decreased the occurrence rates of chimeric clones by more than fivefold compared to conventional ligation-assisted methods when the cDNAs were larger than 5 kb. To further evaluate this method, we entirely sequenced 13 human unidentified cDNAs, named KIAA1990-KIAA2002, and characterized them in terms of the predicted protein sequences and their expression profiles. Taking all these results together, we here conclude that this new method for the construction of size-fractionated cDNA libraries makes it possible to analyze cDNAs efficiently and comprehensively.
- Strausberg RL, et al. (2002) "Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences." Proc Natl Acad Sci U S A. 99(26):16899-16903. PMID:12477932 The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).