1. We prepared a list of 617 characterized human housekeeping genes.
   1.1 Eric obtained 1183 pairs for housekeeping genes from http://www.sagenet.org/ around October 2000. This list is available from us.
   1.2 Webb extracted 272131 pairs from ftp://ncbi.nlm.nih.gov/pub/sage/map/Hs/Nla3/SAGEmap_tag_ug-rel-Nla3-Hs (Homo sapiens UniGene Build #142).
   1.3 Webb removed all entries for tags mapped to more than one UniGene cluster and entries without a UniGene ID (marked "mito" or "ribo"), resulting in 174676 pairs.
   1.4 Webb determined 722 pairs for housekeeping genes by merging the results of steps 1.1 and 1.3.
   1.5 Webb extracted 15311 pairs from ftp://ncbi.nlm.nih.gov/repository/UniGene/Hs.data
   1.6 Webb determined 697 pairs by merging the output of steps 1.4 and 1.5.
   1.7 Webb removed duplicated gene names from the output of step 1.6, giving 617 pairs.
2. Webb found the genomic locations of as many of the 617 genes as possible. He searched the tables of RefSeq genes at the Genome Browser (Aug. 2001) to find putative positions of the 617 genes on specific chromosomes, i.e., chromosomes 1-22, X and Y. Genomic positions were not identified for 70 of the genes. Of the 547 that could be placed, 86 had more than one associated genomic location. In many cases, these were putative splice variants, but a number were genuine duplications. When two putative locations for the same gene differed in the number of exons, the program picked the one with more exons (e.g., to ignore processed pseudogenes); ties were broken arbitrarily.
3. Eric checked this list of duplicates. In 7 cases, the automatically generated genomic placement (step 2) was found to be inferior to another location, in some cases due to a misassembly of the genomic sequence. In 11 cases, none of the placements looked adequate (mostly because the mRNA sequence was poorly covered); we removed those genes from the list, leaving 536 genes.
4. Webb extended each RefSeq prediction in cases where the ENSEMBL prediction indicated additional exons or a longer UTR. In 141 cases, this increased the predicted gene length. The 5' end of each putative gene was checked for the presence of at least a weak CpG island (200 bp where CpG/GpC >= 0.6). 25 putative genes did not have such a CpG island.
5. Eric checked the genes whose UTRs were extended by at least 200 bp in step 4. In 21 cases, our automatic use of ENSEMBL data appeared to be incorrect; the RefSeq prediction was used in those cases. In 5 cases, the number of exons in the ENSEMBL prediction differed from RefSeq; we discarded those genes. Eric also checked the genes lacking a CpG island. All but 8 were found to be problematic and were discarded. For the remaining 8 genes, it was possible to use known mRNA to extend the gene to reach a CpG island; this was done manually.
6. Webb wrote a program to predict neighbors and nested genes.
7. Eric checked these predictions and found 23 genes that should be deleted.
8. Webb checked the reported quality of the genomic sequence in the segment extending 10 kb on either side of each gene. The human assembly available at the Genome Browser indicated both small and large discontinuities. Apparently, small gaps lie between contigs from the same genomic clone, and large gaps lie between non-overlapping clones in the "tile path". We decided to restrict further analysis to the cases where no gaps were reported within 10 kb of the gene. This left 354 genes.
9. Webb wrote a program to add the length of the 3' UTRs to the table.
10. Webb wrote a program to add CpG island length and position relative to the start site to the table.
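
The automatic rules in steps 2, 4 and 8 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the programs Webb wrote: the `Placement` record, the function names, the 0-based half-open coordinates, the sliding-window scan, and the treatment of windows with no GpC dinucleotides are all hypothetical choices made for the example.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Placement(NamedTuple):
    """One putative genomic location for a gene (hypothetical record layout)."""
    gene: str
    chrom: str
    start: int  # assumed 0-based, half-open coordinates
    end: int
    exon_count: int


def pick_placements(candidates: Iterable[Placement]) -> Dict[str, Placement]:
    """Step 2 rule: per gene, keep the placement with the most exons.

    Ties keep the first candidate seen, standing in for the arbitrary
    tie-breaking mentioned in the text.
    """
    best: Dict[str, Placement] = {}
    for p in candidates:
        current = best.get(p.gene)
        if current is None or p.exon_count > current.exon_count:
            best[p.gene] = p
    return best


def has_weak_cpg_island(seq: str, window: int = 200, min_ratio: float = 0.6) -> bool:
    """Step 4 rule: is there a 200 bp window with CpG/GpC >= 0.6?

    Assumption: a window needs at least one CpG to qualify, and the ratio
    test is written as cpg >= min_ratio * gpc to avoid dividing by zero.
    Sequences shorter than one window never qualify.
    """
    seq = seq.upper()
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        cpg = chunk.count("CG")  # CpG dinucleotides in this window
        gpc = chunk.count("GC")  # GpC dinucleotides in this window
        if cpg > 0 and cpg >= min_ratio * gpc:
            return True
    return False


def gap_free(gene_start: int, gene_end: int,
             gaps: List[Tuple[int, int]], flank: int = 10_000) -> bool:
    """Step 8 rule: True if no assembly gap overlaps the gene plus 10 kb flanks.

    `gaps` holds (start, end) intervals on the gene's chromosome.
    """
    lo = max(0, gene_start - flank)
    hi = gene_end + flank
    return not any(g_start < hi and g_end > lo for g_start, g_end in gaps)
```

A caller would pass the 5' end sequence of each gene to `has_weak_cpg_island` and the chromosome's reported gaps to `gap_free`. The manual checks in steps 3, 5 and 7 are not modeled here.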