Advanced PipMaker Instructions

Advanced PipMaker processes the contents of the same four files as does the basic form of PipMaker. The pip begins with a one-line overview of the first sequence. If an underlay file was given, the first row of the overview shows it. The bottom (and perhaps only) row of the overview shows aligned regions in green and strongly aligned regions (at least 100 bp without a gap and with at least 70% nucleotide identity) in red. Legends for the icons representing interspersed repeats and the meanings of colors in the underlay and annotation files (see below) are placed on separate pages. In addition, a number of choices, options, and new capabilities are provided.

TABLE OF CONTENTS

Option: color
Option: annotations with hyperlinks
Pick one: Search one strand or Search both strands
Pick one: Show all matches, Chaining, or Single coverage
Option: high sensitivity and low time limit
Select forms of output
Analysis of exons
Raw blastz output
Order and orient contigs
PDF with embedded contig names
User-controlled masking of interspersed repeats and low-complexity regions

Option: color

Regions of interest in the first sequence, such as the locations of exons or regulatory elements, can be colored by vertical stripes. To use this feature of Advanced PipMaker, upload a first sequence underlay file, which should begin with lines like:

Red Strongly_conserved

followed by lines like:

1000 2000 Strongly_conserved

The first set of lines describes the intended interpretation of each color, while the second tells where the colored stripes are to be placed. In color descriptions, colors (the first word on each line) must be selected from a restricted list. The second word on each of these lines is chosen arbitrarily by the user. It is possible to paint just the upper half of a vertical stripe by requesting, e.g.:

50060 50227 Strongly_conserved +

which was done for the BTK pip. Use "-" to paint just the lower half. This permits annotations to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches. (Painting over a stripe, as with the EST region in

1200 1500 EST
1000 2000 Strongly_conserved

will make it invisible.) Beware that the appearance of the colors will vary from one printer or monitor to the next.
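As an illustration of the file layout just described, here is a minimal Python sketch that assembles underlay text from a list of color definitions and a list of stripes. The function name `format_underlay` and its argument shapes are hypothetical and not part of PipMaker. The color name "Green" is an assumption; check it against the restricted color list before use.

```python
def format_underlay(colors, stripes):
    """Build the text of a first-sequence underlay file.

    colors  : list of (color, label) pairs, e.g. ("Red", "Strongly_conserved")
    stripes : list of (start, end, label) or (start, end, label, half) tuples,
              where half is "+" (upper half only) or "-" (lower half only)
    """
    labels = {label for _, label in colors}
    # Color descriptions come first, one "Color Label" pair per line.
    lines = [f"{color} {label}" for color, label in colors]
    for stripe in stripes:
        start, end, label = stripe[:3]
        half = stripe[3] if len(stripe) > 3 else None
        if label not in labels:
            raise ValueError(f"undefined label: {label}")
        if start > end:
            raise ValueError(f"start after end: {start} {end}")
        if half not in (None, "+", "-"):
            raise ValueError(f"half must be '+' or '-': {half}")
        fields = [str(start), str(end), label]
        if half:
            fields.append(half)
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


text = format_underlay(
    [("Red", "Strongly_conserved"), ("Green", "EST")],  # "Green" is assumed
    [
        (1000, 2000, "Strongly_conserved"),         # wide stripe listed first
        (1200, 1500, "EST"),                        # narrower stripe listed later
        (50060, 50227, "Strongly_conserved", "+"),  # upper half only
    ],
)
print(text)
```

In this sketch the wide stripe is listed before the narrow one, which is the reverse of the order in the EST example above, where the later, wider stripe paints over the EST region.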

Option: annotations with hyperlinks

Advanced PipMaker accepts a file containing annotations that associate World-Wide Web hyperlinks with specified positions in the first sequence. The resulting PDF file (example) is decorated with colored bars, which are clickable when viewed using Acrobat Reader. PipMaker produces these links if the user supplies an annotation file.

The user-supplied file (example) defines various types of hyperlinks and associates a color with each of them, then specifies the type, position, description, and URL for each annotated feature. For instance, in the example,

%define type
%name PubMed
%color Blue

requests that each feature identified as a PubMed entry be colored blue. The name must be a single word, perhaps containing underline characters, as in Entry_in_GenBank. Colors start with capital letters. The subsequent entry

%define annotation
%type PubMed
%range 1 2000
%label Yang et al. 1997.  Daxx, a novel Fas-binding protein...
%summary Yang, X., Khosravi-Far, R. Chang, H., and Baltimore, D. (1997).
  Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
  Cell 89(7):1067-76.
  Click to see Abstract.
%url http://www.ncbi.nlm.nih.gov:80/entrez/
query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract

associates a PubMed annotation with positions 1-2000 in the first sequence. Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not. If the summary is omitted, it is assumed to be the same as the label. Annotations can overlap, as shown in this example. Each %define line except for the first one should be immediately preceded by a blank line.
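The stanza rules above can be sketched in Python as follows. This is only an illustration: `format_annotations` and its dictionary keys are hypothetical names, and the sketch avoids the continuation-line pitfalls by writing each summary and URL on a single line.

```python
def format_annotations(types, annotations):
    """Build annotation-file text.

    types       : list of (name, color) pairs
    annotations : list of dicts with keys "type", "start", "end", "label",
                  "url" and, optionally, "summary"
    """
    known = set()
    stanzas = []
    for name, color in types:
        if len(name.split()) != 1:
            raise ValueError(f"type name must be a single word: {name!r}")
        known.add(name)
        stanzas.append("\n".join(["%define type", f"%name {name}", f"%color {color}"]))
    for ann in annotations:
        if ann["type"] not in known:
            raise ValueError(f"undefined type: {ann['type']}")
        if "\n" in ann["label"]:
            raise ValueError("labels cannot be broken across lines")
        lines = [
            "%define annotation",
            f"%type {ann['type']}",
            f"%range {ann['start']} {ann['end']}",
            f"%label {ann['label']}",
        ]
        summary = ann.get("summary")
        if summary:
            # Collapse any line breaks into single spaces so words stay separated.
            lines.append("%summary " + " ".join(summary.split()))
        # Remove all whitespace from the URL, mirroring how continuation
        # lines are joined without inserting spaces.
        lines.append("%url " + "".join(ann["url"].split()))
        stanzas.append("\n".join(lines))
    # A blank line precedes every %define line except the first.
    return "\n\n".join(stanzas) + "\n"


print(format_annotations(
    [("PubMed", "Blue")],
    [{
        "type": "PubMed", "start": 1, "end": 2000,
        "label": "Yang et al. 1997.  Daxx, a novel Fas-binding protein...",
        "url": "http://www.ncbi.nlm.nih.gov:80/entrez/\n"
               "query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract",
    }],
))
```

Omitting the "summary" key relies on the rule that a missing summary defaults to the label.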

Clicking on this blue bar will bring up a page that displays a description of the feature (as given in the %summary stanza), including a "hand" icon that can be clicked on to visit the specified URL. (Within the PDF document, the Acrobat Reader's navigational controls should be used; e.g., the web browser's Back button will not return to the PIP.)

Pick one: Search one strand or Search both strands

The alignment program can optionally report only regions of similarity between the first sequence and the second sequence in its given orientation. The default is to align the first sequence with both the second sequence and the reverse complement of the second sequence.

Pick one: Show all matches, Chaining, or Single coverage

With the default setting of "Show all matches", it is possible for one region of the first sequence to align with several regions of the second sequence because of duplications of a gene or an exon, or because of incomplete masking of interspersed repeats or low-complexity regions. Such duplications cause lines to appear one over the other in the pip. PipMaker provides two options for eliminating such duplicate matches, each with its own strengths and weaknesses.

If the "Chaining" option is chosen, then PipMaker will identify and plot only matches that appear in the same relative order in the first and second sequences. This option should be used only if the genomic structures of the two sequences are known to be conserved, since otherwise a duplication might avoid detection. With Chaining, the alignment program is run with lower thresholds (i.e., higher sensitivity).

For an example of chaining, consider the first pip shown below, which is taken from a larger pip based on a 31 kb sequence. As the pip indicates, exon 7 of the first sequence has a number of matches in the second sequence, presumably due to duplications of that exon or of the entire gene. The dotplot view of the entire alignment, shown at the left of the second row, also indicates the duplication (at around position 7000 on the horizontal axis), as well as duplications in later exons. The panels on the right show the results of specifying the "Chaining" option.

An alternative method for avoiding duplicate matches is provided by the "Single coverage" option, which selects a highest-scoring set of alignments such that any position in the first sequence can appear in at most one alignment (though there is no guarantee that the order of matching regions is identical in the two sequences). The following three dotplots show a case where this option works better than does chaining. The top panel shows all matches in a gene cluster where the first sequence has six copies of the gene, while the second sequence has four. With chaining, only four of the genes can be matched, as shown in the second panel. The "Single coverage" option selects one match for each region of the first sequence (panel 3).

Further discussion of this example can be found on the Examples page, under beta-globin.

Chaining is preferable to the "Single coverage" option in cases where (1) the second sequence is contiguous, (2) the comparison is with just a single strand of the second sequence, and (3) the order of conserved regions is identical in the two sequences, since under conditions 1-3 the results from the "Chaining" option will be more biologically meaningful. (Also, the comparison will be faster.) However, the three-panel example shows that if condition (3) does not hold, the "Chaining" option may give inferior results. A strength of the "Single coverage" option is that it guarantees single coverage even if the second sequence is compared in both orientations, or if it is fragmented (see below). In such cases, the "Chaining" option is applied separately each time the first sequence is compared with an orientation or fragment of the second sequence, so multiple coverage can result.
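The difference between the two options can be illustrated with a toy Python sketch. The two functions below are textbook dynamic-programming stand-ins chosen for illustration, not PipMaker's actual algorithms, which are not described here. The names `chain` and `single_coverage`, the tuple layout, and the scores are all assumptions. The data model a first sequence with three gene copies and a second sequence with two, so every copy in one matches every copy in the other.

```python
def chain(alns):
    """Highest-scoring set of alignments that occur in the same order in
    both sequences. Each alignment is (s1_start, s1_end, s2_start, s2_end, score)."""
    alns = sorted(alns)  # by start in the first sequence
    best = []            # best[i] = (score, chain) for the best chain ending at alns[i]
    for i, a in enumerate(alns):
        top = (a[4], [a])
        for j in range(i):
            b = alns[j]
            # b may precede a only if it ends before a starts in BOTH sequences
            if b[1] < a[0] and b[3] < a[2]:
                cand = (best[j][0] + a[4], best[j][1] + [a])
                if cand[0] > top[0]:
                    top = cand
        best.append(top)
    return max(best, key=lambda t: t[0])[1] if best else []


def single_coverage(alns):
    """Highest-scoring set of alignments in which no position of the first
    sequence is used twice (weighted interval scheduling on sequence 1)."""
    alns = sorted(alns, key=lambda a: a[1])  # by end in the first sequence
    best = [(0, [])]                         # best[k] = optimum over the first k alignments
    for i, a in enumerate(alns):
        k = i
        # walk back past alignments that overlap a in the first sequence
        while k > 0 and alns[k - 1][1] >= a[0]:
            k -= 1
        take = (best[k][0] + a[4], best[k][1] + [a])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Three gene copies in sequence 1, two in sequence 2, all pairs aligned.
alns = [(s1, s1 + 100, s2, s2 + 100, 100)
        for s1 in (100, 300, 500) for s2 in (100, 300)]
print(len(chain(alns)))            # 2: order must be preserved in both sequences
print(len(single_coverage(alns)))  # 3: one match per region of sequence 1
```

As in the gene-cluster example, the chained result cannot match more copies than the shorter cluster contains, while the single-coverage result gives every region of the first sequence one match without preserving order in the second sequence.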

Option: high sensitivity and low time limit

The default settings of PipMaker are tuned to perform well when comparing two mammalian sequences. The option for high sensitivity works better for a sequence pair at a greater evolutionary distance, such as human-fugu. However, if two rather similar sequences are compared with this setting (e.g., human-mouse), the alignment program can run much longer than desired, so we have set the server to terminate execution after a few minutes. This is long enough to permit, say, a successful human-fugu alignment over two BACs.

Select forms of output

The user selects output files from the following list: the pip, a one-page dotplot view of the alignments, the condensed form of the alignment, the traditional textual form of the alignment, an analysis of the exons, raw blastz output, and files predicting the order and orientation of the contigs (assuming the second sequence file is broken into contigs). The last three forms of output are described in more detail below.

The user can supply an optional title for the pip. (By default, the title is taken from the first line of the exons file.) The user can request that the output be in PostScript format instead of PDF format.

Analysis of exons

For the optional analysis of exons, a program attempts to use the alignments to map each position in the "exons" file into the corresponding position in the second sequence. For each coding-region specification (i.e., line beginning with "+") the first and last codons are displayed, and for each intron the first and last two nucleotides are shown. These tri- and dinucleotides are shown for the first sequence and, if possible, for the second sequence. If the coding region is specified, then the putative coding region is printed (for both sequences, if possible). For example, if the exons file contains:

< 27591 30475 L44L
+ 27641 30438
27591 27661
27932 28054
29660 29727
29918 30023
30436 30475

then the program might generate:

< L44L: 27591-30475     35159-37479
  CDS:  27641-30438     35202-37442     TTA-CAT TTA-CAT
   5:   27591-27661     35159-35222     CT-AC   CT-AC
   4:   27932-28054     35586-35708     CT-AC   CT-AC
   3:   29660-29727     36581-36648     CT-AC   CT-AC
   2:   29918-30023     36976-37081     CT-AC   CT-AC
   1:   30436-30475     37440-37479

>L44L, putative CDS for sequence 1 (321 bp)
ATGGTTAACGTCCCTAAAACCCGCCGGACTTTCTGTAAGAAGTGTGGCAA
GCACCAACCCCATAAAGTGACACAGTACAAGAAGGGCAAGGATTCTCTGT
  ....

This output shows that the L44L coding region in both sequences begins with ATG (the reverse complement of CAT) and ends with TAA (the reverse complement of TTA). Similarly, all four introns conform to the normal GT-AG splicing consensus, which appears as CT-AC because the gene lies on the reverse strand.
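The reverse-strand reading of these codons and splice sites can be checked with a few lines of Python. The helper names `reverse_complement` and `forward_ends` are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_ends(left, right):
    """Convert a 'left-right' pair listed in first-sequence coordinates into
    the (first, last) pair read along a reverse-strand gene."""
    return reverse_complement(right), reverse_complement(left)


print(forward_ends("TTA", "CAT"))  # ('ATG', 'TAA'): start and stop codons
print(forward_ends("CT", "AC"))    # ('GT', 'AG'): intron splice consensus
```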

When the second sequence file is split into contigs, the output is somewhat more complicated. It begins with an enumeration of the contigs (using their FastA header lines), and positions in the second sequence are identified as i:j which denotes position j in contig i. For instance, the index might include:

Index of fragments of the second sequence:
	1: >Contig101
	2: >Contig102

and the listing of exon positions might contain:

< L44L:	27591-30475	1:35159-2:1479
  CDS:	27641-30438	1:35202-2:1442	TTA-CAT	TTA-CAT
   5:	27591-27661	1:35159-1:35222	CT-AC	CT-AC
   4:	27932-28054	1:35586-1:35708	CT-AC	CT-AC
   3:	29660-29727	2:581-2:648	CT-AC	CT-AC
   2:	29918-30023	2:976-2:1081	CT-AC	CT-AC
   1:	30436-30475	2:1440-2:1479

Here, the left-most CDS position (end of the stop codon) aligns with position 35202 in contig 1, which has FastA header line ">Contig101".

Raw blastz output

The alignment file produced by PipMaker's blastz program can be viewed in laj, which is an interactive tool for viewing and manipulating pairwise alignment output. The laj program, which is written in Java, must be downloaded and run on the user's computer. Also, alignments in this format can be submitted to the SGP-1 gene prediction program.

Order and orient contigs

PipMaker can predict the order and orientation of contigs (contiguous segments) in the second sequence. This will return three files:

A text file containing the predictions, with one line for each oriented and ordered contig (example). The first entry on a line is the range of positions in the first sequence that are covered by the contig's local alignment, when the contig is oriented as predicted. The second entry is the normalized score of the contig's highest-scoring local alignment, obtained by dividing the raw score by the alignment-score threshold. Thus, a small value, say less than 5, indicates that the contig's predicted position is only weakly supported. The third item in each row is the contig's FastA header line (">" followed by arbitrary characters), with an appended character "-" indicating the reverse complement. Following the contigs for which a prediction could be made, the header lines for all other contigs are given, in the order that they appear in the submitted file.
A dotplot view of the alignment that would be obtained after the predicted order and orientation operations are applied to the second sequence. (If the "before" dotplot using the original order and orientation is desired, it must be explicitly requested by the user.)
The "rearranged" second sequence, obtained by applying the predicted operations.

The second sequence is assumed to consist of contigs, separated either by FastA header lines, or by runs of 100 or more letters "N". PipMaker starts by replacing each such run of N's by a synthesized FastA header line of the form ">word.Ci", where ">word" is the first word on the closest preceding FastA header line present in the original submission, and i runs 2, 3, 4, ... . Thus, a second sequence consisting of two N-free segments separated by 100 Ns, such as

>GeneX an imaginary gene
ACGT...TGCA
NNNN...NNNN
TGCA...ACGT

becomes:

>GeneX an imaginary gene
ACGT...TGCA
>GeneX.C2
TGCA...ACGT

(In the preceding larger example, the second sequence began with the FastA line

>gi|9966970|gb|AC011189.5|AC011189 Homo sapiens chromosome 17 clone RP11-231G16 map 17, WORKING DRAFT SEQUENCE, 39 unordered pieces

from which PipMaker extracted the word "AC011189".)
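The splitting step can be sketched in Python as follows. Several details are assumptions, since the text above does not fully specify them: the word is taken as the last "|"-separated field of the first whitespace-delimited token (which reproduces both "GeneX" and "AC011189"), numbering restarts at 2 under each original header, only upper-case N runs are treated as breaks, and empty pieces are dropped. The function names are hypothetical.

```python
import re


def header_word(header):
    """Word used in synthesized headers (assumed rule, see the notes above)."""
    token = header.lstrip(">").split()[0]
    fields = [f for f in token.split("|") if f]
    return fields[-1]


def split_at_n_runs(fasta_text, min_run=100):
    """Replace each run of min_run or more N's with a '>word.Ci' header line."""
    records = []  # list of [header, sequence]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # join sequence lines so runs can span lines
    n_run = re.compile("N{%d,}" % min_run)
    out = []
    for header, seq in records:
        pieces = [p for p in n_run.split(seq) if p]  # drop empty pieces
        word = header_word(header)
        for i, piece in enumerate(pieces, start=1):
            # the first piece keeps the original header; later ones get .C2, .C3, ...
            out.append(header if i == 1 else f">{word}.C{i}")
            out.append(piece)
    return "\n".join(out) + "\n"


example = ">GeneX an imaginary gene\nACGT\n" + "N" * 100 + "\nTGCA\n"
print(split_at_n_runs(example))
```

Shorter runs of N (fewer than 100) are left inside the sequence untouched.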

With the default setting, it is essential that interspersed repeats in the second sequence not be replaced by long runs of N's in the file submitted to PipMaker. (The letter X could be used, but is not recommended.) On the other hand, the prediction will generally be much more reliable if positions of repeats in the first sequence are given to PipMaker, since otherwise the predictions may be based on spurious alignments.

Optionally, the user can disable the "Break at NN.." option, and thereby get order-and-orientation predictions when NN..N is used to mask regions of the second sequence.

PDF with embedded contig names

Notes on configuring Acrobat Reader: Follow the menu File > Preferences > Weblink and set the Link Information selector to "Always Show". (You will need to run Acrobat Reader in stand-alone mode to do this, since the browser plug-in does not have its own File menu.)

With this option, when the Pip is viewed using Adobe Acrobat Reader, a message that corresponds to the mouse position appears, either in the status area at the bottom of the Acrobat window or as a tooltip. When the mouse is positioned in a region of the pip corresponding to a local alignment, the message describes the region covered by the alignment, whereas between two alignments it describes the gap. The first word on the FastA header line for the second sequence, or a contig thereof, is reported. This indicates which contig aligns with this region of the first sequence.

For example, the following messages might be associated with various regions of a certain sequence (submitted as the first sequence to PipMaker).

   region    message
   1-1400    1400 bp unmatched
1401-1586    1401-1586 matches 3146-3324 of Contig35
1587-2191    605 bp gap; 889 bp in Contig35
2192-3367    2192-3367 matches 4214-5531 of Contig35
3368-3888    521 bp gap; between Contig35 and Contig41-
3889-3934    3889-3934 matches 6-51 of Contig41-
3935-4089    155 bp gap; 1013 bp in Contig41-
4090-4375    4090-4375 matches 1065-1340 of Contig41-

The notation "Contig41-" refers to the reverse complement of the sequence with FastA header ">Contig41 ...". In this example, the first sequence has two local alignments with Contig35, followed by two local alignments to the reverse complement of Contig41. Note that the latter pair of alignments are separated by 155 bp (positions 3935-4089) in the first sequence and 1013 bp in Contig41-, perhaps due to an insertion at this point in the second sequence.
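The arithmetic behind these messages can be reproduced with a short Python sketch. It is a reconstruction from the example table only, not PipMaker's code: the function name `region_messages` and the tuple layout are hypothetical, the alignments are assumed to be sorted and non-overlapping in the first sequence, and no message is produced for an unmatched region after the last alignment because the example does not show one.

```python
def region_messages(alignments):
    """Return (region, message) rows for a list of alignments, each given as
    (s1_start, s1_end, s2_start, s2_end, contig)."""
    rows = []
    prev = None  # previous alignment, if any
    pos = 1      # next uncovered position in the first sequence
    for aln in alignments:
        s1_start, s1_end, s2_start, s2_end, contig = aln
        if s1_start > pos:
            gap = s1_start - pos
            if prev is None:
                msg = f"{gap} bp unmatched"
            elif prev[4] == contig:
                # same contig: also report the gap length within that contig
                msg = f"{gap} bp gap; {s2_start - prev[3] - 1} bp in {contig}"
            else:
                msg = f"{gap} bp gap; between {prev[4]} and {contig}"
            rows.append((f"{pos}-{s1_start - 1}", msg))
        rows.append((f"{s1_start}-{s1_end}",
                     f"{s1_start}-{s1_end} matches {s2_start}-{s2_end} of {contig}"))
        prev, pos = aln, s1_end + 1
    return rows


alignments = [
    (1401, 1586, 3146, 3324, "Contig35"),
    (2192, 3367, 4214, 5531, "Contig35"),
    (3889, 3934, 6, 51, "Contig41-"),
    (4090, 4375, 1065, 1340, "Contig41-"),
]
for region, message in region_messages(alignments):
    print(f"{region:>9}  {message}")
```

Run on the four alignments from the example, the sketch yields the same eight regions and messages as the table above.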

These messages in the pip are not intended to be clickable -- we are slightly abusing the URL mechanism by storing the messages as URLs, because it is the most convenient way to display a short string.

User-controlled masking of interspersed repeats and low-complexity regions

PipMaker attempts to eliminate spurious alignments caused by interspersed repeat elements and low-complexity regions, yet indicate meaningful alignments that may involve such elements. For instance, an interspersed repeat or tri-nucleotide repeat sequence may be part of a protein-coding region. The strategy is not perfect, however, and the user may want to take control of the way these elements are handled during construction of the alignment.

The alignment program used by PipMaker interprets lower case letters as indicating regions of the sequence that are to be masked at early stages of the alignment process, but not at later stages. The alignment program follows the general design of the "gapped Blast" family of programs, which start by finding short, exact matches, then extend those matches to alignments that include gaps. PipMaker ignores regions of the first sequence that contain lower case letters when searching for exact matches, but utilizes those regions when expanding exact matches to form longer alignments. (Regions containing only "N" or "X" characters are not aligned in either phase.) Users of the Advanced PipMaker page can control masking by submitting sequences that follow this convention.
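A user who already knows which regions to mask could prepare a sequence in this convention with a few lines of Python. This is only a sketch: the function name `soft_mask` is hypothetical, and the ranges are assumed to be 1-based and inclusive, matching the position ranges used elsewhere on this page.

```python
def soft_mask(seq, ranges):
    """Lower-case the given 1-based, inclusive ranges of seq and leave
    every other character unchanged."""
    chars = list(seq)
    for start, end in ranges:
        if not 1 <= start <= end <= len(chars):
            raise ValueError(f"range out of bounds: {start}-{end}")
        for i in range(start - 1, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


print(soft_mask("ACGTACGTAC", [(3, 6)]))  # ACgtacGTAC
```

Because the sketch never converts letters to upper case, any masking already present in the submitted sequence is preserved.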