DNA/Protein Sequence Analysis: Multiple Comparison



PILEUP

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree (dendrogram) showing the clustering relationships used to create the alignment.

PileUp creates the multiple sequence alignment using a simplification of the progressive alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method is similar to the one described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of the pairwise alignments that created the final alignment.

As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final alignment restricted to a maximum length of 7,000 characters (including gap characters inserted into the sequence by PileUp to create the alignment). However, if you include long sequences in the alignment, the number of sequences PileUp can align decreases.

 

Screen Monitoring

PileUp displays the name of each sequence to be aligned as it is read in. It then displays the message Determining pairwise similarity scores... and shows a quality ratio for every pairwise alignment. This ratio is the alignment's quality divided by the length of the shorter sequence. If x is the number of sequences to be aligned, there are (x(x-1))/2 pairwise alignments whose ratio must be calculated.

Next PileUp displays the message Aligning... as it performs each of the pairwise alignments that together create the final multiple sequence alignment. There are x-1 alignments in this part of the program.
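The two counts above grow very differently with the number of sequences. The following minimal sketch only evaluates the two formulas from the text; the function names are illustrative and are not part of PileUp.

```python
def pairwise_comparisons(x: int) -> int:
    """Number of pairwise similarity ratios for x sequences: x(x-1)/2."""
    return x * (x - 1) // 2


def progressive_alignments(x: int) -> int:
    """Number of alignments in the progressive (Aligning...) stage: x-1."""
    return x - 1


if __name__ == "__main__":
    # 8 is the number of UniProt sequences in the practice exercise;
    # 500 is the maximum number of sequences PileUp can align.
    for x in (8, 100, 500):
        print(x, pairwise_comparisons(x), progressive_alignments(x))
```

For the 8 protein sequences used in the practice exercise this gives 28 pairwise comparisons followed by 7 progressive alignments.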

 

Input Files

PileUp accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example *pep. The function of PileUp depends on whether your input sequences are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence.

 

Restrictions

PileUp restricts each sequence in the final alignment to a maximum length of 7,000 characters. This maximum length includes the input sequence length plus the total length of all gap characters inserted into the sequence to create the final alignment. By default, each input sequence is restricted to a maximum length of 5,000. Also by default, PileUp can add a maximum of 2,000 gap characters for each sequence in the final alignment.

If you wish to align longer sequences, then you can specify a maximum sequence length of up to 7,000 bp. If you increase the maximum sequence length in this way, then the maximum amount of allowed gapping is automatically reduced so that the final aligned sequence length cannot exceed 7,000 for any sequence.

If you wish to allow for more gapping in the final alignment, then you can specify a maximum number of gap characters for each sequence. If you increase the maximum amount of gapping permitted for each sequence in this way, the maximum sequence length is automatically decreased so that the final aligned sequence length cannot exceed 7,000 for any sequence.

The total length of all of the sequences read into PileUp (including the gap allowance for each sequence) cannot be greater than 2,000,000. By reducing the gap allowance for each sequence you can increase the number of sequences that can be read into the program up to the maximum of 500 sequences.
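The trade-off between sequence length, gap allowance, and number of sequences can be estimated with simple arithmetic. The sketch below assumes, for simplicity, that every input sequence has the same length; the function name and parameters are illustrative only, and the limits are the ones quoted above.

```python
MAX_ALIGNED_LENGTH = 7_000    # per-sequence limit: residues plus inserted gaps
MAX_TOTAL_LENGTH = 2_000_000  # all sequences, including each gap allowance
MAX_SEQUENCES = 500


def max_sequences(seq_length: int, gap_allowance: int = 2_000) -> int:
    """Estimate how many equal-length sequences fit within the limits."""
    per_sequence = seq_length + gap_allowance
    if per_sequence > MAX_ALIGNED_LENGTH:
        raise ValueError("sequence length plus gap allowance exceeds 7,000")
    return min(MAX_SEQUENCES, MAX_TOTAL_LENGTH // per_sequence)


if __name__ == "__main__":
    print(max_sequences(5_000))                     # default gap allowance
    print(max_sequences(5_000, gap_allowance=500))  # reduced gap allowance
    print(max_sequences(1_000))                     # short sequences
```

Under this assumption, sequences of 5,000 characters with the default allowance of 2,000 gaps limit the run to 285 sequences, while reducing the allowance to 500 raises the estimate to 363.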

 

Algorithm

A rigorously optimal alignment of even a small number of short sequences would be intractable, both in terms of memory and time. Therefore, PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

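The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example, consider a dendrogram of five sequences, Seq1 through Seq5:

[Figure: example dendrogram of Seq1 through Seq5]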

PileUp uses this clustering order and first aligns the two most-related sequences to each other in order to produce the first cluster. It then aligns the next most related sequence to this cluster or the next two most-related sequences to each other in order to produce another cluster. A series of such pairwise alignments that includes increasingly dissimilar sequences and clusters of sequences at each iteration produces the final alignment.

In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally, Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final alignment of Seq1 through Seq5.
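The clustering order in this example can be reproduced with a small UPGMA sketch. The similarity values below are invented for illustration, and the code works directly on similarities (joining the pair of clusters with the highest average similarity), which is an assumption about how the scores are used rather than a description of PileUp's internals.

```python
from itertools import combinations


def upgma_order(names, similarity):
    """Return the cluster merges in the order UPGMA performs them.

    similarity maps frozenset({a, b}) to a score for every pair of names.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average(c1, c2):
        # Unweighted arithmetic average over all cross-cluster pairs.
        total = sum(similarity[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


def example_similarity():
    """Hypothetical similarity scores for Seq1 through Seq5."""
    names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
    sim = {frozenset(pair): 0.3 for pair in combinations(names, 2)}
    for a in ("Seq1", "Seq2"):
        for b in ("Seq3", "Seq4"):
            sim[frozenset((a, b))] = 0.6
    sim[frozenset(("Seq1", "Seq2"))] = 0.9
    sim[frozenset(("Seq3", "Seq4"))] = 0.8
    return names, sim


if __name__ == "__main__":
    for left, right in upgma_order(*example_similarity()):
        print("+".join(left), "aligned to", "+".join(right))
```

With these scores the merges follow the order described above: Seq1 with Seq2, Seq3 with Seq4, the two clusters with each other, and finally Seq5.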

Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of Molecular Biology 48; 443-453 (1970)), extended for use with clusters of aligned sequences rather than only individual sequences. For a pairwise alignment of individual sequences, the comparison score between any two sequence symbols is found in a scoring matrix. For a pairwise alignment of clusters of sequences, the comparison score between any two positions in those clusters is simply the arithmetic average of the scores for all possible symbol comparisons at those positions. When gaps are inserted into a cluster to produce an alignment, they are inserted at the same position in all of the sequences of the cluster.
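The extension to clusters can be sketched as follows. This is a simplified illustration, not PileUp's implementation: it uses a hypothetical identity scoring function in place of a real scoring matrix, a single linear gap penalty instead of separate gap creation and extension penalties, and it penalizes end gaps.

```python
def toy_score(a, b):
    """Hypothetical scoring: 1 for identical residues, 0 otherwise."""
    return 1.0 if a == b and a != "." else 0.0


def column_score(col_a, col_b, score=toy_score):
    """Arithmetic average of the scores for all symbol pairs in two columns."""
    total = sum(score(a, b) for a in col_a for b in col_b)
    return total / (len(col_a) * len(col_b))


def align_clusters(cluster_a, cluster_b, gap=1.0, score=toy_score):
    """Needleman-Wunsch over the columns of two clusters of aligned sequences."""
    cols_a, cols_b = list(zip(*cluster_a)), list(zip(*cluster_b))
    n, m = len(cols_a), len(cols_b)
    # best[i][j]: best score aligning the first i columns of A with the first j of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1], score),
                best[i - 1][j] - gap,
                best[i][j - 1] - gap,
            )
    # Trace back; a gap is written into every sequence of the gapped cluster.
    out_a, out_b = [[] for _ in cluster_a], [[] for _ in cluster_b]
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + column_score(
            cols_a[i - 1], cols_b[j - 1], score
        ):
            col_a, col_b = cols_a[i - 1], cols_b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] - gap):
            col_a, col_b = cols_a[i - 1], "." * len(cluster_b)
            i -= 1
        else:
            col_a, col_b = "." * len(cluster_a), cols_b[j - 1]
            j -= 1
        for row, symbol in zip(out_a, col_a):
            row.append(symbol)
        for row, symbol in zip(out_b, col_b):
            row.append(symbol)
    aligned = ["".join(reversed(row)) for row in out_a + out_b]
    return aligned, best[n][m]


if __name__ == "__main__":
    aligned, quality = align_clusters(["ACGT", "ACGT"], ["AGT"])
    print("\n".join(aligned), quality)
```

In the usage example, a cluster of two identical sequences is aligned to a single shorter sequence, and one gap is placed opposite the unmatched column.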

Because a rigorously optimal alignment of even a small number of short sequences would be intractable, PileUp uses a progressive approach that may not produce the optimal multiple sequence alignment.

 

Clustering

The approach used by PileUp is sensitive to the order in which sequences are aligned. A clustering algorithm determines this order from the pairwise similarities calculated before the final alignments are done. The goal of the clustering is to see that very similar sequences are aligned to each other before they are aligned to more distantly related sequences. There is, at present, no way for you to modify the order of these alignments.

While PileUp calculates the similarity between each of the sequences, this information is not used by the program to weight the sequences. That is, if there are several very similar sequences, the final alignment may be constrained to minimize the disruption of these sequences.

The dendrogram is not a phylogenetic reconstruction, although the vertical branch lengths are proportional to the distance between the sequences. Its purpose is to represent the clustering order used to create the final alignment. This order is the only information from the dendrogram used by PileUp. See the RELATED PROGRAMS topic for a description of programs in the Wisconsin Package that you can use to create phylogenetic reconstructions from multiple sequence alignments.

 

Global Alignment

If you know the difference between Gap and BestFit, consider PileUp an extension of the Gap program for more than two sequences, rather than an extension of the BestFit program. PileUp, like Gap, tries to find a global optimal alignment, while BestFit finds a local optimal alignment.

Because PileUp aligns sequences along their entire lengths, it is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. However, PileUp has been used successfully for this purpose.

By default, PileUp does not penalize gaps occurring at the ends of sequences. Therefore, related sequences that differ in the extent of their sequencing can be reasonably aligned by PileUp. You can override this default by selecting -ENDWeight, in which case length differences among the sequences become significant.

 

Piling Up Unrelated Sequences

PileUp always aligns all of the sequences you specify, even if they are not related. The alignment can be degraded if some of the sequences are not similar to one another.

 

Arbitrary Gap Placement

In any pairwise alignment, the position of the inserted gaps may be arbitrary; equally optimal alignments can be generated by inserting the gaps differently. PileUp can exaggerate these arbitrary differences if you select either the -LOWroad or -HIGhroad parameters. This selection usually affects the final alignment. For the most part, however, the difference between the high road and low road alignments should not be very significant, although you may want to check.

Here is an example showing the difference between high and low road for the alignment of three short sequences. The first pairwise alignment creates an aligned cluster of the two most closely related sequences; the second alignment aligns this cluster to the third sequence creating the final multiple sequence alignment. Although the qualities after the first round alignments are the same, the quality of the final low-road alignment is higher than the high-road one.

High road alignments shift all of the arbitrary gaps in the second sequence or cluster of aligned sequences to the right and all of the arbitrary gaps in the first sequence or cluster of aligned sequences to the left. Low road alignments do the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road when that is not possible.

 

Scoring Matrices

The default scoring matrices are not necessarily appropriate for all alignments. Several alternative scoring matrices suitable for multiple sequence alignments are provided. PileUp chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix the program will adjust the default gap penalties accordingly.

 

Practice Exercise

The following exercise will use several sequences that you will need to transfer to your main list. There are 2 types of sequences, nucleic acid and peptides. Please place all these sequences into the list you created for this course.

 From the UniProt database: P17538, P15157, P40313, P08218, P00750, P07477, P03951, Q04756

 From the GenBank:primate database: M24400, BC005385, BT007356, BC063475

 

The first PileUp that we will perform is a protein/peptide multiple sequence alignment. Select the protein/peptide sequences from UniProt (there are 8 sequences). Move your cursor to Functions and select PileUp from the Multiple Comparison Menu.

 

The following screen will appear. Click on the Options button to display the options menu.

 

For this exercise, select “don’t penalize gaps at the ends...”, “select top alignment...”, “sequence ordered by similarity....”, and “Plot dendrogram....”. Close the Options window and Select Run from the PileUp Main window.

 

Output of this Alignment

 

The next figure provides the dendrogram of the PileUp alignment. This is not a phylogenetic tree, only a representation of the pairwise comparison used to create the multiple sequence alignment.

 

Close the dendrogram window and go to the Output Manager. Select the “.msf” file and add this file to the Main Window. This will load the sequences into your temporary list, which we will use later.

 

Your Main Window should now contain these sequences.

To become a little more familiar with the PileUp program, select the nucleic acid sequences that are in your main list and run the PileUp program. From the Options Menu, select options that we did not use for the peptide alignment. After you have briefly reviewed the results, be sure to add the “.msf” file from this alignment to your Main List.

 

PlotSimilarity

PlotSimilarity calculates the average similarity among all members of a group of aligned sequences at each position in the alignment, using a user-specified sliding window of comparison. The window of comparison is moved along all sequences, one position at a time, and the average similarity over the entire window is plotted at the middle position of the window. The average similarity across the entire alignment is plotted as a dotted line.

If you give PlotSimilarity a single input sequence, you can choose the range and strand for that sequence, and then PlotSimilarity prompts you for the name, range, and strand of a second input sequence. In this way, you can plot the average similarity between two aligned sequences, such as those in the output files created by Gap.

PlotSimilarity accepts multiple (two or more) aligned nucleotide sequences or aligned protein sequences as input. The multiple sequence alignment created by the PileUp program can be used as input to PlotSimilarity. The gapped output files from the Gap and BestFit programs, which were created using the Options Menu, can also be used as input to PlotSimilarity. If the first sequence entered into PlotSimilarity is a single sequence, the program prompts you for the second sequence.

 

Algorithm

The average similarity at a position in an alignment is the arithmetic average of the scores of all possible pairwise symbol comparisons among the sequence symbols at that position. The comparison score between any two sequence symbols is the comparison value between those symbols in the scoring matrix multiplied by the weight of each of the two sequences. The average similarity across the entire alignment (plotted as a dotted line) is the sum of the separate window similarities divided by the number of windows.

If “plot the level of identity....” is selected, the program plots a measure of the level of identity among all sequences in the multiple sequence alignment. The calculations are done exactly as described above, but all identical symbol comparisons are given a value of 1; all other comparisons are given a value of 0.
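The calculation can be sketched as below. The sketch uses the identity scoring described in the preceding paragraph (1 for identical symbols, 0 otherwise) and equal sequence weights by default; how gap characters are treated is not stated in the text, so here they are compared like any other symbol, which is an assumption.

```python
from itertools import combinations


def identity(a, b):
    """Identity scoring: 1 for identical symbols, 0 for all other comparisons."""
    return 1.0 if a == b else 0.0


def position_similarity(column, score=identity, weights=None):
    """Average weighted score over all pairwise symbol comparisons in a column."""
    if weights is None:
        weights = [1.0] * len(column)
    pairs = list(combinations(range(len(column)), 2))
    total = sum(score(column[i], column[j]) * weights[i] * weights[j] for i, j in pairs)
    return total / len(pairs)


def windowed_similarity(alignment, window, score=identity):
    """Return the per-window averages and the overall average across windows."""
    per_position = [position_similarity(col, score) for col in zip(*alignment)]
    windows = [
        sum(per_position[start:start + window]) / window
        for start in range(len(per_position) - window + 1)
    ]
    overall = sum(windows) / len(windows)
    return windows, overall


if __name__ == "__main__":
    windows, overall = windowed_similarity(["ACGT", "ACGA", "ACTA"], window=2)
    print(windows, overall)
```

Each value in the returned list would be plotted at the middle position of its window, and the overall value corresponds to the dotted line.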

If -PROFile is selected, the program plots a running average of the positional conservation in a profile. The measure of conservation at any position is the difference between the greatest and least values at that position in the profile. The profile is created in a program called ProfileMake. This gives a result very similar to selecting “Include the plot of overall similarity”, which does not require a profile to be created.
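The conservation measure described for -PROFile is easy to state in code. This is a minimal sketch with made-up profile values; the window handling is assumed to be the same simple sliding average used for the similarity plot.

```python
def positional_conservation(profile_rows):
    """Difference between the greatest and least value in each profile row."""
    return [max(row) - min(row) for row in profile_rows]


def running_average(values, window):
    """Simple sliding-window average over a list of values."""
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical profile rows (one list of symbol scores per position).
    rows = [[5, -2, 0, 1], [1, 1, 1, 1], [3, 0, -1, 2]]
    conservation = positional_conservation(rows)
    print(conservation, running_average(conservation, window=2))
```

A row whose values are all equal carries no positional preference and therefore has a conservation of zero.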

The PlotSimilarity program provides a graphical representation of a multiple sequence alignment or two sequences generated by BestFit or GAP (you must use the individual sequences generated from the Pairwise Sequence Analysis programs). For this exercise, select the “.msf” file from the peptide alignment, move your cursor to Functions and select PlotSimilarity from the Multiple Comparison Menu.

 

Next, click on the Options button to enter the Options menu.

 

From the Options Menu, select “continuous curve”, “Include the plot of overall similarity”, and “minimum and maximum values calculated.....”. Close the Options Menu and select Run from the PlotSimilarity window.

 

A page will be displayed that contains a graph of the regions of similarity in the PileUp alignment.

 

 

PRETTY

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

Pretty prints sequences with their columns aligned and can display a consensus for the alignment, allowing you to look at relationships among the sequences. This program can be used for aligned sequences in an MSF (multiple sequence format) or RSF (rich sequence format) file, or for separate sequences that have had gaps added to make them all align.

Pretty accepts one or more aligned nucleotide sequences or aligned protein sequences as input. You can specify an MSF file, such as the output file from a session with PileUp (for example pileup.msf{*}), as input to Pretty. Weights can be specified for sequences in MSF files. (See the Vote Weight discussion below.)

 

Weighting Sequences (Vote Weight)

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. The vote weight is the vote that each row casts for the consensus. A weight of 1.0 is assumed if no vote weight is specified.

You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence at the top of the file.
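The effect of vote weights can be illustrated with a simplified consensus calculation. This sketch only models the idea that each row casts its weight for the symbol it contains; it does not reproduce Pretty's actual consensus rules or any of its thresholds, and the function name is hypothetical.

```python
from collections import defaultdict


def weighted_consensus(alignment, weights=None):
    """Pick, for each column, the symbol with the largest total vote weight."""
    if weights is None:
        weights = [1.0] * len(alignment)  # a weight of 1.0 is assumed by default
    consensus = []
    for column in zip(*alignment):
        votes = defaultdict(float)
        for symbol, weight in zip(column, weights):
            votes[symbol] += weight
        consensus.append(max(votes, key=votes.get))
    return "".join(consensus)


if __name__ == "__main__":
    alignment = ["ACGT", "ACGT", "ATGA"]
    print(weighted_consensus(alignment))                   # equal votes
    print(weighted_consensus(alignment, [0.5, 0.5, 2.0]))  # down-weight the similar pair
```

With equal votes the two identical sequences decide every column; lowering their weights lets the third sequence determine the columns where it differs.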

For this exercise, we will use the “.msf” file that you created from the nucleic acid alignment. Select the file and move your cursor to Functions Menu and select PRETTY from the Multiple Comparison Menu.

 

From the Main PRETTY window click on the Options button.

 

From the Options Menu, select “Display consensus sequence”, and “show positions agreeing....”. Close the Options Menu and select Run from the main PRETTY window.

 

Example of a PRETTY output file

 

 

ProfileMake

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

ProfileMake uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) to create a profile from a group of aligned sequences. A profile is a table that contains all of the comparison information of a group of aligned sequences. These sequences must be aligned before running ProfileMake. The profile contains as many rows as there are positions in the aligned sequences. Each row contains a score for the alignment of the corresponding position of the aligned sequences with each possible base or residue.

The profile is the input data for ProfileSearch, which can find sequences in the database similar to your group of aligned sequences, and ProfileGap, which can make an optimal alignment between the aligned sequences and another sequence.

ProfileMake accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences by using an MSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example *pep. The function of ProfileMake depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

 

Algorithm

Similarity Scores

In a scoring matrix, a score can be found for the comparison of any two sequence symbols. Given a group of aligned sequences, a score can be calculated for the comparison of a symbol to each position of the aligned sequences. This comparison score differs from position to position in the aligned sequences, because each position contains a different spectrum of sequence symbols. The overall score is, in a sense, the average of the comparison scores for the sequence symbols found at a particular aligned sequence position.

Each row of a profile contains the scores for a comparison of the corresponding position of a multiple sequence alignment to each possible sequence symbol. For example, if a profile is made from a group of aligned protein sequences, the 10th row of the profile has values for the comparison of the 10th position in the alignment to each possible amino acid. The profile has as many rows as there are positions in the alignment, and each row has as many comparison scores as there are amino acid symbols. Thus, the profile is a position-specific scoring matrix for every position in a multiple sequence alignment.

The consensus sequence character is the symbol with the largest value in each row of the profile. It is used solely for the display of alignments and not for the calculation of the optimal alignment between a profile and a sequence.

The last row of the profile contains the composition for the whole profile. In the A column, for instance, the total number of A's in the multiple sequence alignment is shown.
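The structure of a profile can be sketched in a few lines. The example below uses a nucleotide alphabet, a hypothetical scoring function in place of a real scoring matrix, and a plain average over the observed symbols (equivalent to the linear weighting discussed in the next section); gap characters are simply ignored, which is an assumption.

```python
ALPHABET = "ACGT"


def toy_matrix(a, b):
    """Hypothetical scoring matrix: 1.0 for a match, -0.5 for a mismatch."""
    return 1.0 if a == b else -0.5


def make_profile(alignment, alphabet=ALPHABET, score=toy_matrix):
    """Return one row of symbol scores per alignment position, plus composition."""
    rows = []
    for column in zip(*alignment):
        observed = [symbol for symbol in column if symbol in alphabet]
        row = {}
        for candidate in alphabet:
            if observed:
                # Average comparison score of the candidate against the column.
                row[candidate] = sum(score(candidate, b) for b in observed) / len(observed)
            else:
                row[candidate] = 0.0
        rows.append(row)
    # Last row of the profile: total count of each symbol in the alignment.
    composition = {s: sum(seq.count(s) for seq in alignment) for s in alphabet}
    return rows, composition


def consensus(rows):
    """Consensus character: the symbol with the largest value in each row."""
    return "".join(max(row, key=row.get) for row in rows)


if __name__ == "__main__":
    rows, composition = make_profile(["ACGT", "ACGA", "ACTA"])
    print(consensus(rows), composition)
```

Each dictionary in `rows` corresponds to one row of the profile, with one score per possible symbol.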

Sequence Symbol Weights

As stated above, the comparison score of an alignment position and a given sequence symbol is an average of the comparison scores for the different sequence symbols at that position. This average is weighted so that a symbol's weight in the calculation of the average score increases along with its fraction of the symbols at that position. Two types of weighting are currently used. Linear weighting gives a weight to each symbol that is directly proportional to the number of occurrences of that symbol at a given position. The default logarithmic weighting gives a symbol that predominates at a given position a disproportionately higher weight than a symbol that occurs only once. This causes positions in the aligned sequences that have many identical residues to bias the profile more strongly towards the identical residues than when linear weighting is used.

Using either kind of weighting, the weight for a residue is 0 when that residue does not occur at a given position; the weight is 1 when only that residue is found at a given position.

If the number of aligned sequences is fairly small, the sequence symbols observed at each position of the alignment may not represent the whole spectrum of symbols that would be observed if more sequences were available. In these cases, even residues that are not observed at a given position in the alignment should perhaps be given a small weight. For nucleic acids, non-observed bases are given a weight of 0 by default. The default for proteins is to give non-observed amino acids a weight equal to 0.025 divided by the sum of the sequence weights. The -STRINgent command-line parameter gives non-observed sequence symbols a weight of 0.
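The two weighting schemes can be compared numerically. The text above does not give the logarithmic formula, so the form used here, ln(1 - n/(N+1)) / ln(1/(N+1)), is an assumption taken from the profile analysis literature; it does satisfy the stated boundary conditions (weight 0 when the residue is absent, weight 1 when it is the only residue). The function names are illustrative.

```python
import math


def linear_weight(n, total):
    """Weight directly proportional to the number of occurrences."""
    return n / total


def log_weight(n, total):
    """Assumed logarithmic weighting: ln(1 - n/(N+1)) / ln(1/(N+1))."""
    return math.log((total + 1 - n) / (total + 1)) / math.log(1 / (total + 1))


def unobserved_weight(sequence_weights, stringent=False):
    """Default protein weight for non-observed residues: 0.025 / sum of weights.

    With stringent=True (the -STRINgent behavior) the weight is 0.
    """
    return 0.0 if stringent else 0.025 / sum(sequence_weights)


if __name__ == "__main__":
    total = 10
    for n in (0, 1, 5, 9, 10):
        print(n, round(linear_weight(n, total), 3), round(log_weight(n, total), 3))
    print(unobserved_weight([1.0] * 5))
```

Under this assumed formula, a symbol seen in 9 of 10 sequences outweighs a symbol seen once by a much larger factor than the 9-to-1 ratio of linear weighting.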

Gap Coefficients

The profile also includes position-specific gap coefficients, expressed as percentages. The gap coefficient determines the penalty that an alignment must pay in order to create a gap, and the gap length coefficient determines the penalty that must be paid in order to extend a gap. The actual gap penalties are calculated by multiplying the position-specific gap coefficients by the gap penalties specified when running the other Profile programs.

All gaps in the aligned sequences that overlap are treated as a single gap for purposes of calculating gap coefficients. The gap is considered to begin at the position of the leftmost gap character (. or ~) in any of the sequences, and to end at the rightmost gap character. The position-specific gap coefficients are reduced from 100 percent as a function of the longest gap through the position of interest in the aligned sequences. The gap coefficient G and gap length coefficient L are calculated as:

G = C(G) x ( R(G) / (1 + GapLength x R(L)) )

L = C(G) x ( R(G) / (1 + GapLength x R(L)) )

Where GapLength is the length of the gap as defined above. GapCoefficient (C(G)), GapRatio (R(G)), and GapLengthRatio (R(L)) have default values of 100, 0.33, and 0.1 respectively, but can be changed by optional parameters entered on the command line (see the COMMAND LINE SUMMARY topic below).
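The coefficient formula can be evaluated directly. The sketch below uses the default values given above; the second function reflects one reading of "expressed as percentages", namely that the coefficient divided by 100 scales the gap penalty supplied to the other Profile programs. Both function names are illustrative.

```python
def gap_coefficient(gap_length, c_g=100.0, r_g=0.33, r_l=0.1):
    """Evaluate C(G) x ( R(G) / (1 + GapLength x R(L)) ) with the stated defaults."""
    return c_g * (r_g / (1 + gap_length * r_l))


def actual_gap_penalty(coefficient_percent, program_gap_penalty):
    """Scale a program-level gap penalty by a position-specific percentage."""
    return coefficient_percent / 100.0 * program_gap_penalty


if __name__ == "__main__":
    for length in (1, 5, 10):
        g = gap_coefficient(length)
        print(length, round(g, 2), round(actual_gap_penalty(g, 4.5), 3))
```

Longer gaps in the original alignment produce smaller coefficients, so gaps are cheaper to open at those positions when a new sequence is aligned to the profile. The program-level penalty of 4.5 in the usage example is an arbitrary illustrative value.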

You can edit the profile with a text editor and change the gap coefficients to any values you wish.

For this exercise we will use the nucleic acid sequence alignment that you created using PileUp. Select the “.msf” file from your Main List, move your cursor to Functions and select ProfileMake from the Multiple Comparison Menu.

 

From the ProfileMake Window, click on the Options Button.

 

 

From the options Menu, select “exponential weighting” and “give a weight of 0”. Close the Options Menu and select RUN from the ProfileMake window.

 

 

Sample ProfileMake output file

 

ProfileGap

ProfileGap makes an optimal alignment between a profile and one or more sequences.

Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences. The comparison uses a scoring matrix (a derivative of the Dayhoff evolutionary distances table or PAM matrix) and an existing optimal alignment of two or more similar protein sequences. The group or "family" of similar sequences is first aligned together to create a multiple sequence alignment. The information in the multiple sequence alignment is then represented quantitatively as a table of position-specific symbol comparison values and gap penalties. This table is called a profile.

The similarity of new sequences to an existing profile can be tested by comparing each new sequence to the profile with the same algorithm used to make optimal alignments. To understand how this is done we must first recall what alignment algorithms do. Alignment algorithms find alignments between two sequences that maximize the number of matches and minimize the number of gaps. The match, for any pair of symbols being compared, is really a value that comes from a scoring matrix that contains a value for every possible pair of sequence symbols. Gaps are given penalties in the same units as the values in the scoring matrix. The best alignment is then simply defined as the alignment for which the sum of the scoring matrix values minus the gap penalties is maximal.

So how does alignment work when a sequence is being aligned to a profile? Each row in the profile corresponds to a position in the original multiple sequence alignment. Each possible sequence symbol has a value (a column) in each row of the profile. The comparison of a sequence symbol to any row of the profile defines a specific value or "profile comparison value." The best alignments of a sequence to a profile are found by aligning the symbols of the sequence to the profile in such a way that the sum of the profile comparison values minus the gap penalties is maximal. The profile also contains gap coefficients that are specific for each position so the penalty for inserting a gap in one part of the alignment might be more or less than in another part. The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps in more variable regions.
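The scoring scheme just described can be sketched as a small dynamic programming routine. This is a simplified illustration rather than the ProfileGap algorithm: it computes only the best global score, uses one penalty per gap position instead of separate creation and extension penalties, and charges every gap the coefficient of the current profile row. The profile values and coefficients in the example are invented.

```python
def profile_alignment_score(profile, gap_percent, sequence, gap_penalty=1.0):
    """Best global score for aligning a sequence to a profile.

    profile: one dict per row mapping each symbol to its comparison value.
    gap_percent: one position-specific gap coefficient (percent) per row.
    """
    n, m = len(profile), len(sequence)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap_penalty * gap_percent[i - 1] / 100
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap_penalty * gap_percent[0] / 100
    for i in range(1, n + 1):
        cost = gap_penalty * gap_percent[i - 1] / 100  # position-specific penalty
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + profile[i - 1][sequence[j - 1]],  # match row to symbol
                best[i - 1][j] - cost,  # skip this profile row
                best[i][j - 1] - cost,  # skip this sequence symbol
            )
    return best[n][m]


def toy_row(preferred, alphabet="ACGT"):
    """Hypothetical profile row: 1.0 for the preferred symbol, -1.0 otherwise."""
    return {s: (1.0 if s == preferred else -1.0) for s in alphabet}


if __name__ == "__main__":
    profile = [toy_row("A"), toy_row("C"), toy_row("G")]
    print(profile_alignment_score(profile, [100, 100, 100], "AG"))
    print(profile_alignment_score(profile, [100, 20, 100], "AG"))
```

In the usage example the same sequence scores higher when the middle row has a low gap coefficient, because skipping that variable position costs less.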

The profile contains a consensus sequence for the display of alignments of other sequences to the profile. The consensus sequence character corresponds to the highest value in the row. Since the table on which the profile is based is usually the Dayhoff evolutionary distance table, the consensus residue is the residue that has the smallest evolutionary distance from all of the residues in that position of the alignment rather than simply the most frequent residue at that position.

 

Looking for Structural Motifs with Profiles

Gribskov, et al. (CABIOS 4; 61-66 (1988)) have aligned the sequences from a number of known protein structural motifs and calculated a group of profiles from these alignments. ProfileScan compares any new protein sequence to each of the profiles in this motif database to find out if any of these known motifs occur in the protein. This is one of the few techniques that can reliably predict the location of structural features in protein sequences.

Database Searching with Profiles

A search of the database using a profile as a probe involves making an optimal alignment of every sequence in the database to the profile and listing the alignments for which the alignment score is outstanding.

The profile method has several advantages over most sequence comparison methods. A profile represents the common characteristics of a family of similar sequences, where any single sequence is just one realization of the family's characteristics. Since the profile represents the alignment of a number of known sequences, it contains information that defines where the family of sequences is conserved and where it is variable. The comparison of a new sequence to a profile can emphasize similarity to conserved regions while tolerating diversity in variable regions. A database search can be more sensitive since each sequence in the database is compared to more generalized information than is possible in searches based on pairwise comparisons between two sequences.

Conventional database searching methods require some minimal level of sequence identity between the sequences for any signal to be generated. The profile search, since it is based on quantitative symbol comparisons, can find similarities between sequences with little or no sequence identity.

The alignment of a sequence to a profile is inherently more sensitive since the whole surface of comparison can be used to find the optimal alignment. Conventional methods of searching like the Wilbur and Lipman method use scores that come from one or a small number of adjacent diagonals. The aligned sequences of many protein families suggest that gaps are frequent even in very similar proteins.

 

Experiments Confirm the Sensitivity of Profile Searching

Experiments reported by Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) show that searching the database with a globin profile creates a distribution of alignment scores that more clearly distinguishes known globins from unrelated sequences. Even globins distantly related to the group used to make the profile were clearly distinguished from non-globin sequences. The non-random part of the distribution of the alignment scores also contained a large number of credibly "globin-like" sequences that were not identified when conventional database searching algorithms were used.

For comparison, the authors searched the PIR protein sequence database with the Lipman-Pearson FASTP program (almost identical to FastA) using human alpha hemoglobin as a probe. The FASTP program selected 244 of the 271 globins in the database. The leghemoglobins could not be clearly distinguished from non-globin sequences.

Steps in Profile Searching

Profile searching has four steps: assembly of a family of related sequences into a multiple sequence alignment with PileUp, construction of a profile from the alignment with the program ProfileMake, comparison of the profile to a database of sequences with ProfileSearch, and finally display of the best similarities found with ProfileSegments. The starting point for the creation of a profile is a sequence or group of aligned sequences. This probe is generally a group of functionally related proteins that have been aligned with tools such as PileUp. A profile, however, can be created from a single sequence.

The profile is then calculated from the multiple sequence alignment with the program ProfileMake. The profile contains position-specific gap coefficients based on the position and length of the gaps in the aligned sequences. The gap and gap length penalty coefficients are higher in regions where no gaps are observed in the aligned sequences, and lower where gaps are observed. When a sequence is aligned to a profile, gaps therefore tend to be placed in the same regions where they occur in the aligned sequences used to generate the profile.
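The idea of position-specific gap coefficients can be illustrated with a minimal sketch. The rule used here (one minus the fraction of sequences gapped at a column, with a lower floor) is an assumption chosen for clarity; it is not the formula ProfileMake uses, and the function name gap_coefficients and the small alignment are hypothetical.

```python
def gap_coefficients(alignment, gap_chars=".~-", floor=0.1):
    """Return one gap-penalty multiplier per alignment column.

    Illustrative rule (an assumption, not ProfileMake's formula):
    multiplier = 1 - (fraction of sequences gapped at the column),
    never lower than `floor`.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    coeffs = []
    for col in range(length):
        gapped = sum(1 for row in alignment if row[col] in gap_chars)
        coeffs.append(max(floor, 1.0 - gapped / n_seqs))
    return coeffs


# Hypothetical four-sequence alignment; '.' marks a gap.
alignment = [
    "MKV..LSA",
    "MKVGALSA",
    "MRV..LTA",
    "MKI.ALSA",
]
coeffs = gap_coefficients(alignment)
print(coeffs)
```

Columns in which no sequence has a gap keep the full multiplier of 1.0, while the two gapped columns receive lower values, so a later alignment pays less for opening a gap there.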

Profiles, once generated, are provided as the input to ProfileSearch along with a sequence specification like SwissProt:* (the search set). ProfileSearch aligns each sequence in the search set to the profile and makes a list of the sequences with the best alignment scores.
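The search step can be sketched as scoring every sequence in the search set against the profile and sorting by score. The sketch below substitutes a simple Smith-Waterman-style local alignment with a linear, position-scaled gap cost; it is not ProfileSearch's actual algorithm. The profile layout (one dictionary of residue scores per position), the function names, and the three sequences are all hypothetical.

```python
def profile_score(profile, gap_coeffs, seq, gap_penalty=4.0, default=-1.0):
    """Best local alignment score of `seq` against `profile`.

    profile    : list of dicts, one per position, mapping residue -> score
    gap_coeffs : per-position multipliers applied to `gap_penalty`
    default    : score for residues a position does not list (assumption)
    """
    cols = len(seq)
    prev = [0.0] * (cols + 1)
    best = 0.0
    for i in range(1, len(profile) + 1):
        cur = [0.0] * (cols + 1)
        penalty = gap_penalty * gap_coeffs[i - 1]
        for j in range(1, cols + 1):
            match = prev[j - 1] + profile[i - 1].get(seq[j - 1], default)
            skip_profile = prev[j] - penalty   # profile position left unmatched
            skip_seq = cur[j - 1] - penalty    # sequence residue left unmatched
            cur[j] = max(0.0, match, skip_profile, skip_seq)
            best = max(best, cur[j])
        prev = cur
    return best


def rank_search_set(profile, gap_coeffs, search_set):
    """Return (score, name) pairs, best score first."""
    scored = [(profile_score(profile, gap_coeffs, seq), name)
              for name, seq in search_set.items()]
    return sorted(scored, reverse=True)


# Hypothetical profile: each position rewards one consensus residue.
profile = [{residue: 5.0} for residue in "MKVLSA"]
gap_coeffs = [1.0, 1.0, 1.0, 0.5, 1.0, 1.0]   # gaps are cheaper at position 4
search_set = {
    "full_match": "AAMKVLSAGG",
    "one_deletion": "MKVSA",
    "unrelated": "GGGGPPPP",
}
for score, name in rank_search_set(profile, gap_coeffs, search_set):
    print(f"{name:14s} {score:6.1f}")
```

The sequence that lacks one residue still scores well because the missing position falls where the gap coefficient is low, which is the behavior the position-specific coefficients are meant to produce.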

The list is a file of sequence names suitable for input to ProfileSegments, which makes and displays an optimal alignment of each sequence in the list to the profile consensus sequence. When you have identified a new sequence that belongs to the sequence family from which your profile was calculated, you can align it to the whole multiple sequence family with ProfileGap.

A sequence may be compared to a library of defined profiles, representing known sequence and structural features, with ProfileScan.

References

1. Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987). Profile Analysis: Detection of Distantly Related Proteins. Proceedings of the National Academy of Sciences USA 84; 4355-4358.

2. Gribskov, M., Homyak, M., Edenfield, J., and Eisenberg, D. (1988). Profile Scanning for Three-Dimensional Structural Patterns in Protein Sequences. Computer Applications in the Biosciences 4; 61-66.

3. Gribskov, M. and Eisenberg, D. (1989). Detection of Protein Structural Features With Profile Analysis. In Techniques in Protein Chemistry (pp. 108-117), Academic Press, San Diego, California, USA.

4. Gribskov, M., Luethy, R., and Eisenberg, D. (1989). Profile Analysis. In Methods in Enzymology 183 (pp. 146-159), Academic Press, San Diego, California, USA.

 

ProfileGap requires a profile as one of its input files. You can create profiles from aligned sequences by means of the ProfileMake program. In the ProfileDir directory, GCG provides a large number of amino acid profiles derived from the PROSITE database.

ProfileGap accepts as its other input one or more sequences of the same type as the sequences used to create the profile. You can specify multiple sequences by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileGap depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
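The type check described above can be sketched as follows. The sketch assumes that the last line of the text heading is the one ending in two periods (..); the function name sequence_type and the example entry, including its Check value, are placeholders invented for illustration.

```python
import re


def sequence_type(text):
    """Return 'N', 'P', or None for a sequence file's text.

    Assumption: the last heading line, just above the sequence,
    is the first line that ends with two periods ("..").
    """
    for line in text.splitlines():
        if line.rstrip().endswith(".."):
            match = re.search(r"Type:\s*([NP])\b", line)
            return match.group(1) if match else None
    return None


# Hypothetical entry; the Check value is a placeholder, not a real checksum.
example = """Hypothetical example entry used only for illustration.

  example.seq  Length: 12  Type: N  Check: 0000  ..

       1  ACGTACGTAC GT
"""
print(sequence_type(example))
```

A return value of None means that no dividing line or no type was found, which is the case in which the type would have to be set as described in Appendix VI.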

 

For this exercise we will use the nucleic acid “.msf” file that you created with PileUp and the profile created from this alignment using ProfileMake. Select the “.msf” file, move your cursor to Functions, and select ProfileGap from the Multiple Comparison Menu.

 

The main window for ProfileGap will appear. Click on the Profile button. A Choose File menu will appear. Select the profile that you created with ProfileMake (there should be only one choice).

Next, click on the Options button. Select “globally align…”, “don’t penalize gaps...”, and “Set thresholds.....”. In the Set thresholds box type “|” and then close the Options Menu. Select RUN in the main ProfileGap window.

 

Sample ProfileGap output file

 

OVERLAP and NoOVERLAP

Overlap compares two sets of DNA sequences to each other in both orientations using a WordSearch style comparison.

Overlap accepts two sets of sequences as input and uses the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) to compare each sequence of the first set with each sequence of the second set, in both orientations. In effect, Overlap runs WordSearch repeatedly, using each sequence of the first set as a query. Unlike WordSearch, Overlap looks for overlaps between sequences rather than simply regions of similarity. An overlap is a highly similar region between two sequences that runs the entire length of a register of comparison. Overlap lists the position, length, and stringency of discovered overlaps in an output file.

Overlap accepts two separate groups of multiple (one or more) nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

Overlap identifies sequence similarities using a Wilbur and Lipman-style word comparison (see the WordSearch entry in the Program Manual for information regarding the details of this algorithm and considerations about using this search). Overlap differs from WordSearch in that it accepts a set of query sequences as input and reports overlaps rather than regions of similarity.

Overlap removes gap characters (. and ~) from the input sequences before comparing them.

The output file lists the length, position, and percent similarity (ratio) of each overlap in descending order of sequence and overlap length. It also gives the orientation of each sequence.
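The comparison described above can be sketched in a few steps: remove gap characters, index the words of one sequence, count word matches on each register (diagonal) in both orientations, and then compare the two sequences along the whole length of the best register. This is a simplified illustration of a word-based comparison, not Overlap's implementation; the function names, the word size of 6, and the two sequences are assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def strip_gaps(seq):
    """Remove the gap characters '.' and '~'."""
    return seq.replace(".", "").replace("~", "")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def best_register(a, b, word_size):
    """Return (offset, word_hits) for the register sharing the most words."""
    index = {}
    for i in range(len(a) - word_size + 1):
        index.setdefault(a[i:i + word_size], []).append(i)
    hits = Counter()
    for j in range(len(b) - word_size + 1):
        for i in index.get(b[j:j + word_size], ()):
            hits[i - j] += 1          # same offset means same register
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


def overlap_on_register(a, b, offset):
    """Compare a and b over the entire length of one register."""
    start_a = max(0, offset)
    start_b = max(0, -offset)
    length = min(len(a) - start_a, len(b) - start_b)
    matches = sum(a[start_a + k] == b[start_b + k] for k in range(length))
    return start_a, start_b, length, matches / length


def find_overlap(a, b, word_size=6):
    """Return (orientation, start_a, start_b, length, ratio) tuples.

    Positions are 0-based; for orientation '-', start_b refers to the
    reverse complement of b.
    """
    a = strip_gaps(a).upper()
    b = strip_gaps(b).upper()
    results = []
    for orientation, candidate in (("+", b), ("-", reverse_complement(b))):
        register = best_register(a, candidate, word_size)
        if register is not None:
            results.append((orientation,)
                           + overlap_on_register(a, candidate, register[0]))
    return results


# Hypothetical sequences: the end of a matches the reverse complement of b.
a = "TTGACCATGC..AAGGCTTACG"
b = "~~GGTTCCTAGACGTAAGCCTT"
print(find_overlap(a, b))
```

In this example only the reverse orientation shares any words with the first sequence, so a single overlap is reported, together with its orientation.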

Sample Overlap output file

 

NoOverlap identifies the places where a group of nucleotide sequences do not share any common subsequences.

This program determines whether there are regions where a group of nucleotide sequences do not share any common subsequences. Witkiewicz, Bolander, and Edwards assert that hybridization probes specific enough to detect individual members of a gene family can be prepared if a region 100 bases or longer can be found that does not have a perfect match of nine or more bases with any other member of the family (BioTechniques 14(3); 458-463). NoOverlap is designed to find out whether such regions occur in a group of sequences.

To use NoOverlap, you name a group of related sequences in which you want to find regions that do not share any 9-mer with any other sequence in the group. The resulting output is a list of the sequences that have such regions and the coordinates of the regions where no common 9-mers occur.
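The search for such regions can be sketched as follows. Every base that takes part in a word shared with another sequence is masked, and the unmasked runs that are long enough are reported. This is a conservative simplification written for illustration, not NoOverlap's implementation; it compares the given strand only, and the function name unique_regions and the three short sequences are hypothetical. The defaults follow the 9-base word and 100-base region length quoted above.

```python
def unique_regions(sequences, word_size=9, min_length=100):
    """Map each sequence name to its regions free of shared words.

    sequences : dict of name -> nucleotide sequence
    Returns 1-based, inclusive (start, end) coordinates. Sequences
    with no qualifying region are left out of the result.
    """
    results = {}
    for name, seq in sequences.items():
        # Collect every word that occurs in any other sequence.
        other_words = set()
        for other_name, other in sequences.items():
            if other_name != name:
                other_words.update(
                    other[i:i + word_size]
                    for i in range(len(other) - word_size + 1))
        # Mask every base covered by a shared word.
        covered = [False] * len(seq)
        for i in range(len(seq) - word_size + 1):
            if seq[i:i + word_size] in other_words:
                for k in range(i, i + word_size):
                    covered[k] = True
        # Report unmasked runs; the trailing True closes the last run.
        regions = []
        run_start = None
        for pos, masked in enumerate(covered + [True]):
            if not masked and run_start is None:
                run_start = pos
            elif masked and run_start is not None:
                if pos - run_start >= min_length:
                    regions.append((run_start + 1, pos))
                run_start = None
        if regions:
            results[name] = regions
    return results


# Small hypothetical example with a 4-base word and 6-base minimum length.
family = {
    "s1": "GGGGGGGGACGTCCCCCCCC",
    "s2": "TTTTTTTTACGTAAAAAAAA",
    "s3": "ACGTACGT",
}
print(unique_regions(family, word_size=4, min_length=6))
```

In this small example the third sequence is absent from the result because every one of its bases lies in a word shared with another family member.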

NoOverlap accepts multiple (two or more) nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

NoOverlap makes an output file with a list of all the non-overlapping regions in every sequence that meet your requirements for word size and length.

Sample NoOverlap output file