The Needleman-Wunsch algorithm is the one of the most popular algorithms for global sequence alignment. Its variant, Smith-Waterman algorithm is used for local alignment.
Needleman-Wunsch
Creating the Matrix and Utility Methods
The first step of the Needleman-Wunsch algorithm is to set up the alignment matrix. We add one extra row and column in case the alignment starts with a gap.
1 | m = len(seq1) |
We also define a utility method for calculating the score based on the given scoring matrix.
1 | BLOSUM62 = { |
Populate the First Row and Column of the Matrix
We first populate the first row and column with gap penalty scores.
1 | for i in range(len(seq1) + 1): |
Filling in the Rest of the Matrix
We fill in the rest of the matrix with scores calculated using this formula:
$$
M(i,j)=\max\begin{cases}
M(i-1,j-1)+s(i,j) \
M(i-1,j)+g \
M(i,j-1)+g
\end{cases}
$$
$s(i,j)$ is the score of $i$ aligning with $j$. $M(i,j)$ is the value at location $i,j$ in the Needleman-Wunsch matrix. $g$ is the gap penalty (assuming linear penalty).
The diagonal path from $M(i-1,j-1)$ to $M(i,j)$ indicates a match. A horizontal path or vertical path indicates an indel (insertion or deletion, gap).
1 | for i in range(1, len(seq2) + 1): |
The formula used to calculate each individual score ensures that the assigned score is always produced from the highest scoring path. By the principles of dynamic programming, we can combine the optimal solutions of subproblems (local highest-scoring paths) to find the solution to the overall problem (highest-scoring global alignment).
Tracing Back
Since we know that by combining optimal subproblem solutions, we can get the optimal overall solution, we can find the best global alignment by tracing back along the highest-scoring path from the bottom-right corner to the origin.
1 | while i > 0 and j > 0: |
In case of leading gaps, we prepend the gaps.
1 | while i > 0: |
Since we get the result by tracing backwards, the result is reversed. To get the actual result, we simply invert the string.
1 | aligned1, aligned2 = aligned1[::-1], aligned2[::-1] |
Time Complexity Analysis
The big-O time complexity of the Needleman-Wunsch algorithm is $O(mn)$ because ultimately we have to go through every element in the matrix. However, by doing some optimization and preprocessing, we can improve the runtime complexity and optimize the constant terms and factors.
Smith-Waterman
Smith-Waterman is a modified version of Needleman-Wunsch algorithm used for local alignment problems. Rather than attempting to align the whole sequence, local alignment algorithms align the most similar (conserved) regions/segments within the sequences to be aligned. Smith-Waterman’s main difference from Needleman-Wunsch is that it uses a slightly different formula to fill in the matrix.
$$
M(i,j)=\max\begin{cases}
M(i-1,j-1)+s(i,j) \
M(i-1,j)+g \
M(i,j-1)+g \
0
\end{cases}
$$
Basically, whenever the score falls below zero, the score is set equal to zero. By doing so, the high-scoring segments can easily stand out from the rest of the matrix.
For the traceback step, we starts from the element with the highest score and stops when we hit zero.
1 | # locate the max |
EMBOSS - Needle, Water, Stretcher, and Matcher
EMBOSS (European Molecular Biology Open Software Suite) is a set of bioinformatics tools. Its online version is hosted by EMBL-EBI on https://www.ebi.ac.uk/Tools/emboss/
EMBOSS includes Needle and Water, two alignment tools developed using the Needleman-Wunsch and Smith-Waterman, respectively.
To use Needle or Water only, simply go to the website and paste sequence. We can choose the scoring matrix used to align two sequences as well as gap penalties. They use an affine penalty for gap extension, which is also the more widely used gap penalty function. The affine gap penalty increases by a certain amount when a gap is extended. It is implemented by keeping track of three set of scores, as opposed to one set for linear penalty. The score is assigned using the following set of formulas:
$$
M(i,j)=\max\begin{cases}
M(i-1,j-1)+s(i,j) \
I_x(i-1,j-1)+s(i,j) \
I_y(i-1,j-1)+s(i,j)
\end{cases}
$$
$$
I_x(i,j)=\max\begin{cases}
M(i-1,j)+h+g \
I_x(i-1,j)+g
\end{cases}
$$
$$
I_y(i,j)=\max\begin{cases}
M(i,j-1)+h+g \
I_y(i,j-1)+g
\end{cases}
$$
where $h$ is the gap opening penalty and $g$ is the gap extension penalty.
EMBOSS also has a command-line version. The mandatory qualifiers are asequence
, bsequence
, gapopen
, gapextend
, and outfile
.
1 | needle -asequence=SEQUENCE1 -bsequence=SEQUENCE2 -gapopen=GAP_OPEN_PENALTY -gapextend=GAP_EXTENSION_PENALTY -outfile=OUTPUT |
For example, we use the following command to align two sequences named sarsspike.fa
and sars2spike.fa
with a gap open penalty of 10, gap extension penalty of 0.5, and print the result to stdout
.
1 | needle -asequence=sarsspike.fa -bsequence=sars2spike.fa -gapopen=10 -gapextend=0.5 -outfile=stdout |
We can also specify the scoring matrix using the qualifier -datafile
. The built-in data files are located at .../EMBOSS/data
.
Stretcher is a rapid global alignment tool built using optimized Needleman-Wunsch algorithm, and Matcher is a rapid local alignment tool built using optimized Smith-Waterman.
Reference
Pevsner, J. (2016). Bioinformatics and Functional Genomics. Chichester, West Sussex: Wiley Blackwell.
Zvelebil, M. J., & Baum, J. O. (2008). Understanding Bioinformatics. New York: Garland Science/Taylor & Francis Group.
Christopher Burge, David Gifford, and Ernest Fraenkel. 7.91J Foundations of Computational and Systems Biology. Spring 2014. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.