MUSCLE

  • How to download the tool or source code including installation and usage instructions as well as any source code that might be associated with the executable. This should also include a listing of any dependencies for this tool or script.
    Download the MUSCLE binary archive for your platform from http://www.drive5.com/muscle/downloads.htm into a directory.
    Extract the archive with: tar -zxvf <filename>
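
The extraction step can also be scripted. The following is a minimal sketch in Python, assuming the downloaded file is a gzip-compressed tarball obtained from a trusted source; the function name `extract_archive` is hypothetical and not part of MUSCLE.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (similar to tar -zxf).

    Returns the sorted member names. Only use this on archives from a
    source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = sorted(tar.getnames())
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

Calling `extract_archive("<filename>", "muscle_dir")` would unpack the downloaded archive into a directory named `muscle_dir` (an illustrative name).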
  • Required version of the program necessary to perform the desired task
    muscle3.8.31
  • Sample dataset and expected results to be output
    Excerpt from sample input file ago.fa (FASTA format sequence file, either protein or nucleic acid)
    >AGO1 | Arabidopsis thaliana | protein sequence | AGO1 Group (Dicots and Monocots)
    MVRKRRTDAPSEGGEGSGSREAGPVSGGGRGSQRGGFQQGGGQHQGGRGYTPQPQQGGRG
    GRGYGQPPQQQQQYGGPQEYQGRGRGGPPHQGGRGGYGGGRGGGPSSGPPQRQSVPELHQ
    ATSPTYQAVSSQPTLSEVSPTQVPEPTVLAQQFEQLSVEQGAPSQAIQPIPSSSKAFKFP
    MRPGKGQSGKRCIVKANHFFAELPDKDLHHYDVTITPEVTSRGVNRAVMKQLVDNYRDSH
    LGSRLPAYDGRKSLYTAGPLPFNSKEFRINLLDEEVGAGGQRREREFKVVIKLVARADLH
    HLGMFLEGKQSDAPQEALQVLDIVLRELPTSRYIPVGRSFYSPDIGKKQSLGDGLESWRG
    FYQSIRPTQMGLSLNIDMSSTAFIEANPVIQFVCDLLNRDISSRPLSDADRVKIKKALRG
    VKVEVTHRGNMRRKYRISGLTAVATRELTFPVDERNTQKSVVEYFHETYGFRIQHTQLPC
    LQVGNSNRPNYLPMEVCKIVEGQRYSKRLNERQITALLKVTCQRPIDREKDILQTVQLND
    YAKDNYAQEFGIKISTSLASVEARILPPPWLKYHESGREGTCLPQVGQWNMMNKKMINGG
    TVNNWICINFSRQVQDNLARTFCQELAQMCYVSGMAFNPEPVLPPVSARPEQVEKVLKTR
    YHDATSKLSQGKEIDLLIVILPDNNGSLYGDLKRICETELGIVSQCCLTKHVFKMSKQYM
    ANVALKINVKVGGRNTVLVDALSRRIPLVSDRPTIIFGADVTHPHPGEDSSPSIAAVVAS
    QDWPEITKYAGLVCAQAHRQELIQDLFKEWKDPQKGVVTGGMIKELLIAFRRSTGHKPLR
    IIFYRDGVSEGQFYQVLLYELDAIRKACASLEAGYQPPVTFVVVQKRHHTRLFAQNHNDR
    HSVDRSGNILPGTVVDSKICHPTEFDFYLCSHAGIQGTSRPAHYHVLWDENNFTADGLQS
    LTNNLCYTYARCTRSVSIVPPAYYAHLAAFRARFYMEPETSDSGSMASGSMARGGGMAGR
    STRGPNVNAAVRPLPALKENVKRVMFYC
    >AGO704 | Oryza sativa ssp. japonica | protein sequence | AGO1 Group (Dicots and Monocots)
    MEGGGGRGGYRGDGDGGYGRGGGGYHGDGERGYGRGGGGGGGGGGGYRGDDEGRSSYGRA
    RGGGGGGGGYHGDGEAGYGRGRGGRDYDGGRGGGGRRGGRGGGGSSYHQQPPPDLPQAPE
    PRLAAQYAREIDIAALRAQFKGLTTTTPGAASSQFPARPGFGAAGEECLVKVNHFFVGLK
    NDNFHHYDVAIAPDPVLKGLFRTIISKLVTERRHTDFGGRLPVYDGRANLYTAGELPFRS
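
To make the input layout concrete, here is a minimal sketch of a FASTA reader in Python. It assumes the conventional layout shown in the excerpt (a header line starting with `>` followed by one or more sequence lines); the function name `parse_fasta` is hypothetical and this is not how MUSCLE itself reads its input.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence.

    Header lines start with '>'; the sequence lines that follow are
    concatenated until the next header. Blank lines and surrounding
    whitespace are ignored. Input order is preserved.
    """
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = []
        elif header is None:
            raise ValueError("sequence data before first header")
        else:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Reading `ago.fa` into a string and passing it to `parse_fasta` would give one entry per sequence, keyed by the full header text.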
    
    Excerpt from sample output file ago_aligned.fa (aligned FASTA format) generated using muscle -in ago.fa -out ago_aligned.fa
    >AGO704 | Oryza sativa ssp. japonica | protein sequence | AGO1 Group (Dicots and Monocots)
    ----MEGGGGRGGYRGDGDGGYGRGGGGYHGDGERGYGRGGGGGGGGGGGYRG------D
    DEGRSSYGRARGGGGGGGG------------------------YHGDGEAGY--------
    --------------G---RGRGGRDYDGGRG---GGGRRGGRGGGGSSYHQQ--PPPDLP
    QAPEPRLAAQYA----------------REIDIAALRAQFKGLTTTTPGAAS--------
    ------SQFPARPGFGAAGEECLVKVNHFFVGL---KNDNFHHYDVAIAPDPVLKGLFRT
    IISKLVTERRHTDFGGRLPVYDGRANLYTAGELPFRSRELEVEL--------------SG
    SRKFKVAIRHVAPVSLQDLRMVMAGCPAGIPSQALQLLDIVLRDMVLAERNDMGYVAFGR
    SYFSPGLGSRE-LDKGIFAWKGFYQSCRVTQQGLSLNIDMSSTAFIEPGRVLNFVEKAIG
    RRITNAITV-GYFLNNYGNELMRTLKGVKVEVTHRGNLRKKYRIAGFTEQSADVQTFTSS
    DG--IKTVKEYFNKKYNLKLAFGYLPCLQVGSKERPNYLPMELCNIVPGQRYKNRLSPTQ
    VSNLINITNDRPCDRESSIRQTVSSNQYNSTERADEFGIEVDSYPTTLKARVLKAPMLKY
    HDSGRVRVCTPEDGAWNMKDKKVVNGATIKSWACVNLCEGLDNRVVEAFCLQLVRTSKIT
    GLDFA-NVSLPILKADPHNVKTDLPMRYQEACSWSRDNK---ID-LLLVVMTDDKNNASL
    YGDVKRICETEIGVLSQCCRAKQVYKERNVQYCANVALKINAKAGGRNSVFLN-VEASLP
    VVSKSPTIIFGADVTHPGSFDESTPSIASVVASADWPEVTKYNSVVRMQASRKEIIQDL-
    ------------DSIVRELLNAFKRDSKMEPKQLIFYRDGVSEGQFQQVVESEIPEIEKA
    WKSLYAG-KPRITFIVVQKRHHTRLFPNNYNDPRGMDGTGNVRPGTVVDTVICHPREFDF
    FLCSQAGIKGTSRPSHYHVLRDDNNFTADQLQSVTNNLCYLYTSCTRSVSIPPPVYYAHK
    LAFRARFYLTQVPVAGG----------------------DPGAAKFQWVLPEIKEEVKKS
    MFFC
    >AGO716 | Oryza sativa ssp. japonica | protein sequence | AGO1 Group (Dicots and Monocots)
    MESQRMT-------------------------------------------WLY------D
    RHHSLKHNKAER------------------------------------------------
    ------------------------------------------------------------
    QAILSTYRLAKR------------------------------------------------
    ---------PNLSSEGMIGESCIVRTNCFSVHLESLDDQTIYEYDVCVTPEV---GINRA
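
One way to compare a run against the expected output is a basic sanity check on the alignment. The sketch below assumes that every row of an aligned FASTA file has the same number of columns and that removing the `-` gap characters from a row gives back the corresponding input sequence unchanged; the function name `check_alignment` is hypothetical.

```python
def check_alignment(aligned, unaligned=None):
    """Return the column count of an alignment after basic sanity checks.

    aligned:   dict of sequence name -> aligned sequence (gaps as '-').
    unaligned: optional dict of the original sequences; when given, each
               aligned row must equal its input once gaps are removed.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    if unaligned is not None:
        for name, seq in aligned.items():
            if seq.replace("-", "") != unaligned.get(name):
                raise ValueError(f"row {name!r} does not match its input")
    return lengths.pop()
```

The check says nothing about alignment quality; it only catches truncated or mismatched output files.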
    
  • Set of parameters and command line switches that match the expected execution of the tool including the possible command line definitions according to the occurrence of optional parameters. Also, validation instructions for parameters are requested.
    There are two types of command-line options: value options and flag options. A value option is followed by the value of the given parameter, for example -in <filename>; a flag option stands for itself, such as -msf. Every option is a single dash (not two dashes!) followed by a long name; there are no single-letter equivalents. A value option must be separated from its value by white space on the command line. MUSCLE therefore does not follow Unix, Linux or POSIX conventions, for which we apologize. The order in which options are given is irrelevant unless two options contradict each other, in which case the right-most option silently wins.
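
The option syntax described above can be sketched as a small helper that assembles an argument list: one dash per option, and each value passed as its own white-space-separated argument. This is an illustrative Python sketch; the names `build_muscle_command` and `_check_name` are hypothetical, and the helper does not check whether an option actually exists in MUSCLE.

```python
def build_muscle_command(values=None, flags=(), executable="muscle"):
    """Build an argument list using MUSCLE 3.8-style option syntax.

    values: dict of value-option name -> value, e.g. {"in": "ago.fa"}.
    flags:  iterable of flag-option names, e.g. ["msf"].
    Names are given without a leading dash; exactly one dash is added.
    """
    command = [executable]
    for name, value in (values or {}).items():
        _check_name(name)
        # A value option and its value are two separate arguments.
        command += [f"-{name}", str(value)]
    for name in flags:
        _check_name(name)
        command.append(f"-{name}")
    return command


def _check_name(name):
    """Reject empty names, names with a dash prefix, or embedded spaces."""
    if not name or name.startswith("-") or any(c.isspace() for c in name):
        raise ValueError(f"invalid option name: {name!r}")
```

For example, `build_muscle_command({"in": "ago.fa", "out": "ago_aligned.fa"})` produces the argument list for the simplest invocation, which could then be handed to `subprocess.run` if the `muscle` executable is on the PATH.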

| Parameter | Description | Required | Default | Value type | Validation rules / allowed values |
|---|---|---|---|---|---|
| anchorspacing | Minimum spacing between anchor columns; used for tree-dependent refinement | Y | 32 | integer | >=0 |
| center | Center parameter; used when specifying a protein substitution matrix | Y | [1] | floating point | <=0 |
| cluster1, cluster2 | Clustering method; cluster1 is used in iterations 1 and 2, cluster2 in later iterations | Y | upgmb | text | upgma, upgmb, neighborjoining |
| clwout | Write output in CLUSTALW format to given file name | N | none | file name | |
| diagbreak | Maximum distance between two diagonals that allows them to merge into one diagonal | Y | 1 | integer | >=1 |
| diaglength | Maximum length of diagonal | Y | 24 | integer | >=1 |
| diagmargin | Discard this many positions at ends of diagonal | Y | 5 | integer | >=0 |
| distance1 | Distance measure for iteration 1 | Y | kmer6_6 (amino), kmer4_6 (nucleo) | text | kmer6_6, kmer20_3, kmer20_4, kbit20_3, kmer4_6 |
| distance2 | Distance measure for iterations 2, 3, ... | Y | pctid_kimura | text | pctid_kimura, pctid_log |
| fastaout | Write output in FASTA format to given file | N | none | file name | |
| gapopen | The gap open score | Y | [1] | floating point | must be negative |
| hydro | Window size for determining whether a region is hydrophobic | Y | 5 | integer | >=1 |
| hydrofactor | Multiplier for gap open/close penalties in hydrophobic regions | Y | 1.2 | floating point | |
| in | Input file; file that contains the sequences to be aligned | Y | standard input | file name | |
| in1 | Input alignment; file that contains an alignment to be refined or appended to the sequences | N | none | file name | |
| in2 | Input alignment; file that contains an alignment to be refined or appended to the sequences | N | none | file name | |
| log | Log file name (delete existing file) | N | none | file name | |
| loga | Log file name (append to existing file) | N | none | file name | |
| matrix | File name for substitution matrix in NCBI or WU-BLAST format. If specified, must also specify -gapopen <g>, -gapextend <e>, -center 0.0 (<g> and <e> must be negative) | N | none | file name | |
| maxhours | Maximum time to run in hours. The actual time may exceed the requested limit | N | none | floating point | Decimals are allowed, so 1.5 means one hour and 30 minutes |
| maxiters | Maximum number of iterations | Y | 16 | integer | >=1 |
| maxtrees | Maximum number of new trees to build in iteration 2 | Y | 1 | integer | >=1 |
| minbestcolscore | Minimum score a column must have to be an anchor | Y | [1] | floating point | |
| minsmoothscore | Minimum smoothed score a column must have to be an anchor | Y | [1] | floating point | |
| msfout | Write output to given file name in MSF format | N | none | file name | |
| objscore | Objective score used by tree-dependent refinement: sp = sum-of-pairs score; spf = sum-of-pairs score (dimer approximation); spm = sp for <100 seqs, otherwise spf; dp = dynamic programming score; ps = average profile-sequence score; xp = cross profile score | Y | spm | text | sp, ps, dp, xp, spf, spm |
| out | Where to write the alignment | Y | standard output | file name | |
| phyiout | Write output in Phylip interleaved format to given file name | N | none | file name | |
| physout | Write output in Phylip sequential format to given file name | N | none | file name | |
| refinewindow | Length of window for -refinew | Y | 200 | integer | |
| root1, root2 | Method used to root tree; root1 is used in iterations 1 and 2, root2 in later iterations | Y | pseudo | text | pseudo, midlongestspan, minavgleafdist |
| scorefile | File name where to write a score file. This contains one line for each column in the alignment. The line contains the letters in the column followed by the average BLOSUM62 score over pairs of letters in the column | N | none | file name | |
| seqtype | Sequence type found in input file | Y | auto | text | protein, nucleo, auto |
| smoothscoreceil | Maximum value of column score for smoothing purposes | Y | [1] | floating point | |
| smoothwindow | Window used for anchor column smoothing | Y | 7 | integer | |
| spscore | Compute SP objective score of multiple alignment | N | none | file name | |
| sueff | Constant used in UPGMB clustering. Determines the relative fraction of average linkage (SUEFF) vs. nearest-neighbor linkage (1-SUEFF) | Y | 0.1 | floating point | 0<x<1 |
| tree1, tree2 | Save tree produced in first or second iteration to given file in Newick (Phylip-compatible) format | N | none | file name | |
| usetree | Use given tree as guide tree. Must be in Newick (Phylip-compatible) format | N | none | file name | |
| weight1, weight2 | Sequence weighting scheme. weight1 is used in iterations 1 and 2; weight2 is used for tree-dependent refinement. none = all sequences have equal weight; henikoff = Henikoff & Henikoff weighting scheme; henikoffpb = modified Henikoff scheme as used in PSI-BLAST; clustalw = CLUSTALW method; threeway = Gotoh three-way method | Y | clustalw | text | none, henikoff, henikoffpb, gsc, clustalw, threeway |

[1] Default depends on the profile scoring function. To determine the default, use -verbose -log <filename> and check the log file.
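
The validation rules in the value-option table can be checked before MUSCLE is invoked. The sketch below transcribes a subset of those rules into Python; the names `RULES`, `CHOICES` and `validate_options` are hypothetical, options not listed in the sketch are simply not checked, and MUSCLE's own validation may differ.

```python
# Numeric rules transcribed from a subset of the value-option table.
# Each entry: option name -> (type, predicate, rule as written in the table).
RULES = {
    "anchorspacing": (int, lambda v: v >= 0, ">=0"),
    "diagbreak": (int, lambda v: v >= 1, ">=1"),
    "diaglength": (int, lambda v: v >= 1, ">=1"),
    "diagmargin": (int, lambda v: v >= 0, ">=0"),
    "hydro": (int, lambda v: v >= 1, ">=1"),
    "maxiters": (int, lambda v: v >= 1, ">=1"),
    "maxtrees": (int, lambda v: v >= 1, ">=1"),
    "center": (float, lambda v: v <= 0, "<=0"),
    "gapopen": (float, lambda v: v < 0, "must be negative"),
}

# Text options restricted to a fixed set of values.
CHOICES = {
    "cluster1": {"upgma", "upgmb", "neighborjoining"},
    "cluster2": {"upgma", "upgmb", "neighborjoining"},
    "seqtype": {"protein", "nucleo", "auto"},
}


def validate_options(values):
    """Return a list of error messages for a dict of option -> value.

    An empty list means every checked option passed. Options without a
    rule in this sketch are ignored.
    """
    errors = []
    for name, raw in values.items():
        if name in CHOICES:
            if raw not in CHOICES[name]:
                allowed = ", ".join(sorted(CHOICES[name]))
                errors.append(f"-{name}: {raw!r} is not one of {allowed}")
        elif name in RULES:
            kind, is_valid, rule = RULES[name]
            try:
                value = kind(str(raw))
            except ValueError:
                errors.append(f"-{name}: {raw!r} is not a valid {kind.__name__}")
                continue
            if not is_valid(value):
                errors.append(f"-{name}: {raw!r} violates rule {rule}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid parameter at once.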

Flags

| Flag option | Set by default? | Description |
|---|---|---|
| anchors | Y | Use anchor optimization in tree-dependent refinement iterations |
| brenner | N | Use Steven Brenner's method for computing the root alignment |
| cluster | N | Perform fast clustering of input sequences. Use the -tree1 option to save the tree |
| dimer | N | Use dimer approximation for the SP score (faster, slightly less accurate) |
| clw | N | Write output in CLUSTALW format (default is FASTA) |
| clwstrict | N | Write output in CLUSTALW format with the "CLUSTAL W (1.81)" header rather than the MUSCLE version |
| core | Y | Do not catch exceptions |
| diags | N | Use diagonal optimizations. Faster, especially for closely related sequences, but may be less accurate |
| diags1 | N | Use diagonal optimizations in first iteration |
| diags2 | N | Use diagonal optimizations in second iteration |
| fasta | Y | Write output in FASTA format |
| group | Y | Group similar sequences together in output |
| html | N | Write output in HTML format (default is FASTA) |
| le | ? | Use log-expectation profile score (VTML240). Alternatives are to use -sp or -sv. This is the default for amino acid sequences |
| msf | N | Write output in MSF format. Designed to be compatible with the GCG package |
| noanchors | N | Disable anchor optimization. Default is -anchors |
| nocore | N | Catch exceptions and give an error message if possible |
| phyi | N | Write output in Phylip interleaved format |
| phys | N | Write output in Phylip sequential format |
| profile | N | Compute profile-profile alignment. Input alignments must be given using -in1 and -in2 options |
| quiet | N | Do not display progress messages |
| refine | N | Input file is already aligned; skip first two iterations and begin tree-dependent refinement |
| refinew | N | Refine an alignment by dividing it into non-overlapping windows and re-aligning each window. Typically used for whole-genome nucleotide alignments |
| sp | N | Use sum-of-pairs protein profile score (PAM200). Default is -le |
| spscore | N | Compute alignment score of profile-profile alignment. Input alignments must be given using -in1 and -in2 options. These must be pre-aligned with gapped columns as needed, i.e. must be of the same length (have the same number of columns) |
| spn | ? | Use sum-of-pairs nucleotide profile score. This is the only option for nucleotides, and is therefore the default. The substitution scores and gap penalty scores are "borrowed" from BLASTZ |
| stable | N | Preserve input order of sequences in output file. Default is to group sequences by similarity (-group). WARNING: THIS OPTION WAS BUGGY AND IS NOT SUPPORTED IN v3.8 |
| sv | N | Use sum-of-pairs profile score (VTML240). Default is -le |
| termgaps4 | Y | Use 4-way test for treatment of terminal gaps. (Cannot be disabled in this version) |
| termgapsfull | N | Terminal gaps penalized with full penalty. Not fully supported in this version |
| termgapshalf | Y | Terminal gaps penalized with half penalty. Not fully supported in this version |
| termgapshalflonger | N | Terminal gaps penalized with half penalty if gap relative to longer sequence, otherwise with full penalty. Not fully supported in this version |
| verbose | N | Write parameter settings and progress messages to log file |
| version | N | Write version string to stdout and exit |

  • Example invocation of the command line application and its associated parameters such that it can perform an analysis
    Simplest case:
    muscle -in ago.fa -out ago_aligned.fa
    
    Refine an alignment
    muscle -in seqs.afa -out refined.afa -refine
    
    Using a pre-computed guide tree
    muscle -in seqs.fa -out seqs.afa -usetree mytree.phy
    
    Output alignment to multiple file formats
    muscle -in seqs.fa -fastaout seqs.afa -clwout seqs.aln
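
The multiple-format invocation above generalizes to any combination of the per-format output options. The following Python sketch maps format names to the value options listed in the options table (fastaout, clwout, phyiout, physout); the format names used as dictionary keys and the function name `multi_output_command` are hypothetical.

```python
# Format name (chosen for this sketch) -> MUSCLE value option.
FORMAT_OPTIONS = {
    "fasta": "fastaout",
    "clustalw": "clwout",
    "phylip_interleaved": "phyiout",
    "phylip_sequential": "physout",
}


def multi_output_command(input_path, outputs, executable="muscle"):
    """Build a command that writes one alignment in several formats.

    outputs: dict of format name (a key of FORMAT_OPTIONS) -> output file.
    """
    if not outputs:
        raise ValueError("at least one output file is required")
    command = [executable, "-in", input_path]
    for fmt, path in outputs.items():
        if fmt not in FORMAT_OPTIONS:
            raise ValueError(f"unknown format: {fmt!r}")
        command += [f"-{FORMAT_OPTIONS[fmt]}", path]
    return command
```

With `{"fasta": "seqs.afa", "clustalw": "seqs.aln"}` as `outputs`, the result matches the example command line; it only builds the argument list and does not run MUSCLE.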
    
  • Reference
    Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.
    doi:10.1093/nar/gkh340