GA-MSA
GA-MSA is an algorithm that tries to solve Multiple Sequence Alignment problem using Genetic Algorithm. This project was developed for my Bioinformatics class at GIK.
# Approach
- Once a sequence is provided, an Genetic Algorithm > Initial Population is selected by randomly selecting different Pairwise Alignments of each sequence.
- The below steps are executed until we find a good enough solution or the maximum number of generations/iterations have been reached.
- Add required gaps so all the sequences in an Genetic Algorithm > Organism have the same length.
- Calculate the Genetic Algorithm > Fitness of each Genetic Algorithm > Organism using sum of Pairwise Alignment.
- Calculate the Genetic Algorithm > Fitness value of the whole Genetic Algorithm > Population by summing the fitness values of all the organisms in the population.
- If the current population’s fitness value is better than previous best known value, then update the best value.
- Check if the termination criteria is met.
- Apply Crossover.
- Apply Mutation.
- Return the best known Genetic Algorithm > Organism.
# Crossover
- Select organisms from a Discrete Probability Distribution using Vose Alias method, where probabilities were calculated using their fitness score. Organisms with higher fitness score have a higher probability of being selected.
- About 50% of organisms are copied as it is, without any crossover, to be included in the next population.
- 50% of organisms are selected as parents, with a certain probability of applying either of the following:
- Horizontal Recombination Probability of selection = 30%
- Vertical Recombination Probability of selection = 20%
- Randomly select either of the two parents. Probability of selection = 50%
# Horizontal Recombination
Randomly select each sequence from one of the parents. e.g.
|
|
# Vertical Recombination
Randomly define a cut point in the sequence and build the offspring by copying the sequence from position 1 up to the cut point from one parent and from the cut point to the end of the sequence from the other parent.
Keep track of the gaps before split point to maintain the integrity of the sequence. If split point isn’t valid, select either of the parents as the offspring.
# Mutation
Following four operators are randomly selected for mutation:
- Gap Open: A position in the sequence is randomly selected and a block of gaps of variable size (1 - 20% of the max sequence length) is inserted into the sequence.
- Gap Extension: Select a random block of gaps and extend it by one.
- Gap Reduction: Select a random block of gaps and reduce it by one. Can act like a Gap Close operator in case of gap block of size 1.
- None: No mutation is performed.