C version, 150 times faster!

I used to use Perl to implement the algorithm that Alan and I are
working with. The performance was OK: for 5300 scaffolds it takes
70-80 minutes to complete the full comparison. I would not have thought
of rewriting it in C if Leon had not asked me to give a performance
seminar in the near future.

For this seminar, I would like to have a real-world, CPU-intensive
program to work with the SHARK tool, and this gene duplication tool would
be perfect for it. It took me two days to rewrite it in C (isn't it
amazing, given that I have not touched C for almost 3 years?).


Sample Result of My Algorithm

Just for fun, I am saving a sample of the output here. It looks good to me.

------------------------------------------------------------------ Final Result
SCAFFOLD_1009  : bkmf**le*
                    ====
                           ===
SCAFFOLD_381   : *od*ig*hl*fcnji*a

LABLE '*' : family unknown
LABEL 'a' : family ID=1405
LABEL 'b' : family ID=5571
LABEL 'c' : family ID=1406
LABEL 'd' : family ID=3831
LABEL 'e' : family ID=5627
LABEL 'f' : family ID=4632
LABEL 'g' : family ID=3287
LABEL 'h' : family ID=5710
LABEL 'i' : family ID=4567
LABEL 'j' : family ID=881
LABEL 'k' : family ID=1917
LABEL 'l' : family ID=23
LABEL 'm' : family ID=140
LABEL 'n' : family ID=3553
LABEL 'o' : family ID=581

------------------------------------------------------------------ Final Result
SCAFFOLD_1009  : ahle**id*
                    ====
                      ==
SCAFFOLD_40    : jg*ei*kfcb

LABLE '*' : family unknown
LABEL 'a' : family ID=5571
LABEL 'b' : family ID=4622
LABEL 'c' : family ID=4633
LABEL 'd' : family ID=5627
LABEL 'e' : family ID=4632
LABEL 'f' : family ID=4003
LABEL 'g' : family ID=2716
LABEL 'h' : family ID=1917
LABEL 'i' : family ID=23
LABEL 'j' : family ID=4634
LABEL 'k' : family ID=4624
LABEL 'l' : family ID=140

------------------------------------------------------------------ Final Result
SCAFFOLD_1012  : c*bd*a*
                    ===
                   ===
SCAFFOLD_2455  : *adde*

LABLE '*' : family unknown
LABEL 'a' : family ID=4336
LABEL 'b' : family ID=23
LABEL 'c' : family ID=3233
LABEL 'd' : family ID=3457
LABEL 'e' : family ID=137

------------------------------------------------------------------ Final Result
SCAFFOLD_10161 : cb
                 ==
                 ===
SCAFFOLD_7699  : cab

LABLE '*' : family unknown
LABEL 'a' : family ID=2976
LABEL 'b' : family ID=3569
LABEL 'c' : family ID=3750


Algorithm design

Alan assigned an algorithm design task to me. The algorithm is used
to detect gene duplication between two different sequences. The
matching criteria are:

  • At least two pairs of genes are from the same family
  • We allow a configurable number of mismatches between any two consecutive matching genes
  • The distance between any two matching genes must be limited
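To make the three criteria concrete, here is a minimal sketch in C of a validity check for one candidate chain of matched genes. It is not the actual implementation, and every name in it (match_t, chain_is_valid, max_gap, max_span, UNKNOWN_FAMILY) is hypothetical. It assumes that each gene is represented by its family ID with 0 meaning "family unknown", that the "configurable amount of mismatch" is a count of genes lying between two consecutive matches, and that the distance limit applies to the whole span of the chain in each sequence. Gaps are taken as absolute differences so that the matched genes may appear in either order in the second sequence, which is also an assumption.

```c
#include <stdlib.h>

#define UNKNOWN_FAMILY 0  /* hypothetical marker for a gene with no known family */

/* One matched gene pair: an index into sequence A and an index into sequence B. */
typedef struct {
    int pos_a;
    int pos_b;
} match_t;

/*
 * Returns 1 if the chain satisfies all three criteria, 0 otherwise.
 *   fam_a, fam_b : family ID of each gene in sequences A and B
 *   chain, n     : candidate matches, in the order they are chained
 *   max_gap      : max genes allowed between two consecutive matches
 *   max_span     : max distance between any two matched genes in a sequence
 */
int chain_is_valid(const int *fam_a, const int *fam_b,
                   const match_t *chain, int n,
                   int max_gap, int max_span)
{
    /* Criterion 1: at least two pairs of genes from the same family. */
    if (n < 2)
        return 0;

    int min_a = chain[0].pos_a, max_a = chain[0].pos_a;
    int min_b = chain[0].pos_b, max_b = chain[0].pos_b;

    for (int i = 0; i < n; i++) {
        int fa = fam_a[chain[i].pos_a];
        int fb = fam_b[chain[i].pos_b];

        /* Each pair must share a known family. */
        if (fa == UNKNOWN_FAMILY || fa != fb)
            return 0;

        if (i > 0) {
            /* Criterion 2: bounded mismatch between consecutive matches. */
            int gap_a = abs(chain[i].pos_a - chain[i - 1].pos_a) - 1;
            int gap_b = abs(chain[i].pos_b - chain[i - 1].pos_b) - 1;
            if (gap_a < 0 || gap_b < 0)          /* same gene used twice */
                return 0;
            if (gap_a > max_gap || gap_b > max_gap)
                return 0;
        }

        /* Track the extent of the chain in both sequences. */
        if (chain[i].pos_a < min_a) min_a = chain[i].pos_a;
        if (chain[i].pos_a > max_a) max_a = chain[i].pos_a;
        if (chain[i].pos_b < min_b) min_b = chain[i].pos_b;
        if (chain[i].pos_b > max_b) max_b = chain[i].pos_b;
    }

    /* Criterion 3: the distance between any two matched genes is limited. */
    if (max_a - min_a > max_span || max_b - min_b > max_span)
        return 0;

    return 1;
}
```

A full detector would still need a search step that proposes candidate chains; this check only decides whether a given chain is acceptable under the chosen limits.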
