Perl: Search a pattern across array elements

Question:

I am a Perl newbie, stuck with another bioinformatics problem that requires some help and input.

<strong>The problem in brief:</strong>

<ol><li>

I have a file that contains over 40,000 <em>unique</em> DNA sequences. By unique, I mean each has a unique sequence id. I am attaching a portion of it at the end of my post to show you what it looks like.

</li> <li>

I need to divide <em>each</em> of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts would have 333 characters.

</li> <li>

I need to look for the following pattern through each of the 3 individual parts:

    $gpat = [G]{3,5};
    $npat = [A-Z]{1,25};
    $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

</li> <li>

If $pattern appears in the first of the 3 parts, increase the counter of 'beginning', if $pattern occurs in the 2nd of the 3 parts, increase counter of 'middle' and lastly if the $pattern appears in the 3rd part, increase counter of 'end'.

</li> <li>

Print the counters of 'beginning', 'middle' and 'end', i.e. the sum of 'beginning', 'middle' and 'end' over all of the sequences.

Say that in the 1st sequence the values are '2','5','3' respectively and in the 2nd sequence they are '4','1','6'; the final count should then be '6,6,9'.

</li> </ol>

<strong>The issues I am having:</strong>

<ol><li>If a particular sequence is split into 3 parts, a potential $pattern match can be lost. For example, take a sequence like:</li> </ol>

gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta

A split into 3 parts produces the following 3 sub-parts, each 35 characters long:

    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta

Hence, <em>$pattern gets split across the first 2 parts</em>. Is there any way to say "If $pattern begins in the 1st part and ends in the 2nd part, increase the count of 'beginning'"?

<strong>##UPDATE##</strong> The following issue has been resolved thanks to the code suggested by Cupidvogel

<blockquote>

2. How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2 characters short.

</blockquote>

The following is the code I have written so far.

It reads in the file and displays the header name and sequence, the length into which each sequence will be divided, and finally the sequence split into 3 parts. This works fine provided the sequence length is divisible by 3; for those that aren't, the final 3rd part is 1-2 characters short.

    #Take Filename from user
    print "Please enter file name : ";
    $in =<>;
    chomp $in;
    open (FASTA,"$in") or die ;
    while (<FASTA>) {
        $/=">";
        @array = split '\n', $_;
        $header=shift @array;   # Header of the fasta sequence
        print "\n\nNext sequence: \n";
        print $header,"\n";
        $seq= join '', @array;  # sequence
        $seq=~s/\s//g;
        $seq=~s/\*//g;
        $seq=~s/>//g;
        print $seq,"\n\n";
        $num = int(length($seq)/3);
        @arr = unpack("A$num A$num A*",$seq);
        print " New method gives this :", @arr;
        print "\nThe first element is :", $arr[0];
        print "\nThe second element is :",$arr[1];
        print "\nThe third element is :",$arr[2] ;

        #The following lines of code were originally written to split...
        #...the sequence into 3 parts, albeit unsuccessfully
        #my $split = (length $seq)/3;
        #print $split,"\n\n";
        #my $int = int $split;
        #print $int,"\n\n";
        #my @array2 = $seq =~ /(.{$int})/g;
        #print join (" ", @array2),"\n\n";
        #print $array2[0],"\n",$array2[1],"\n",$array2[2];
    }
    exit;

I have been trying the code I have written so far with the following sample file, sample.fa:

    >ABC_123 2
    atgtcgatcgatcggcgggcatgcgcgcgcggatg
    atatatagcgcgcgctatatagcgcgactctacgc
    atgctgctgactagctatagtcgctgactgcgcgt
    gggaaaaagggcccgggccccgttttggggatcta
    ggggatagctgatgctagcatgcatgctgactgca
    >DEF_456 4
    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta
    >GHI_789 1
    atagctgctagtcgatcggcgcgggtatcgatcgg
    ggatcgatcgatcggggatcgatcgggggatcgat

The actual input file looks like the following:

>NR_037701 1 aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa agctgtagttatggctggtggagttcagttagtcagcatctggtggagct gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg gagggaggagtacagacatggaattttaattctgtaatccagggcttcag ttatgtacaacatccatgccatttgatgattccaccactccttttccatc tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt ccattatcagtccctgcaattctatttttcttccttctctacacagcccc tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca gccctatgtggattagcaagttaagtaatgacactcagagacagttccat ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg gctccctcttttaaagattttccttccctctttccaactccctgggtcct ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca 
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca cacagactcaaaccctctctcacacacatacacatatacattgttattcc acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg agggttgggacttcaacacagctttttgggggatcataattcaacccatg acagccactgagattattatatctccagagaataaatgtgtggagttaaa aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag ggaggggattgaactagacacagacacatgagcaggactttggggagtgt gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc actgttattagatattgtatgtctttgtgtccttttattcatgaattctt gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg gccaatcggagattcgtttcttatctataatagacatctgagcccctggc ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaa >NM_198399 1 aacagattttaactctgaaaagccatttccagtgtctatagactattgtg agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg caatagcagaggaaggagggactgagcaggagacggccactccagagaac ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag 
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc actgcatatttttacccttatttttgctccttacagcaagattagtaggt tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa attgattaagatatcattatttttgtttggtttggttttgcttttttcct cttactttaattgaaatactctgaattcccctcatggaaacagagagcat tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg tcctaagagagtgtttttttttctagcatcattttctttacatgccactc atgtcataaggcatggacaggctatctttcagtggccattactatgtttc gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg atataaacttatcctgtaccaatgtatttattaacacttgtattttatta ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa 
aaaaaaaaaaaaa >NR_026816 1 caacccactctctgtgctatgacttcattactctttcccagcccagccct gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa ccagagccaaggctacagctagagagttgactcctctatttgagattgac aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa >NR_027917 1 atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa ccattattttgcttccagtatgttgccgacaatggaggcctggactctga ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca ggtgtag >NR_002777 3 cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg agcctaatagagaacataaaattctaaaagataaagataataataatgat aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga 
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa cagccacctagccatttcccattaaatataatcccatcagcagcagacaa tatctatcctcccctatcccctctatccatatttggaaactgcaccctct tccctatttagcaccctaacaccacttgaattccataaccctgttgttga tctagctctcctcacctctaaacacttctagcattcctttcagatcagga gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa actcataggacataaaaaaaaaaaaaa >NR_033769 1 ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc agctcagccagccacagaactggaatttttcaggagcagggggagcatgg agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg tggagctggtgcctccagagagccctttgatccagctcttcttggagaga gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc aagttaggtttatagagtttgactagttttttcgattagatttgtattag ttataaatttgttcatagagtttgactaattttttcgattagatttgtat ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga 
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa taacagctattgttttgcatatccactgcaggccaagcactttcagcatc atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa aaagaaaagaaaaacagagacggtc >NM_016326 3 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt tttgtaattttatattactttttagtttgatactaagtattaaacatatt tctgtattcttccacatattttctgcagttattttaactcagtataggag ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga atctctttactgcctggctggccggcagctccg >NM_181641 2 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa aaaagaagttttgtaattttatattactttttagtttgatactaagtatt aaacatatttctgtattcttccacatattttctgcagttattttaactca gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg gatctttgaatctctttactgcctggctggccggcagctccg >NM_001144931 1 gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg 
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact ttgggaggccgag >NR_029429 1 ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta cttcacctgccagccaacccgccagtattgtggagagctcatccttggag gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag gccactggcttgtgctctgagggttgccaggccattgtggataccgagac cttcctgc >NR_026551 1 tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct 
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct gagcagagccctcactcccaggcagagttgtctgaatccttcct >NM_181640 2 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg taattttatattactttttagtttgatactaagtattaaacatatttctg tattcttccacatattttctgcagttattttaactcagtataggagctag aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct ctttactgcctggctggccggcagctccg >NM_016951 3 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag acttgatcgattaatgaagtggttattttggcctttgcttgatattatca actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa gaagttttgtaattttatattactttttagtttgatactaagtattaaac atatttctgtattcttccacatattttctgcagttattttaactcagtat aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc tttgaatctctttactgcctggctggccggcagctccg >NR_002773 1 cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct 
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc tttctgacccagcagctggggccagggctggtggatgcagcccaggccca gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg ctgcagccctggctcacttggacagggggagccccccacctgcccgggag gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc caagagtacctggacatagaccagatgatcttcgacagagagctgcccca ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg cttgggccacttctccacgcccctgacccatggggtggactgcccctacc tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct gcggcgacaccactcagatctctactcccactactttgggggccttgcgg aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg cactccatcctgcgtgactgaac >NR_037806 1 attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg 
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac cctccagtgttggcacaggcccacccctggctccaccagagccagaagca gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag gattaaaatcacactaataacccctggatggtcaatctgataataggatc agatttacgtctaccctaattcttaacattgcagctttctctccatctgc agattattcccagtctcccagtaacacgtttctacccagatcctttttca tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg ggattagctctg

Any help and input would be deeply appreciated.

Thank you for taking the time to go through my problem!

Answer1:

Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the complete sequence and determine in which third the pattern starts.

The built-in variable $-[0] contains the offset of the start of the most recent successful match.
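As a small illustration of how $-[0] behaves inside a global-match loop, here is a minimal sketch. The toy sequence and the simplified pattern /g{3}/ are assumptions made for the example and are not taken from the question's data.

```perl
use strict;
use warnings;

# Hypothetical toy sequence, chosen only to make the offsets easy to check.
my $sequence = 'atgggcatgggta';

my @starts;
while ($sequence =~ /g{3}/g) {
    # $-[0] is the offset where the whole match starts;
    # $+[0] is the offset just past the end of the match.
    push @starts, $-[0];
    my $matched = substr $sequence, $-[0], $+[0] - $-[0];
    print "'$matched' starts at offset $-[0] and ends before offset $+[0]\n";
}

print "Start offsets: @starts\n";
```

Offsets are zero-based, so the two runs of ggg in this toy string are reported at offsets 2 and 8.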

The code below does what I think you want. It works by accumulating each sequence (which ends either when a new sequence ID is found or the end of file is reached) and passing it to the process_seq subroutine.

The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiomatic sprintf '%.0f', $value is used to round fractional values to the nearest character position.
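To see what the rounding does for lengths that are and are not divisible by 3, here is a short sketch. The helper name third_offsets is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: end offsets of the three thirds of a string of $length characters.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

my @even   = third_offsets(105);    # length divisible by 3
my @uneven = third_offsets(100);    # length not divisible by 3

print "105 characters: @even\n";
print "100 characters: @uneven\n";
```

For a length of 100 the offsets come out as 33, 67 and 100, so the thirds hold 33, 34 and 33 characters. The fractional part of length * k / 3 is always 0, 1/3 or 2/3, so the rounding never has to break an exact .5 tie.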

The @counts array is adjusted for each occurrence of $regex in the sequence. The element of @counts to be incremented is established by comparing the starting position of the match in $-[0] with the end offset of each of the three segments of the sequence.
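The comparison can be tried in isolation on toy data. The sketch below follows the same loop as the full program, wrapped in a hypothetical count_by_third function. The simplified pattern /g{3}/i and the 12-character sequences are assumptions chosen so the offsets are easy to follow.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of $regex by the third in which each match starts.
sub count_by_third {
    my ($sequence, $regex) = @_;
    my $length  = length $sequence;
    my @offsets = map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
    my @counts  = (0, 0, 0);

    while ($sequence =~ /$regex/g) {
        my $place = $-[0];                     # start offset of this match
        for my $i (0 .. 2) {
            next if $place >= $offsets[$i];    # match starts after this third ends
            $counts[$i]++;
            last;
        }
    }
    return @counts;
}

# Toy 12-character sequences: the thirds end at offsets 4, 8 and 12.
my $toy = qr/g{3}/i;    # simplified stand-in for the real pattern

my @spread   = count_by_third('gggaaaaagggc', $toy);    # matches start at 0 and 8
my @straddle = count_by_third('aagggaaaaaaa', $toy);    # one match covering offsets 2..4

print "@spread\n";
print "@straddle\n";
```

The second call covers the case raised in the question: the match begins in the first third and runs into the second, and because only the start offset is compared it is counted once, under 'beginning'.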

Once each sequence has been processed the values in @counts are accumulated into @totals to give overall figures for all sequences.
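The accumulation step on its own, using the per-sequence figures given as an example in the question, might look like this minimal sketch. The @per_sequence structure is an assumption for the illustration; the full program adds each sequence's counts as soon as they are computed.

```perl
use strict;
use warnings;

my @totals = (0, 0, 0);

# Per-sequence counts (beginning, middle, end) from the example in the question.
my @per_sequence = ([2, 5, 3], [4, 1, 6]);

for my $counts (@per_sequence) {
    # Add this sequence's three counters to the running totals.
    $totals[$_] += $counts->[$_] for 0 .. 2;
}

print "Total: @totals\n";
```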

The output of the program when using your sample data is shown. The grand total is (9, 1, 6).

    use strict;
    use warnings;

    my $gpat = '[G]{3,5}';
    my $npat = '[A-Z]{1,25}';
    my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
    my $regex = qr/$pattern/i;

    open my $fh, '<', 'sequences.txt' or die $!;

    my ($id, $seq);
    my @totals = (0, 0, 0);

    while (<$fh>) {
      chomp;
      if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
      }
      elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
      }
    }

    print "Total: @totals\n";

    sub process_seq {
      my $sequence = shift;
      my $length = length $sequence;
      my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
      my @counts = (0, 0, 0);

      while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0..2) {
          next if $place >= $offsets[$i];
          $counts[$i]++;
          last;
        }
      }

      print "@counts\n\n";
      $totals[$_] += $counts[$_] for 0..2;
    }

<strong>output</strong>

    NR_037701
    0 0 1

    NM_198399
    1 0 0

    NR_026816
    1 0 1

    NR_027917
    0 0 0

    NR_002777
    0 0 0

    NR_033769
    1 0 0

    NM_016326
    1 0 1

    NM_181641
    1 0 1

    NM_001144931
    0 0 0

    NR_029429
    0 1 0

    NR_026551
    1 0 0

    NM_181640
    1 0 1

    NM_016951
    1 0 1

    NR_002773
    1 0 0

    NR_037806
    0 0 0

    Total: 9 1 6

Answer2:

I lifted Borodin's process_seq function but used Bio::SeqIO to read in the file sequence by sequence. This has advantages over reading the file manually line by line and writing the logic to work out where each sequence begins and ends. I believe those advantages are:

<ul><li>Code that has been developed and tested by many others</li> <li>Whenever possible, if output is done via the Bio::SeqIO module, the result file can then be read using Bio::SeqIO read (next_seq) method.</li> <li>Other reasons I can't think of now :-)</li> </ul>

I imagine the BioPerl package of bioinformatics modules must be overwhelming to a biologist who is just beginning to program, and they might not be willing to dig out the information needed to start building a program. The <a href="http://www.bioperl.org/wiki/Main_Page" rel="nofollow">BioPerl wiki</a> is a good starting place, especially the Howto section, which includes a howto for beginners among others. You'll find code examples which are mostly(?) helpful. <a href="http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Seq.pm" rel="nofollow">Bio::Seq</a> has some good code examples at the beginning and is where most of the general sequence functions are. For input/output, the <a href="http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/SeqIO.pm" rel="nofollow">Bio::SeqIO</a> module is used, and it has examples at the beginning of its manual.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::SeqIO;

    my $gpat = '[G]{3,5}';
    my $npat = '[A-Z]{1,25}';
    my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
    my $regex = qr/$pattern/i;

    my $in = Bio::SeqIO->new ( -file   => "fasta_dat.txt",
                               -format => 'fasta');
    my @totals;

    while ( my $seq = $in->next_seq() ) {
        process($seq);
    }

    print "Totals: ";
    print "@totals\n";

    sub process {
        my $seq = shift;
        my @offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
        my $sequence = $seq->seq;
        my @count = (0,0,0);

        while ($sequence =~ /$regex/g) {
            my $place = $-[0];
            for my $i (0 .. 2) {
                next if $place >= $offset[$i];
                $count[$i]++;
                last;
            }
        }

        print $seq->id, "\n@count\n";
        $totals[$_] += $count[$_] for 0 .. $#count;
    }
