首页 文章

找到DNA序列中所有重复的4聚体 - Perl

提问于
浏览
4

你好,

我尝试编写一个程序,读取包含多个DNA序列的FASTA格式文件,识别序列中所有重复的4聚体(即,多次出现的所有4聚体),并打印出重复的4聚体以及查找它的序列的 Headers . k聚体仅仅是k个核苷酸的序列(例如,“aaca”,“gacg”和“tttt”是4聚体) .

这是我的代码:

use strict;
use warnings;

my $count = -1;
my $file = "sequences.fa";
my $seq = '';
my @header = ();
my @sequences = ();
my $line = '';
open (READ, $file) || die "Cannot open $file: $!.\n";

while ($line = <READ>){
    chomp $line;
    if ($line =~ /^>/){
        push @header, $line;
        $count++;
        unless ($seq eq ''){
            push @sequences, $seq;
            $seq = '';
        }
    } else {
        $seq .= $line;
    }
}   push @sequences, $line;

for (my $i = 0; $i <= $#sequences+1; $i++){
    if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){
        print $header[$i], "\n", $&, "\n";
    }
}

我有两个请求:首先,我不知道如何设计我的正则表达式模式以获得所需的输出 . 第二,不太重要的是,我确信我的代码效率非常低,所以如果有办法缩短代码,请告诉我 .

提前致谢!

以下是FASTA文件的示例:(请注意,序列之间有一条额外的行,原始的fasta文件不是这种情况)

NC_001422.1肠杆菌噬菌体意义上的PhiX174拉托,完整基因组GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTttttttCGGATATTTCTGATGAGTCGAAAAAT CCCTTACTTGAGGATAtatataAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCT> NC_001501.1肠杆菌噬菌体phiX184意义上拉托,完整基因组AACGGCTGGTCAGTATTTAAGGTTAGTGCTGAGGTTGACTACATCTGTTTTTAGAGACCCAGACCTTTTA TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTgagagagaGGTTTTCTTCATTGCATTCAGATGGA TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG> NC_001622.5肠杆菌噬菌体phiX199意义上拉托,完整基因组TTCGCTGAATCAGGTTATTAAAGAGTTGCCGAGATATTTATGTTGGTTTCATGCGGATTGGTCGTTTAAA TTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAATGACCAAATCAAAGAACTCGTGATTAT CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG TTGACGCCGGATTTGAGAATCAAAAATGTGAGAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGA GATGCAA AATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC TTAGATGAGTGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAG

1 回答

  • 5

    我可能更喜欢解决你的问题:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    use Data::Dumper;
    
    #set paragraph mode. Iterate on blank lines. 
    local $/ = ''; 
    
    #read from STDIN or a file specified on command line, 
    #e.g. cat filename_here | myscript.pl
    #or myscript.pl filename_here
    while ( <> ) {
       #capture the header line, and then remove it from our data block
       my ($header) = m/\>(.*)/;
       s/>.*$//;
    
       #remove linefeeds and whitespace. 
       s/\s*\n\s*//g;
       #use lookahead pattern, so the data isn't 'consumed' by the regex. 
       my @sequences = m/(?=([atcg]{4}))/gi;
    
       #increment a count for each sequence found. 
       my %count_of;
       $count_of{$_}++ for @sequences;
    
       #print output. (Modify according to specific needs. 
       print $header,"\n";
    
       print "Found sequences:\n";
       print Dumper \@sequences;
       print "Count:\n";
       print Dumper \%count_of;
    
       #note - ordered, but includes duplicates. 
       #you could just use keys  %count_of, but that would be unordered. 
       foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) {
          print $sequence, " => ", $count_of{$sequence},"\n";
       }
       print "\n";
    }
    

    我们按记录迭代记录,捕获并删除“ Headers ”行,然后将其余部分拼接在一起 . 然后捕获4的每个(重叠)序列,并对它们进行计数 .

    这样,对于您的样本数据(简洁的第一节):

    NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome 
    Found sequences:
        GAGT => 2
        AGTT => 2
        TTAT => 2
        CATG => 2
        ATGA => 3
        TGAC => 2
        CGCA => 2
        AGTT => 2
        ACTT => 2
        tttt => 3
        tttt => 3
        tttt => 3
        GGAT => 2
        GATA => 2
        ATAT => 2
        TATT => 2
        ATGA => 3
        TGAG => 2
        GAGT => 2
        AAAA => 2
        AAAA => 2
        ACTT => 2
        TGAG => 2
        GGAT => 2
        GATA => 2
        tata => 2
        tata => 2
        TTAT => 2
        TATG => 2
        ATAT => 2
        TATT => 2
        GCCG => 2
        TATG => 2
        GCCG => 2
        CGCA => 2
        CATG => 2
        ATGA => 3
        TGAC => 2
    

    注意 - 因为它基于原始序列,它基于数据中的排序,你会看到TGAC两次因为......它在那里两次 .

    但是你可以改为:

    foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} }
                              grep { $count_of{$_} > 1 } 
                                     keys %count_of ) {
          print $sequence, " => ", $count_of{$sequence},"\n";
       }
       print "\n";
    

    哪个将丢弃任何少于2个匹配,并按频率排序 .

相关问题