Concatenating multiple sequences per sample in a FASTA file using Perl and regex

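I have more than 200 multi-sequence FASTA files, and each file contains sequences from hundreds of samples for one selected gene (PF3D7_1467550 in the sample input file). Most samples in a FASTA file (e.g. sample 303.1, the first sequence in the sample input file) have a single sequence, but other samples (e.g. IGS-MLW-089sA and IGS-MWI-254sA) have multiple sequences for the gene that need to be concatenated together.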

Sample input FASTA file:

>303.1_assembled_PF3D7_1475500.[1:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
KNVQ

>IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ

>IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MWI-254sA_assembled_PF3D7_1475500.[65:119].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC

Desired output:

>303.1_assembled_PF3D7_1475500.[1:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
KNVQ

>IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61][65:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ

>IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61][65:119].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC

I believe the code from another question may be useful.

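use strict;
use warnings;

my %hash;
while (<DATA>) {
    if (/^>(miRNA\d+)/) {
        $hash{$1}[0] = $_;              # header line, newline included
        chomp(my $n = <DATA>);          # the sequence on the next line
        unshift @{ $hash{$1}[1] }, $n;
    }
}

for my $k (sort keys %hash) {
    print $hash{$k}[0], join(',', @{ $hash{$k}[1] }), "\n";
}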

Here is the link to the previous question:

I need search a pattern in a header line of my file and concatenates the next line with Perl

I am looking for help modifying the following part of the code so that it picks out the sample ID, or for alternative suggestions.

/^>(miRNA\d+)/

Thanks.
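
As a rough illustration of how the pattern `/^>(miRNA\d+)/` might be adapted to these headers, the sketch below captures the sample ID, the gene ID, and the range. The helper name `parse_header` is hypothetical, and the pattern assumes every header has the form `>SAMPLE_assembled_GENE.[start:end].sp.tr`, as in the sample input; headers that deviate from that would need a different pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a header into sample ID, gene ID, and range.
# Assumes headers look like ">SAMPLE_assembled_GENE.[start:end].sp.tr".
sub parse_header {
    my ($line) = @_;
    return unless $line =~ /^>(.+?)_assembled_([^.]+)\.\[(\d+):(\d+)\]/;
    return { sample => $1, gene => $2, start => $3, end => $4 };
}

# Two headers taken from the sample input.
for my $h ('>303.1_assembled_PF3D7_1475500.[1:126].sp.tr',
           '>IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr') {
    my $p = parse_header($h) or next;    # skip lines that are not headers
    print "$p->{sample}\t$p->{gene}\t$p->{start}-$p->{end}\n";
}
```

The lazy `(.+?)` stops at the first `_assembled_`, so sample IDs containing dots or hyphens (such as `303.1` or `IGS-MLW-089sA`) are captured whole.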

2 Answers

  • 3

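    If the records to be concatenated for a sample are adjacent, you can simply collect the ranges (e.g. [1:61]) and the lines to be printed in two arrays.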

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    sub without_ranges {
        my ($header) = @_;
        ( my $without = $header ) =~ s/\[[^\]]+\]//g;
        return $without
    }
    
    sub output {
        my ($header, $ranges, $buffer) = @_;
        my $header_with_ranges = $header;
        $header_with_ranges =~ s/(.*\])/$1\[$_]/ for @$ranges;
        print $header_with_ranges, @$buffer;
    }
    
    
    my (@buffer, @ranges);
    my $header = "";
    
    while (<>) {
        if (/^>/) {
            my $new_header = $_;
            if (without_ranges($new_header) eq without_ranges($header)) {
                push @ranges, $new_header =~ /\[([^\]]+)\]/;
    
            } else {
                output($header, \@ranges, \@buffer) if $header;
                $header = $new_header;
                @buffer = @ranges = ();
            }
            last if eof;
    
        } else {
            push @buffer, $_;
        }
    }
    output($header, \@ranges, \@buffer);
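
    The approach above relies on the fragments of a sample being adjacent. If that cannot be guaranteed, one possible variation is to group records in a hash keyed by the header without its range and to sort each sample's fragments by start position. This is only a sketch: `merge_records` is a hypothetical helper, and it assumes every header contains exactly one `[start:end]` range.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: merge FASTA records that differ only in their
    # [start:end] range. Takes a list of lines, returns the merged text.
    sub merge_records {
        my @lines = @_;
        my (%record, @order, $key);
        for my $line (@lines) {
            chomp(my $text = $line);
            if ($text =~ /^>(.*?)\[(\d+):(\d+)\](.*)$/) {
                $key = "$1$4";                        # header minus the range
                push @order, $key unless $record{$key};   # keep first-seen order
                $record{$key}{prefix} = $1;
                $record{$key}{suffix} = $4;
                push @{ $record{$key}{parts} },
                    { start => $2, end => $3, seq => [] };
            }
            elsif (defined $key and length $text) {
                # Sequence line: attach it to the most recent fragment.
                push @{ $record{$key}{parts}[-1]{seq} }, $text;
            }
        }
        my $out = '';
        for my $k (@order) {
            my @parts  = sort { $a->{start} <=> $b->{start} } @{ $record{$k}{parts} };
            my $ranges = join '', map { "[$_->{start}:$_->{end}]" } @parts;
            $out .= ">$record{$k}{prefix}$ranges$record{$k}{suffix}\n";
            $out .= "$_\n" for map { @{ $_->{seq} } } @parts;
        }
        return $out;
    }

    # Demo: the two IGS-MWI-254sA fragments from the question, in reverse order.
    my @demo = (
        ">IGS-MWI-254sA_assembled_PF3D7_1475500.[65:119].sp.tr\n",
        "TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC\n",
        ">IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61].sp.tr\n",
        "MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY\n",
    );
    print merge_records(@demo);
    ```

    To process a file instead of the demo data, the call could be replaced with `print merge_records(<>);` and the script run once per input file. Unlike the streaming version, this holds a whole file in memory and does not reproduce the blank lines between samples.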
    
  • -1

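    The code from the other question is not that useful and, frankly, somewhat suboptimal.
    Here is a possible solution, assuming you always have ranges of the form [x:y].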

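    use strict;
    use warnings;

    my (%hash, $key, $start);
    while (<DATA>) {
        chomp;
        # Header line: capture the name up to the range, and the range start.
        if (m{^(>.*?)(?:\[(\d+):(\d+)\]\.sp\.tr)?$}) {
            ($key, $start) = ($1, $2);
            next;
        }
        # Sequence line: append it to the fragment keyed by its start position.
        $hash{$key}{$start} .= $_;
    }

    # Join each sample's fragments in order of start position.
    for my $key (sort keys %hash) {
        my $keyref = $hash{$key};
        printf "%ssp.tr\n%s\n", $key,
            join('', map { $keyref->{$_} } sort { $a <=> $b } keys %$keyref);
    }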
    
    __DATA__
    >303.1_assembled_PF3D7_1475500.[1:126].sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
    EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
    KNVQ
    
    >IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61].sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
    >IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr
    TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
    
    >IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61].sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
    >IGS-MWI-254sA_assembled_PF3D7_1475500.[65:119].sp.tr
    TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
    

    >303.1_assembled_PF3D7_1475500.sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYEEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
    >IGS-MLW-089sA_assembled_PF3D7_1475500.sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
    >IGS-MWI-254sA_assembled_PF3D7_1475500.sp.tr
    MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
    
