How do I compare 2 lists of ranges in bash?

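Using a bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any range in file1 coincide with any number in any range in file2? If so, print that line from the second file. Here each range is given as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are very large.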

file1:

1 4
5 7 
8 11
12 15

file2:

3 4 
8 13 
20 24

Desired output:

3 4 
8 13

My best attempt has been:

awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next}; 
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then            
{print $1, $2}}}}}' file1 file2 > output.txt

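This returns an empty file.

I think the script needs if-then conditions for the range comparison and needs to iterate over every line in both files. I've found examples of each concept but can't figure out how to combine them.

Any help is appreciated!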

6 Answers

  • 0

    This one is for GNU awk, since I'm controlling the for scan order to optimize time:

    $ cat program.awk
    BEGIN {
        PROCINFO["sorted_in"]="@ind_num_desc"
    }
    NR==FNR {                                         # hash file1 to a
        if(!($2 in a) || $1<a[$2])                    # avoid collisions: keep the lowest start per end
            a[$2]=$1
        next
    }
    {
        for(i in a) {                                 # in desc order
            # print "DEBUG: For:",$0 ":", a[i], i     # remove # for debug
            if(i+0>=$1) {                             # range i ends at or after our start
                if(a[i]<=$2) {                        # and starts at or before our end
                    print
                    next
                }
            }
            else
                next
        }
    }
    

    Test data:

    $ cat file1
    0 3 # testing for completely overlapping ranges
    1 4
    5 7 
    8 11
    12 15
    $ cat file2
    1 2 # testing for completely overlapping ranges
    3 4 
    8 13 
    20 24
    

    Output:

    $ awk -f program.awk file1 file2
    1 2
    3 4 
    8 13
    

    $ awk -f program.awk file2 file1
    0 3
    1 4
    8 11
    12 15
    
  • 4

    If a Perl solution is preferred, the one-liner below will work:

    /tmp> cat marla1.txt
    1 4
    5 7
    8 11
    12 15
    /tmp> cat marla2.txt
    3 4
    8 13
    20 24
    /tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]<=$kv{$_} and $_<=$F[1]) { print "$_ $kv{$_}"; delete $kv{$_} }} } ' marla1.txt
    3 4
    8 13
    /tmp>
    
  • 2
    awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
    

    Broken down:

    FNR == 1 && NR == 1 { 
                      file=1 
                      }
    FNR == 1 && NR != 1 { 
                      file=2 
                      }
    file == 1 { 
               for (q=1;q<=NF;q++) { 
                          nums[$q]=$0
                    } 
              }
    file == 2 {
          for ( p=1;p<=NF;p++) {
             for (i in nums) {
                 if (i == $p) {
                          print $0
                 }
              }
          }
    }
    

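    Basically, we set file=1 while processing the first file and file=2 while processing the second. While in the first file, each line is stored in an array keyed by each of its fields. While in the second file, we go through the array (nums) and check whether it has an entry for each field on the line. If it does, the line is printed. Note that this matches only when a bound in file2 equals a bound in file1; ranges that overlap without sharing a bound are not reported.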
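
    A minimal sketch of the same two-file structure with a real overlap test instead of bound equality. The function name overlap_lines and the arrays lo and hi are made up for illustration, and whole-number bounds with the lower bound in column 1 are assumed:

    ```shell
    # Hypothetical helper: print each range of FILE2 that overlaps any range of FILE1.
    overlap_lines() {  # usage: overlap_lines FILE1 FILE2
        awk '
        NF < 2    { next }                                      # skip blank or short lines
        NR == FNR { lo[FNR] = $1 + 0; hi[FNR] = $2 + 0; next }  # remember FILE1 ranges
        {
            for (i in lo)
                if ($1 <= hi[i] && lo[i] <= $2) {   # the two ranges share a number
                    print $1, $2
                    next                            # report each FILE2 range once
                }
        }' "$1" "$2"
    }
    ```

    Two inclusive ranges share a number exactly when each one starts no later than the other ends, which is what the if condition checks. This still compares every file2 line against every file1 range, so it is quadratic in the worst case.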

  • 0

    An awk solution:

    awk 'NR==FNR{ a[$1]=$2; next }
         { for(i in a) 
               if ($1<=a[i] && i+0<=$2) {    # ranges overlap, including full containment
                   print i,a[i]; delete a[i];
               } 
         }' file2 file1
    

    Output:

    3 4
    8 13
    
  • 1

    Of course, this depends on how big your files are. If they are not large enough to exhaust memory, you can try this 100% bash solution:

    declare -a min=() # array of lower bounds of ranges
    declare -a max=() # array of upper bounds of ranges
    
    # read ranges in second file, store them in arrays min and max
    while read a b; do
        min+=( "$a" );
        max+=( "$b" );
    done < file2
    
    # read ranges in first file    
    while read a b; do
        # loop over indexes of min (and max) array
        for i in "${!min[@]}"; do
            if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
                echo "${min[i]} ${max[i]}" # print range
                unset 'min[i]' 'max[i]'    # performance optimization (quoted to avoid globbing)
            fi
        done
    done < file1
    

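    This is just a starting point. There are many possible performance and memory-footprint improvements, but they depend heavily on the size of the files and on the distribution of the ranges.

    EDIT 1: Improved the range-overlap test.

    EDIT 2: Reused the excellent optimization proposed by RomanPerekhrest (unsetting the ranges from file2 that have already been printed). Performance should be better when the probability of overlap is high.

    EDIT 3: Performance comparison with the awk version proposed by RomanPerekhrest (after fixing a small initial bug): awk is 10 to 20 times faster than bash on this problem. If performance matters and you are hesitating between awk and bash, prefer: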

    awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
        { for (i in a)
              if ($1 <= b[i] && a[i] <= $2) {
                  print a[i], b[i]; delete a[i]; delete b[i];
              } 
        }' file2 file1
    
  • 1

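    If the ranges are sorted by their lower bound, we can use that to make the algorithm more efficient. The idea is that if some intervals of one file lie entirely below the interval currently being examined from the other file, they also lie below every later interval of that other file, so they never need to be checked for intersection again.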

    #!/bin/bash
    
    exec 3< "$1"  # file whose ranges are checked for overlap with those ...
    exec 4< "$2"  # ... from this file, and if so, are written to stdout
    
    l4=-1  # lower bound of current range from file 2 
    u4=-1  # upper bound
    # initialized with -1 so the first range is read on the first iteration
    
    echo "Ranges in $1 that intersect any ranges in $2:"
    while read l3 u3; do
      if (( u4 >= l3 )); then
        (( l4 <= u3 )) && echo "$l3 $u3"
      else  # the upper bound from file 2 is below the lower bound from file 1
        while read l4 u4; do
          if (( u4 >= l3 )); then
            (( l4 <= u3 )) && echo "$l3 $u3"
            break
          fi
        done <&4
      fi
    done <&3
    

    The script can be invoked as ./script.sh file2 file1
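
    The same sweep can also be sketched in awk, which another answer here measured as considerably faster than bash for this problem. This is only a sketch under the same assumptions as the script above (both files sorted by lower bound, non-negative bounds); the function name sweep and the variable names are illustrative and not part of that script:

    ```shell
    # Hypothetical helper: print ranges of CHECKED that overlap any range of OTHER.
    # Assumes both files are sorted numerically by lower bound.
    sweep() {  # usage: sweep CHECKED OTHER
        awk -v other="$2" '
        BEGIN  { l = -1; u = -1 }   # no range read from OTHER yet
        NF < 2 { next }             # skip blank or short lines
        {
            # Skip ranges of OTHER that end before the current range starts;
            # they cannot overlap any later range of CHECKED either.
            while (u < $1 && (getline line < other) > 0) {
                split(line, f, " ")
                l = f[1] + 0; u = f[2] + 0
            }
            if (u >= $1 && l <= $2)   # current OTHER range overlaps this one
                print $1, $2
        }' "$1"
    }
    ```

    Calling sweep file2 file1 mirrors the invocation above. Each file is read only once, so the work is linear in the total number of lines; unsorted input would first have to be sorted, for example with sort -n.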