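How do I print lines that contain a specific number, but no other numbers that have not appeared before?

Asked: 2013-11-23 00:56:39

Tags: bash awk grep

I have a file containing many numbers, each written with 10 leading digits and temporarily wrapped with an "A" before and a "Z" after, so that a script cannot misidentify where a number begins and ends. E.g.: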

A00000000001Z
A00000000003Z,A00000000004Z;A00000000005Z
A00000000004Z A00000000005Zsome wordsA00000000001Z
A00000000006Z;A00000000005Z
A00000000001Z
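
To illustrate why the "A" and "Z" markers help, here is a minimal Python sketch that pulls the numbers out of one of the sample lines. The pattern and the name `extract_numbers` are assumptions made for this illustration; they are not part of my script.

```python
import re

# Assumed pattern: a literal "A", one or more digits, then a literal "Z".
ID_PATTERN = re.compile(r"A(\d+)Z")

def extract_numbers(line):
    """Return the digit strings found between the A/Z markers, in order."""
    return ID_PATTERN.findall(line)

# Third line of the sample file: separators and free text are mixed in.
line = "A00000000004Z A00000000005Zsome wordsA00000000001Z"

# Prints the three digit strings in the order they appear on the line.
print(extract_numbers(line))
```

Because each number is delimited on both sides, the surrounding commas, semicolons, spaces and words do not affect the match.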

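I need to search for a specific number, but output only those lines where the number is found and where no other number on the same line is appearing for the first time.

For example, if I search for "0000000001", it should print lines 1, 3 and 5: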

A00000000001Z
A00000000004Z A00000000005Zsome wordsA00000000001Z
A00000000001Z

It can print line 3 because the other numbers, "00000000004" and "00000000005", already appeared earlier, in line 2.

If I search for "00000000005", it should print line 3:

A00000000004Z A00000000005Zsome wordsA00000000001Z

It will not print line 2, because the other numbers on that line, "00000000003" and "00000000004", had never appeared before.

This is as far as I have got:

# search for the number and write the matching line plus all preceding lines to a temporary file
grep -B 10000000 0000000001 file.txt > output.temp

# send the last line to one file and the preceding lines to another
tail -n 1 output.temp > output.temp1
sed '$ d' output.temp > output.temp2

# pseudocode: search for the other numbers appearing in output.temp2
for i in 1 .. 1000000 NOT original number
     a=`printf %010d $i`
     if [ $a FOUND in output.temp2 ]
     then
          # check whether it was also found in the last line
          if [ $a NOT FOUND in output.temp1 ]
          else

          fi
     fi
done < ./file.txt

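How can I print only those lines that contain the specific number, while excluding lines with other numbers that have not appeared earlier in the file?

1 answer:

Answer 0: (score: 1)

Not strictly bash, but here is a Python 2 script that you can run from the shell: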

#!/usr/bin/env python

import re
import sys

def find_valid_ids(input_file, target_id):
    # IDs that have appeared on earlier lines
    found_ids = set()
    with open(input_file) as f:
        for line in f:
            ids = set(re.findall(r'A\d+Z', line))
            # Print the line if it contains the target and every other ID
            # on it was already seen on an earlier line.
            if target_id in ids and (ids - found_ids) <= {target_id}:
                print(line.rstrip('\n'))
            found_ids |= ids

if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit('Usage: ./find_valid_ids.py input_file target_id')
    find_valid_ids(sys.argv[1], sys.argv[2])

So, if you save the above as find_valid_ids.py, make it executable with $ chmod +x find_valid_ids.py and run it as $ ./find_valid_ids.py your_input_file.txt A00000000001Z
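
As a quick check against the sample data from the question, the rule can also be sketched as a function over a list of lines that returns line numbers instead of printing. This is only an illustration of the rule as I read it: the names `SAMPLE` and `matching_line_numbers` are made up for this example, and the target is passed in the same `A...Z` form as on the command line above.

```python
import re

# The five sample lines from the question.
SAMPLE = """\
A00000000001Z
A00000000003Z,A00000000004Z;A00000000005Z
A00000000004Z A00000000005Zsome wordsA00000000001Z
A00000000006Z;A00000000005Z
A00000000001Z
"""

def matching_line_numbers(lines, target_id):
    """Return the 1-based numbers of lines that contain target_id and whose
    other IDs have all appeared on an earlier line."""
    seen = set()      # IDs found on the lines processed so far
    result = []
    for number, line in enumerate(lines, start=1):
        ids = set(re.findall(r"A\d+Z", line))
        others = ids - {target_id}
        if target_id in ids and others <= seen:
            result.append(number)
        seen |= ids   # update only after the line has been tested
    return result

lines = SAMPLE.splitlines()
print(matching_line_numbers(lines, "A00000000001Z"))  # [1, 3, 5]
print(matching_line_numbers(lines, "A00000000005Z"))  # [3]
```

The set of seen IDs is updated only after each line has been tested, so the numbers on a line never count as "previously seen" for that same line.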