Question

I have a text file, IDs.txt, containing one unique ID string per line, e.g.:

foo
bar
someOtherID

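I know that some of these IDs are found in one or both of 2 other files, 1.txt and 2.txt, whose data lines are formatted differently:

1.txt

id=foo
name=example
age=81
end
id=notTheIDYouAreLookingFor
name=other
age=null

2.txt

<Data>
<ID>foo</ID>
<Stuff>Some things</Stuff>
</Data>
<Data>
<ID>bar</ID>
<Stuff>Other things</Stuff>
</Data>

The exact data formats are not important, because the question I need to answer is "which IDs are present in both files?". Ideally, I would like a format-agnostic solution.

In the example, I would want to find the lines with foo: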

foo <ID>foo</ID>

Effectively: this question, but for 2 files instead of 1. Grep both files for a larger list of IDs, then find the common matches.

Answer 1

Since you only want the IDs that appear in both files (f1 and f2), there is no need to parse IDs.txt:

awk 'NR==FNR{a["<ID>"$1"</ID>"]="id="$1;next}
    a[$0]{print $0,a[$0]}' <(grep -oP 'id=\K.*' f1) f2

The one-liner above outputs:

<ID>foo</ID> id=foo
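Note that `grep -P` is a GNU extension and is not available everywhere. As a rough sketch of the same join without it, assuming f1 and f2 hold the sample data from the question (the temporary directory and the f1.ids file name are only there for illustration):

```shell
#!/bin/sh
# Sketch: same join as above, without grep -P or process substitution.
tmp=$(mktemp -d)

# Sample data copied from the question (f1 = 1.txt, f2 = 2.txt).
printf '%s\n' 'id=foo' 'name=example' 'age=81' 'end' \
    'id=notTheIDYouAreLookingFor' 'name=other' 'age=null' > "$tmp/f1"
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
    '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > "$tmp/f2"

# Keep only the value part of lines that start with "id=".
sed -n 's/^id=//p' "$tmp/f1" > "$tmp/f1.ids"

# First input: remember each ID under its <ID>...</ID> spelling.
# Second input: print lines whose whole text is one of those spellings.
result=$(awk 'NR==FNR { a["<ID>" $0 "</ID>"] = "id=" $0; next }
              $0 in a { print $0, a[$0] }' "$tmp/f1.ids" "$tmp/f2")
echo "$result"
```

With the sample data this should print the same line as the one-liner. The `$0 in a` test is used instead of `a[$0]` so that looking up a line does not create an empty array entry for it.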

Answer 2

Here is one for GNU awk, far from perfect:

$ awk '
NR==FNR {                                      # store file1 entries to a[1]
    a[ARGIND][$0]
    next
}
match($0,/([iI][dD][>=])([^<]+)/,arr) {        # hash on whats =after or >between<
    a[ARGIND][arr[2]]=$0                       # store whole record. key on above
}
END {
    for(i in a[1])                             # get keywords from first file
        if((i in a[2]) && (i in a[3]))         # if found in files 2 and 3
            print a[2][i],a[3][i]              # output
}' file1 file2 file3

Output:

id=foo <ID>foo</ID>
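The script above relies on gawk-only features (ARGIND, arrays of arrays, and the three-argument form of match). If gawk is not available, roughly the same idea can be sketched in POSIX awk. The names nfile, want and rec are made up for this sketch, and the temporary files simply hold the sample data from the question:

```shell
#!/bin/sh
# Sketch: POSIX awk take on the same approach.
tmp=$(mktemp -d)
printf '%s\n' foo bar someOtherID > "$tmp/file1"
printf '%s\n' 'id=foo' 'name=example' 'age=81' 'end' \
    'id=notTheIDYouAreLookingFor' 'name=other' 'age=null' > "$tmp/file2"
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
    '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > "$tmp/file3"

result=$(awk '
FNR == 1   { nfile++ }                  # count input files (stands in for ARGIND)
nfile == 1 { want[$0]; next }           # first file: the IDs of interest
match($0, /[iI][dD][>=][^<]+/) {        # "id=" or "ID>" followed by the value
    id = substr($0, RSTART + 3, RLENGTH - 3)
    rec[nfile, id] = $0                 # keep the whole record, keyed by file and ID
}
END {
    for (id in want)                    # report IDs seen in both data files
        if (((2, id) in rec) && ((3, id) in rec))
            print rec[2, id], rec[3, id]
}' "$tmp/file1" "$tmp/file2" "$tmp/file3")
echo "$result"
```

One caveat with this sketch: an empty input file never triggers the FNR == 1 rule, so the file numbering would be off by one in that case.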

Answer 3

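I'm not an awk expert, so when a one-liner gets unwieldy, I tend to break things down into chunks.

I'm going to assume you have taken on board the earlier comments that a simple format-independent solution is unlikely to be feasible. Instead, I have taken the approach of recording the formats inside the script and normalizing the two input formats. If a third format turns up, you only need to modify the script to record and normalize that new format as well.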

$ cat << EOF > work.sh
#!/usr/bin/env bash

# 1.txt has IDs in the form id=....

grep -x 'id=.*' 1.txt | sed -e 's/^id=//' | sort > 1.txt.ids

# 2.txt has IDs in the form <ID>...</ID>

grep -x '^<ID>.*</ID>' 2.txt | sed -Ee 's-^<ID>(.*)</ID>-\1-' | sort > 2.txt.ids

comm -12 1.txt.ids 2.txt.ids  | grep -xf IDs.txt
EOF

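The first grep command extracts from 1.txt the lines that consist entirely of 'id=something', then strips the 'id=' and sorts the result into the file 1.txt.ids.

The second grep does something similar for the lines in 2.txt that consist entirely of '<ID>something</ID>', then strips the opening and closing ID tags and sorts the IDs into 2.txt.ids.

comm is then used to show only the lines that appear in both files, and the output of comm is further filtered by IDs.txt, the list of specific IDs you are interested in.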
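To see the last step in isolation, here is a small sketch run on the two ID lists that the sample data would produce. The temporary directory is an assumption for the demo, and the `-F` flag and `LC_ALL=C` are additions of mine, not part of work.sh: `-F` makes grep treat the IDs as literal strings, and `LC_ALL=C` keeps sort and comm agreeing on the ordering.

```shell
#!/bin/sh
# Sketch: the comm + grep step on already-extracted ID lists.
tmp=$(mktemp -d)

# comm expects both inputs sorted in the collation order it uses itself.
printf '%s\n' foo notTheIDYouAreLookingFor | LC_ALL=C sort > "$tmp/1.txt.ids"
printf '%s\n' foo bar | LC_ALL=C sort > "$tmp/2.txt.ids"
printf '%s\n' foo bar someOtherID > "$tmp/IDs.txt"

# -1 drops lines unique to the first file, -2 drops lines unique to the
# second, which leaves only the lines common to both.
common=$(LC_ALL=C comm -12 "$tmp/1.txt.ids" "$tmp/2.txt.ids")

# -x: match whole lines only, -f: read the patterns from IDs.txt.
result=$(printf '%s\n' "$common" | grep -Fxf "$tmp/IDs.txt")
echo "$result"
```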

$ cat 1.txt  
id=foo
name=example
age=81
end
id=notTheIDYouAreLookingFor
name=other
age=null
$ cat 2.txt
<Data>
<ID>foo</ID>
<Stuff>Some things</Stuff>
</Data>
<Data>
<ID>bar</ID>
<Stuff>Other things</Stuff>
</Data>
$ cat IDs.txt
foo
bar
someOtherID
$ bash work.sh
foo

Only the one ID from the list that is present in both data files is found.
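If a third format does show up, one way to keep the script manageable is one small normalizer function per format. The sketch below assumes a purely hypothetical 3.txt whose lines look like 'ID: something'. The function names and the temporary directory are also made up for illustration.

```shell
#!/bin/sh
# Sketch: one normalizer per format, then intersect the ID lists.
tmp=$(mktemp -d)
printf '%s\n' foo bar someOtherID > "$tmp/IDs.txt"
printf '%s\n' 'id=foo' 'name=example' 'age=81' 'end' \
    'id=notTheIDYouAreLookingFor' 'name=other' 'age=null' > "$tmp/1.txt"
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
    '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > "$tmp/2.txt"
printf '%s\n' 'ID: foo' 'ID: bar' 'ID: baz' > "$tmp/3.txt"   # hypothetical format

# Each function prints the bare IDs of one format, sorted and de-duplicated.
ids_from_1() { sed -n 's/^id=//p' "$1" | LC_ALL=C sort -u; }
ids_from_2() { sed -n 's/^<ID>\(.*\)<\/ID>$/\1/p' "$1" | LC_ALL=C sort -u; }
ids_from_3() { sed -n 's/^ID: //p' "$1" | LC_ALL=C sort -u; }   # hypothetical

ids_from_1 "$tmp/1.txt" > "$tmp/1.ids"
ids_from_2 "$tmp/2.txt" > "$tmp/2.ids"
ids_from_3 "$tmp/3.txt" > "$tmp/3.ids"

# Intersect all three lists ("-" makes comm read the first list from stdin),
# then keep only the IDs that are listed in IDs.txt.
result=$(LC_ALL=C comm -12 "$tmp/1.ids" "$tmp/2.ids" |
    LC_ALL=C comm -12 - "$tmp/3.ids" |
    grep -Fxf "$tmp/IDs.txt")
echo "$result"
```

Supporting another format then means adding one function and one more comm stage, while the rest of the pipeline stays the same.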
