Question

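Suppose I have a predictably structured text document containing IDs, labelled X:, combined with known attributes, for example a category Y: with a known number of instances (e.g. there is always exactly one Y: after each X: in the series):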

  X:37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

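I want to retrieve the list of item IDs for all blue items. I don't care whether an ID is duplicated, only which ID values occur in the document. I then want to sort the list and compare it with the list of blue IDs from another structured text doc with exactly the same structure ("Which blue items do the two docs have in common?" "Which blue items are in doc 1 but not in doc 2?").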

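I know I can easily grep for all the Y:BLUE lines, but what additional commands do I need to find the preceding X: for each such instance and pass the sorted result list to diff? I haven't used the command line intensively since AmiShell... sorry :-( Is there an online cookbook for use cases like this?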

Answer 1

Suppose you have the following two input files:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can use the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
4

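Explanation:

-F':' defines : as the field separator.
/X:[0-9]+$/{tmp=$2} saves the ID value in the tmp variable (assuming IDs consist only of digits and nothing else follows on the line). If that is not the case for you, adjust the filter regex /X:[0-9]+$/ to suit your needs.
/Y:BLUE$/{a[NR]=tmp} when we reach a line matching the pattern Y:BLUE (assumption: the end of line immediately follows BLUE), we add the value saved in tmp to the array.
At the end of processing we sort the array and print it. Note that asort is a GNU awk (gawk) extension, and that you can instead do the sorting outside awk: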

awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' | sort -n
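To avoid repeating the long one-liner, the extraction step could be wrapped in a small shell function. This is only a sketch under the same assumptions as above (each X: line precedes its Y: line, IDs are digits only); the function name blue_ids and the sample file are made up for illustration. It uses sort -nu rather than asort, so it does not depend on gawk and also drops duplicate IDs.

```shell
#!/bin/sh
# blue_ids FILE: print the numerically sorted, de-duplicated IDs of all
# Y:BLUE items in FILE. Hypothetical helper, not part of the original answer.
blue_ids() {
    awk -F':' '
        /X:[0-9]+$/ { id = $2 }     # remember the most recent ID
        /Y:BLUE$/   { print id }    # emit it when the item is blue
    ' "$1" | sort -nu               # numeric sort, duplicates removed
}

# Tiny made-up sample: IDs 10 and 3 are blue (3 appears twice), 2 is red.
tmpdir=$(mktemp -d)
printf '  X:10\n# data\n  Y:BLUE\n  X:2\n  Y:RED\n  X:3\n  Y:BLUE\n  X:3\n  Y:BLUE\n' > "$tmpdir/sample"

blue_ids "$tmpdir/sample"
```

Sorting numerically matters once IDs have more than one digit: a plain sort would place 10 before 3.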

You can then combine them as follows to find the differences in blue IDs between the two documents:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)                                                    
2c2
< 2
---
> 3

Or to find the blue IDs they have in common:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
1
4
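The question also asks which blue IDs are in doc 1 but not in doc 2. One way to sketch that is with comm -23, which hides the lines unique to the second input and the lines common to both. The helper name blue_ids_lex and the shortened sample files below are assumptions for illustration. Because comm expects its inputs in the same collation order that sort produces by default, the helper uses a plain sort -u rather than a numeric sort; temporary files stand in for process substitution so the sketch also runs in a plain sh.

```shell
#!/bin/sh
# blue_ids_lex FILE: blue IDs, de-duplicated and sorted in the default
# (lexical) order that comm expects. Hypothetical helper.
blue_ids_lex() {
    awk -F':' '/X:[0-9]+$/ { id = $2 } /Y:BLUE$/ { print id }' "$1" | sort -u
}

# Shortened stand-ins for doc1 and doc2 (same X:/Y: pairs, filler lines omitted).
tmpdir=$(mktemp -d)
printf '  X:1\n  Y:BLUE\n  X:2\n  Y:BLUE\n  X:3\n  Y:RED\n  X:4\n  Y:BLUE\n' > "$tmpdir/doc1"
printf '  X:4\n  Y:BLUE\n  X:3\n  Y:BLUE\n  X:2\n  Y:RED\n  X:1\n  Y:BLUE\n' > "$tmpdir/doc2"

blue_ids_lex "$tmpdir/doc1" > "$tmpdir/ids1"
blue_ids_lex "$tmpdir/doc2" > "$tmpdir/ids2"

# -2 hides lines only in the second file, -3 hides lines common to both.
only_in_doc1=$(comm -23 "$tmpdir/ids1" "$tmpdir/ids2")
# -1 hides lines only in the first file.
only_in_doc2=$(comm -13 "$tmpdir/ids1" "$tmpdir/ids2")

echo "only in doc1: $only_in_doc1"
echo "only in doc2: $only_in_doc2"
```

If the final list should be presented in numeric order, the output of comm can be piped through sort -n afterwards.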
