Quick lookup of structured text data from the command line?

Date: 2018-06-03 20:39:21

Tags: shell awk data-structures grep

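Suppose I have a predictable text document containing IDs labelled X: together with a known set of attributes, for example a category Y: that occurs a known number of times (e.g. every X: in the series is always followed by exactly one Y:):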

  X:37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

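I want to retrieve the list of item IDs of all blue items. I don't care whether IDs are duplicated, only which ID values are present in the document. I then want to sort the list and compare it with the list of blue IDs from another structured text doc that has exactly the same structure ("Which blue items do both documents have in common?" "Which blue items are in doc 1 but not in doc 2?").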

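I know I can easily grep for all the Y:BLUE lines, but what additional commands do I need to find the preceding X: for each such match and pass the sorted list of results to diff? I haven't used the command line intensively since AmiShell... sorry :-( Is there an online cookbook for this kind of use case?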

1 answer:

Answer 0 (score: 0)

Suppose you have the following two input files:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can run the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2
1
3
4

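Explanation:

  • -F':': sets : as the field separator
  • /X:[0-9]+$/{tmp=$2} saves the ID value in the tmp variable (assuming IDs consist only of digits and nothing else follows on the line); if that is not your case, adjust the filter regex /X:[0-9]+$/ to suit your needs
  • /Y:BLUE$/{a[NR]=tmp} when we reach a line matching Y:BLUE (assuming the end of line comes immediately after BLUE), we add the value saved in tmp to the array
  • at the end of processing we sort the array and print it (asort is a gawk extension, so this needs GNU awk); note that you could instead reduce the awk command to awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' | sort -n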
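
The simpler pipeline mentioned in the last point can be wrapped in a small shell function. This is only a sketch: the name blue_ids is made up for illustration, and it assumes the same input layout as above (digits-only IDs, Y:BLUE at the end of its line). Adding -u to sort drops repeated IDs, which matches the question's "I don't care about duplicates".

```shell
# blue_ids FILE: print the unique IDs of BLUE items in FILE, sorted numerically.
# (blue_ids is a hypothetical helper name, not part of the original answer.)
blue_ids() {
    awk -F':' '
        /X:[0-9]+$/ { id = $2 }    # remember the most recent X: ID
        /Y:BLUE$/   { print id }   # emit it when the matching Y:BLUE line appears
    ' "$1" | sort -n -u            # numeric sort, duplicates dropped
}
```

Called as blue_ids doc1, it should list the same IDs as the asort version while relying only on standard awk and sort.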

You can then combine them as follows to find the difference in blue IDs between the two documents:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
2c2
< 2
---
> 3

Or to find the blue IDs common to both:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
1
4
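
The question also asks which blue IDs are in doc 1 but not in doc 2, and comm can answer that directly with -2 -3. One caveat: comm expects its inputs in collation order rather than numeric order, which only coincides for single-digit IDs like the ones here. The sketch below therefore sorts with LC_ALL=C sort -u. The helper names blue_ids_lex and compare_blue are invented for illustration, and temporary files (via mktemp, assumed to be available) are used in place of <(...) so the functions do not depend on bash.

```shell
# blue_ids_lex FILE: unique BLUE IDs in the byte-wise order that comm expects.
# (hypothetical helper name; LC_ALL=C keeps sort and comm consistent)
blue_ids_lex() {
    awk -F':' '/X:[0-9]+$/ { id = $2 } /Y:BLUE$/ { print id }' "$1" |
        LC_ALL=C sort -u
}

# compare_blue COMM_FLAGS FILE_A FILE_B: run comm on the two BLUE ID lists.
# (hypothetical helper; assumes mktemp exists on the system)
compare_blue() {
    cb_a=$(mktemp) || return 1
    cb_b=$(mktemp) || { rm -f "$cb_a"; return 1; }
    blue_ids_lex "$2" > "$cb_a"
    blue_ids_lex "$3" > "$cb_b"
    LC_ALL=C comm "$1" "$cb_a" "$cb_b"   # e.g. -12 = common, -23 = only in A
    rm -f "$cb_a" "$cb_b"
}
```

With the two sample files above, compare_blue -12 doc1 doc2 should print 1 and 4, and compare_blue -23 doc1 doc2 should print 2.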