Question

My data is in the following format:

1;string1
2;string2
...
n;stringn

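The first column is an id number and the second column contains a text string. The text strings may contain digits, letters, and characters such as /.()?!. The id number equals the line number. I am trying to find duplicates among these text strings. I would like to get information like this: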

String of id 1 is duplicated on lines/ids 4,6,7
String of id 2 is duplicated on lines/ids 11,25

So far I have done this with an Awk command:

awk '/String of text/ {print FNR}' targetfile

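and manually replaced the search string with each text string from my file. Now that the data set is larger, this has become impractical. Can my Awk command be improved so that it automatically tests every text string in the file against the others and outputs the information I am looking for? I thought of using a for loop for this, but could not figure out how to make it work.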

If there is a better solution, I can also use tools other than Awk. My system is Ubuntu 14.04.

Answer 1

Put this (explanation in the comments):

{ seen[$2] = seen[$2] $1 " " }               # remember where you saw strings
                                             # as string of numbers

END {                                        # in the end
  for(s in seen) {                           # for all strings you saw
    n = split(seen[s], nums, " ");           # split apart the line numbers again;
                                             # split() returns how many there are

    if(n > 1) {                              # if you saw it more than once
      line = s " is duplicated on lines";    # build the output line
      for(i = 1; i <= n; ++i) {              # with all the line numbers where you
        line = line " " nums[i]              # saw it
      }
      print line                             # and print the line
    }
  }
}

into a file, say foo.awk, then run awk -F \; -f foo.awk filename
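
To see what the first rule collects before the END block runs, it can help to dump the seen array for a tiny input. This is only a sketch: sample.txt and its contents are made up for illustration, and the output is piped through sort because awk does not define the order in which for (s in seen) visits the strings.

```shell
# Hypothetical sample data (not from the question): ids 1 and 3 share a string.
printf '%s\n' '1;foo bar' '2;baz?' '3;foo bar' > sample.txt

# Print every distinct string followed by the ids recorded for it.
awk -F \; '{ seen[$2] = seen[$2] $1 " " }
           END { for (s in seen) print s " -> " seen[s] }' sample.txt | sort
```

Each string maps to a space-separated list of ids; any string with more than one id in its list is a duplicate.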

You can also put it on one line like this:

awk -F \; '{ seen[$2] = seen[$2] $1 " " } END { for(s in seen) { n = split(seen[s], nums, " "); if(n > 1) { line = s " is duplicated on lines"; for(i = 1; i <= n; ++i) { line = line " " nums[i] } print line } } }' filename

...but it is long enough that I would use a file.
