Question

I have x lines like this:

Unable to find latest released revision of 'CONTRIB_046578'.

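I need to extract the word CONTRIB_046578 that sits between "revision of '" and the closing "'" in this example and, if possible, count the occurrences of that word using grep, sed, or any other command.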

Answer 1

The cleanest solution is to use grep -Po "(?<=')[^']+(?=')" (the -P option needs a grep built with PCRE support, such as GNU grep):

$ cat file
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'

# Print occurrences
$ grep -Po "(?<=')[^']+(?=')" file
CONTRIB_046578
foo
bar
CONTRIB_046578

# Count occurrences (grep -c counts matching lines, so this assumes one quoted word per line)
$ grep -Pc "(?<=')[^']+(?=')" file
4

# Count occurrences of each distinct word
$ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c 
2 CONTRIB_046578
1 bar
1 foo
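
A caveat on the above: -P (Perl-compatible regexes) is not available in every grep; the BSD grep shipped with macOS, for instance, lacks it. Below is a minimal sketch of a variant that avoids -P, assuming a grep that supports -o plus the usual tr, sort, uniq and mktemp. The helper names count_quoted and count_word are made up for illustration; count_word addresses the original question of counting one specific word.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from the answer above.

# count_quoted FILE: frequency of every single-quoted word, without grep -P.
count_quoted() {
    grep -o "'[^']*'" "$1" |    # keep only the 'quoted' part of each line
        tr -d "'" |             # strip the quotes themselves
        LC_ALL=C sort |         # group identical words together
        uniq -c                 # prefix each distinct word with its count
}

# count_word WORD FILE: number of lines containing 'WORD' (quotes included).
count_word() {
    grep -cF -- "'$1'" "$2"
}

# Sample data copied from the answer above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'
EOF

per_word=$(count_quoted "$sample")
one_word=$(count_word CONTRIB_046578 "$sample")
printf '%s\n' "$per_word"
printf 'CONTRIB_046578: %s\n' "$one_word"
rm -f "$sample"
```

Like grep -c, count_word counts matching lines, so it assumes a word appears at most once per line.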

Answer 2

Here is an awk script you can use to extract each word in single quotes and count how often it occurs:

awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}}
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

Test:

cat infile
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

Output:

awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}}
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

CONTRIB_046579 3
CONTRIB_046578 1
CONTRIB_046570 1
CONTRIB_046572 2
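
One thing to watch for: the field-based test keeps anything attached to the quoted word, so the trailing period in the question's sample line would end up as part of the key. A minimal sketch of a variant that works on the whole line with match() instead, assuming one quoted word per line; the helper name count_quoted_awk is hypothetical:

```shell
#!/bin/sh
# count_quoted_awk FILE: hypothetical helper, not from the answer above.
# \047 is the octal escape for a single quote, so the awk program can stay
# inside one pair of shell single quotes.
count_quoted_awk() {
    awk '
        # match() sets RSTART and RLENGTH for the first quoted span on the line
        match($0, /\047[^\047]*\047/) {
            word = substr($0, RSTART + 1, RLENGTH - 2)   # drop both quotes
            cnt[word]++
        }
        END { for (word in cnt) print word, cnt[word] }
    ' "$1" | LC_ALL=C sort    # for-in order is unspecified, so sort the result
}

# Sample lines in the format from the question, trailing period included.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046572'.
EOF

result=$(count_quoted_awk "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```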

Answer 3

All you need is a very simple awk script to count what is between the quotes:

awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file

Using @anubhava's test input file:

$ cat file
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
$
$ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
CONTRIB_046578 1
CONTRIB_046579 3
CONTRIB_046570 1
CONTRIB_046572 2
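
Two small refinements that may be useful, offered as a sketch rather than a correction: awk's for-in loop returns keys in an unspecified order, and a line without any quotes would be counted under an empty key. The wrapper below, with the made-up name top_quoted, skips such lines and sorts by count. It assumes the quoted words contain no spaces, since sort splits on blanks.

```shell
#!/bin/sh
# top_quoted FILE: hypothetical wrapper; the name is illustrative only.
top_quoted() {
    # NF >= 3 means the line has at least two quotes, so $2 is quoted text.
    awk -F"'" 'NF >= 3 { c[$2]++ } END { for (w in c) print w, c[w] }' "$1" |
        LC_ALL=C sort -k2,2nr -k1,1   # highest count first, then by name
}

# The test input from above plus one unrelated line that must be ignored.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
some unrelated log line
EOF

result=$(top_quoted "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```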

Answer 4

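Assumptions:

Each word can occur multiple times, and the OP wants to count the occurrences of each word.
There are no other lines in the file.

Input file: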

$ cat test.txt 
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046570'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.

Shell pipeline to filter and count the words:

$ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
  1 CONTRIB_046570
  2 CONTRIB_046572
  2 CONTRIB_046578
  1 CONTRIB_046579
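
If the second assumption does not hold, lines without quotes pass through sed unchanged and get counted too. A minimal sketch of one way around that, using sed -n together with the p flag so only lines where the substitution succeeded are printed; the function name extract_quoted and the extra sample line are illustrative:

```shell
#!/bin/sh
# extract_quoted FILE: hypothetical helper, not from the answer above.
extract_quoted() {
    # -n suppresses default output; the trailing p prints only lines where
    # the substitution matched, i.e. lines that contain a quoted word.
    sed -n "s/.*'\([^']*\)'.*/\1/p" "$1"
}

# Input from above plus one line without quotes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046570'.
Build finished with warnings.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

result=$(extract_quoted "$sample" | LC_ALL=C sort | uniq -c)
printf '%s\n' "$result"
rm -f "$sample"
```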

Answer 5

sed "s/.*'\([^']*\)'.*/\1/" myfile.txt

Answer 6

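If the test file below is representative of the file in the actual problem, the following may be useful.

Given that every line in the test file is homogeneous, i.e. well formed and containing 8 columns (or fields), a convenient solution using the cut command is as follows:

File: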

Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

Code:

cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c

Output:

1 CONTRIB_046570
2 CONTRIB_046572
1 CONTRIB_046578
3 CONTRIB_046579

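A note on the code: by default cut uses a tab as the field delimiter, but since the fields here are separated by a single space, we specify the option -d ' '. The rest of the code is similar to the other answers, so I won't repeat what has already been said.

General note: if the file is not formatted as described above, this code may not produce the desired output.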
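
If the 8-column assumption is a concern, one possible workaround is to split on the quote character itself rather than on spaces. This is only a sketch: the function name count_by_cut and the sample lines are made up, and it assumes one quoted word per line.

```shell
#!/bin/sh
# count_by_cut FILE: hypothetical helper, not from the answer above.
count_by_cut() {
    # -d"'" splits each line on single quotes, so field 2 is the quoted word
    # no matter how many space-separated words precede it.
    # -s skips lines that contain no quote at all.
    cut -s -d"'" -f2 "$1" | LC_ALL=C sort | uniq -c
}

# Sample lines with differing word counts and one line without quotes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Error: unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046579'.
no quotes on this line
EOF

result=$(count_by_cut "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```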
