Question

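I have a file with thousands of lines, each containing a number followed by a line of text. I'd like to add up the numbers for the lines whose text is the same. I'd like unique lines to be output as well.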

For example:

25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee

The output would be:

37 cup of coffee
75 sign on the dotted
30 take a test

Does anyone have a suggestion for how this could be achieved in a unix shell?

I looked at "Shell command to sum integers, one per line?", but that is about summing a column of numbers across all lines in a file, not across lines with similar text only.

Answer 1

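There is no need for multiple processes and pipes. awk alone is more than capable of handling the entire job (and on large files it will be orders of magnitude faster). With awk, simply append each of fields 2-NF as a string and use that string as an index to sum the number in field 1 in an array. Then, in the END section, simply output the contents of the array. For example, presuming your data is stored in file, you could do: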

awk '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    str=""
}
END {
    for (i in a) print a[i], i
}' file

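Above, the first for loop simply appends all fields from 2-NF to str, and a[str] += $1 sums the values in field 1 into array a using str as the index. That ensures the values for similar lines are summed. In the END section, you simply loop over each element of the array, outputting the element value (the sum) followed by the index (the original str built from fields 2-NF).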

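Example Use/Output

Just select what is above and middle-mouse paste it into a command line in the directory where file is located (change the name of file to your data file name):

$ awk '{
>     for (i=2; i<=NF; i++)
>         str = str " " $i
>     a[str] += $1
>     str=""
> }
> END {
>     for (i in a) print a[i], i
> }' file
30  take a test
37  cup of coffee
75  sign on the dotted

If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. For example, for output in the order you show, you would use | sort -k 2 and the output would be:

37  cup of coffee
75  sign on the dotted
30  take a test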

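Preserving Original Order of Strings

Pursuant to your comment about how to preserve the original order of the lines of text seen in your input file, you can keep a second array in which the strings are stored in the order they are seen, using a sequential index to keep them in order. For example, below the o array (order array) is used to store each unique string (fields 2-NF) and the variable n is used as a counter. A loop over the array checks whether the string is already contained, and if so, next is used to avoid storing the string again and to jump to the next record of input. Then, in END, the loop uses a for (i = 0; i < n; i++) form to output the information from both arrays in the order the strings were seen in the original file, e.g.

awk -v n=0 '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    for (i = 0; i < n; i++)
        if (o[i] == str) {
            str=""
            next;
        }
    o[n++] = str;
    str=""
}
END {
    for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file

Output

37  cup of coffee
75  sign on the dotted
30  take a test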

Answer 2

You can do the following (assuming the file is named file.txt):

for key in $(sort -k2  -u file.txt   | cut -d ' ' -f2)
do 
    cat file.txt|grep $key  | awk '{s+=$1} END {print $2 "\t" s}'
done

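Explanation: 1. Get the unique keys. Note that cut -f2 keeps only the first word of each text, so for the sample data the keys are cup, sign and take rather than the full phrases: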

sort -k2  -u file.txt   | cut -d ' ' -f2

2. grep all lines in the file that contain the unique key:

cat file.txt | grep $key

3. Use awk to sum the lines, where $1 = number column and $2 = key:

awk '{s+=$1} END {print $2 "\t" s}'

Put it all in a for loop and iterate over the unique keys.

Note: if one key can be a substring of another key, for example "coffee" and "cup of coffee", you will need to change step 2 to a grep with a regex.
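For the substring case in the note, one possible sketch is to loop over the full text of each line instead of a single word, and to compare that text exactly rather than pattern-matching with grep. The function name sum_by_exact_key is hypothetical, and the sketch assumes every line is a number, a single space, and then the text.

```shell
#!/bin/sh
# sum_by_exact_key FILE   (hypothetical helper name, not from the answer)
# Assumes each line looks like "<number><single space><text>".
sum_by_exact_key() {
    # Step 1: unique text keys (everything after the first space), sorted.
    cut -d ' ' -f2- "$1" | sort -u |
    while IFS= read -r key; do
        # Steps 2 and 3: keep only lines whose text equals the key exactly,
        # then sum their first field.
        # Caveat: awk -v interprets backslash escapes, so keys containing
        # backslashes would need different handling.
        awk -v key="$key" '
            substr($0, index($0, " ") + 1) == key { s += $1 }
            END { print s, key }
        ' "$1"
    done
}
```

Called as sum_by_exact_key file.txt, this still reads the file once per key, like the loop in the answer, so it is meant to illustrate exact matching rather than to be fast.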

Answer 3

Do you mean something like this?

#!/bin/bash

# define a dictionary
declare -A dict

# loop over all lines
while read -r line; do

   # read first word as value and the rest as text
   IFS=' ' read value text <<< "$line"

   # use 'text' as key, get value for 'text', default 0
   [ ${dict[$text]+exists} ] && dictvalue="${dict[$text]}" || dictvalue=0

   # sum value

   value=$(( $dictvalue + value )) 

   # save new value in dictionary
   dict[$text]="$value" 
done < data.txt  

# loop over dictionary, print sum and text
for key in "${!dict[@]}"; do
   printf "%s %s\n" "${dict[$key]}" "$key"
done

Output

37 cup of coffee
75 sign on the dotted
30 take a test
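Bash does not promise any particular iteration order for the keys of an associative array, so the order of the printed lines may differ from the input. If first-seen order matters, a minimal sketch is to keep a second, indexed array of keys alongside the dictionary. The function name sum_in_order is hypothetical; the sketch assumes bash 4 or later and plain decimal integers without leading zeros.

```shell
#!/bin/bash
# Requires bash 4+ (associative arrays).
# sum_in_order FILE   (hypothetical helper name, not from the answer)
sum_in_order() {
    local -A sums=()     # text -> running sum
    local -a order=()    # texts in the order they were first seen
    local value text
    while read -r value text; do
        [ -n "$text" ] || continue        # skip blank or text-less lines
        if [ -z "${sums[$text]+set}" ]; then
            order+=("$text")              # remember the first appearance
        fi
        sums[$text]=$(( ${sums[$text]:-0} + value ))
    done < "$1"
    # Walk the indexed array so the output follows input order.
    for text in "${order[@]}"; do
        printf '%s %s\n' "${sums[$text]}" "$text"
    done
}
```

Called as sum_in_order data.txt, it prints one line per distinct text, in the order each text first appeared in the file.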

Answer 4

Here is a simple awk script that does the task:

script.awk

{                          # for each input line
    inpText = substr($0, length($1)+2);  # read the input text after 1st field
    inpArr[inpText] = inpArr[inpText] + 0 + $1; # accumulate the 1st field in array
}
END {                     # post processing
    for (i in inpArr) {   # for each element in inpArr
        print inpArr[i], i; # print the sum and the key
    }
}

input.txt

25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee

Running:

awk -f script.awk input.txt

Output:

75 sign on the dotted
37 cup of coffee
30 take a test

Answer 5

Another version based on the same logic as the one mentioned here by @David.
Change: it omits the loops to speed up the process.

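awk '
{
  text=substr($0, index($0,$2))                  # text part, starting at field 2
  if(!(text in text_sums)){ texts[n++]=text }    # remember first-seen order
  text_sums[text]+=$1
}
END {
 # counted loop: for (i in texts) would not guarantee the stored order
 for (i=0; i<n; i++) print text_sums[texts[i]],texts[i]
}' input.txt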

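Explanation:
substr returns the string starting at field 2, i.e. the text part.
The array texts stores the text at an integer index if it is not already present in the text_sums array.
text_sums keeps adding field 1 for the corresponding text.

The reason for a separate array that stores the texts as values, indexed by consecutive integers, is to ensure that the values (texts) are accessed in that same consecutive order.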

See Array Intro

Footnote:

The ordering will vary among awk implementations, which typically use hash tables to store array elements and values.
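One caveat with index($0,$2): index returns the first place where the second field's text occurs anywhere in the line, so for a line such as "12 1 apple" it matches inside the number and the whole line becomes the key. A minimal sketch that sidesteps this deletes the first field with sub() instead. The function name sum_keep_order is hypothetical, and the sketch assumes fields are separated by spaces or tabs.

```shell
#!/bin/sh
# sum_keep_order FILE   (hypothetical helper name, not from the answer)
sum_keep_order() {
    awk '
    {
        text = $0
        # Remove optional leading blanks, the first field, and the blanks after it.
        sub(/^[ \t]*[^ \t]+[ \t]+/, "", text)
        if (!(text in sums)) order[n++] = text   # record first-seen order
        sums[text] += $1
    }
    END {
        # A counted loop visits the keys in the order they were stored.
        for (i = 0; i < n; i++) print sums[order[i]], order[i]
    }' "$1"
}
```

Called as sum_keep_order input.txt, it prints the sums in first-seen order without depending on how a given awk orders for (i in array).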

Answer 6

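This is relatively succinct with datamash. First use sed to change the first space to a tab (for this job datamash must have one and only one tab separator), then use -s -g2 to sort and group by the 2nd field (i.e. "cup" etc.), then use sum 1 to add up the first column numbers by group, and it's done. No, not quite: for some reason the number column migrated to the 2nd field, so reverse migrates it back to the 1st field: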

sed 's/ /\t/' file | datamash -s -g2 sum 1 | datamash reverse

Output:

37  cup of coffee
75  sign on the dotted
30  take a test

Shell command to sum numbers across similar lines of text in a file
