Question

I don't have much experience with regular expressions or awk on Linux, and I'm not sure what the best approach is.

I have a text file that looks like this:

492 "Steve Smith"
455 "Steve Smith"
322 "Steve Smith"
123 "John Doe"
234 "John Doe"
etc.

The output I want is:

Steve Smith - 492, 455, 322
John Doe - 123, 234

Answer 1

You can import the file into a sqlite3 database and then run a select query.

$ sudo apt install sqlite3
$ sqlite3
> create table test (num integer, name text);
> .separator " "
> .import your_file test
> select name || ' - ' || group_concat(num, ', ') from test group by name;

Answer 2

The following awk may help you.

Solution 1:

awk '{
match($0,/".*"/);
val=substr($0,RSTART,RLENGTH);
a[val]=a[val]?a[val] OFS $1:$1
}
END{
for(i in a){
 print i" - "a[i]
}}
' OFS=", "   Input_file

The output is as follows.

"John Doe" - 123, 234
"Steve Smith" - 492, 455, 322

Solution 2: If you want the output in the same order as the names appear in your Input_file, then the following may help as well.

awk '{
match($0,/".*"/);
val=substr($0,RSTART,RLENGTH);
}
!b[val]++{
  num++
}
{
a[val]=a[val]?a[val] OFS $1:$1;
c[num]=a[val];
d[num]=val
}
END{
for(i=1;i<=num;i++){
  print d[i]" - "c[i]
}}
' OFS=", "   Input_file

The output is as follows.

"Steve Smith" - 492, 455, 322
"John Doe" - 123, 234

Explanation of Solution 1:

awk '{
match($0,/".*"/);              ##match() is a built-in awk function that tests a string (here the current line, $0) against a regex; this regex matches everything from the first " to the last " on the line.
val=substr($0,RSTART,RLENGTH); ##val is the matched text: substr() (also built in) returns RLENGTH characters of the line starting at position RSTART. NOTE: match() sets RSTART and RLENGTH whenever it finds a match.
a[val]=a[val]?a[val] OFS $1:$1 ##array a is indexed by val; if a[val] already holds numbers, append OFS and $1 to it, otherwise start the list with $1.
}
END{                           ##the END block runs once the whole Input_file has been read.
for(i in a){                   ##loop over every index of array a.
 print i" - "a[i]              ##print the index i (the quoted name), then " - ", then the list stored in a[i].
}}
' OFS=", "  Input_file         ##set OFS (the output field separator) to ", " and name the Input_file to read.
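To see what match() leaves in RSTART and RLENGTH, here is a minimal sketch run on a single line of the sample data. The shell variables line and matched are names chosen only for this illustration.

```shell
#!/bin/sh
# One sample line from the question.
line='492 "Steve Smith"'

# Print where the quoted part starts, how long it is, and the text itself.
matched=$(printf '%s\n' "$line" | awk '{
    if (match($0, /".*"/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')

printf '%s\n' "$matched"
# prints: 5 13 "Steve Smith"
```

The opening quote is the 5th character of the line, and the quoted name is 13 characters long including both quotes, which is why val still carries the quotes in the output of Solution 1.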

Answer 3

This gets you what you want (but without commas):

$ awk -F'"' '{a[$2]=a[$2]$1} END{for (name in a) printf "%s - %s\n",name,a[name]}' file
Steve Smith - 492 455 322 
John Doe - 123 234

To include commas:

$ awk -F'"' '{a[$2]=a[$2]", "$1+0} END{for (name in a) printf "%s - %s\n",name,substr(a[name],3)}' file
Steve Smith - 492, 455, 322
John Doe - 123, 234

How it works

-F'"'

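This tells awk to use the double quote " as the field separator. That way, the number is field 1 and the name is field 2.

a[$2]=a[$2]", "$1+0

For each line, we append a comma and the number to the value stored in the associative array a under the key $2.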

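The second field, $2, is the name. a[$2] is the list of numbers for that name. For each new line we read, we replace a[$2] with the previous value of a[$2], followed by a comma and a space, and then the first field plus zero, $1+0. We use +0 to force the first field to be treated as a number, which strips the extra space from the first field.

END{for (name in a) printf "%s - %s\n",name,substr(a[name],3)}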

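After we reach the end of the file, we print each name followed by space-hyphen-space and then our list of numbers. The substr function removes the extra comma and space from the beginning of the number string.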
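The two small tricks described above, +0 and substr, can be checked in isolation. This is a minimal sketch; the square brackets are added only to make the whitespace visible, and the variable names fields and list are assumptions for the illustration.

```shell
#!/bin/sh
# Split one sample line on the double quote and show field 1 before and after +0.
fields=$(printf '492 "Steve Smith"\n' |
    awk -F'"' '{ print "[" $1 "]", "[" ($1 + 0) "]", "[" $2 "]" }')
printf '%s\n' "$fields"
# prints: [492 ] [492] [Steve Smith]

# substr(s, 3) drops the first two characters, here the leading comma and space.
list=$(awk 'BEGIN { s = ", 492, 455, 322"; print substr(s, 3) }')
printf '%s\n' "$list"
# prints: 492, 455, 322
```

Field 1 keeps the space that sat in front of the opening quote, and adding zero converts it to the plain number 492.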

The names are printed in arbitrary order. You may want to pipe the output through sort to put them in alphabetical order.
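As a sketch of that last suggestion, the comma version of the command can be piped straight into sort. Here the sample data is fed on standard input instead of from a file, and the variable name sorted is an assumption for the illustration.

```shell
#!/bin/sh
# Group the sample lines, then sort the result alphabetically.
sorted=$(printf '%s\n' \
    '492 "Steve Smith"' \
    '455 "Steve Smith"' \
    '322 "Steve Smith"' \
    '123 "John Doe"' \
    '234 "John Doe"' |
    awk -F'"' '{a[$2]=a[$2]", "$1+0} END{for (name in a) printf "%s - %s\n",name,substr(a[name],3)}' |
    sort)

printf '%s\n' "$sorted"
```

Because each output line begins with the name, a plain sort orders the lines by first name, so John Doe comes before Steve Smith whatever order awk's for-in loop produced.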

Group a text file by name in Linux
