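Collapsing duplicate values in one column and joining the corresponding values from the other column

Date: 2014-03-30 06:00:08

Tags: python awk

Say my input file, file1.tsv, has the following 2 columns: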

type         grocery
fruits       orange
fruits       apple
fruits       kiwi
greens       collard
greens       spinach

The desired result is:

type         grocery
fruits       orange, apple, kiwi
greens       collard, spinach

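I can read the duplicated values in column 1 into a dictionary, but I can't work out how to append the corresponding column 2 values, separated by commas. Is there a quick way to do this in Python?

5 answers:

Answer 0 (score: 2)

If the file is already grouped by column 1: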

awk 'p==$1{s=s ", " $2; next} {if(p)print s; p=$1; s=$0} END{print s}' file
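
For comparison, here is a minimal Python sketch of the same idea, assuming (as the awk one-liner does) that rows with the same first column are adjacent. The `rows` list and the `collapse_grouped` helper are illustrative names, not part of the original answer.

```python
from itertools import groupby

# Sample data rows from the question (header omitted);
# assumed to be already grouped by the first column.
rows = [
    ("fruits", "orange"),
    ("fruits", "apple"),
    ("fruits", "kiwi"),
    ("greens", "collard"),
    ("greens", "spinach"),
]

def collapse_grouped(rows):
    """Join the second-column values for each run of equal first-column values."""
    result = []
    # groupby only merges *consecutive* equal keys, hence the grouping assumption.
    for key, group in groupby(rows, key=lambda row: row[0]):
        result.append((key, ", ".join(value for _, value in group)))
    return result

for key, joined in collapse_grouped(rows):
    print(key, joined, sep="\t")
```

If the input is not grouped, `groupby` would emit the same key more than once, so sorting first or using a dictionary would be the safer choice.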

Answer 1 (score: 1)

You can simply store the values as lists:

types = ['type','fruits','greens']
values = [['grocery'],['orange','apple','kiwi'],['collard', 'spinach']]

my_dict = dict(zip(types, values))

>>> print(my_dict)
{'type': ['grocery'], 'fruits': ['orange', 'apple', 'kiwi'], 'greens': ['collard', 'spinach']}

This way, if you want to add anything, you just do:

my_dict['type'].append('dairy')
my_dict['fruits'].append('banana')

If you want to create a new type, just assign to a new key, and Python will automatically create the new key-value pair, like so:

my_dict['meats'] = ['beef', 'chicken', 'fish']
>>> len(my_dict['meats'])    # number of items in 'meats'
3
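
The dictionary of lists still has to be turned into the comma-joined lines the question asks for. A minimal sketch of that last step follows, assuming Python 3.7+ (where dictionaries keep insertion order) and starting from the original `my_dict` before the extra items were appended; `format_rows` is a hypothetical helper name.

```python
# The dictionary built earlier in this answer, written out literally.
my_dict = {
    'type': ['grocery'],
    'fruits': ['orange', 'apple', 'kiwi'],
    'greens': ['collard', 'spinach'],
}

def format_rows(mapping):
    """Return one 'key<TAB>value, value, ...' line per dictionary entry."""
    return [key + "\t" + ", ".join(values) for key, values in mapping.items()]

lines = format_rows(my_dict)
print("\n".join(lines))
```

A tab is used between the two columns here because the file is named file1.tsv; any other separator could be substituted.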

Answer 2 (score: 1)

Your input:

$ cat f
type         grocery 
fruits       orange
fruits       apple
fruits       kiwi
greens       collard
greens       spinach

Awk code:

  awk 'NR==1{
              print
              next
            }
            {
              A[$1]=A[$1]?A[$1]","$2:$2
            }
         END{
              for(i in A)
              print i,A[i]
            }' f

Result:

type         grocery
greens collard,spinach
fruits orange,apple,kiwi

--Edit--

If order matters, try this, which passes the same file as input twice.

awk 'FNR==NR{
              A[$1]=A[$1]?A[$1]","$2:$2
              next
            }
   ($1 in A){
              print $1,A[$1];
              delete A[$1]
            }' f f

Result:

type grocery
fruits orange,apple,kiwi
greens collard,spinach
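
For readers who prefer Python, the order-preserving behaviour can also be had in a single pass, since dictionaries keep insertion order from Python 3.7 onward. The sketch below is not part of the original answer; it assumes whitespace-separated columns (as the awk code does), and `sample` and `collapse` are illustrative names.

```python
# Inline copy of the input shown above, used instead of reading the file f.
sample = """\
type         grocery
fruits       orange
fruits       apple
fruits       kiwi
greens       collard
greens       spinach
"""

def collapse(text):
    """Map each first-column value to the list of its second-column values."""
    groups = {}
    for line in text.splitlines():
        fields = line.split()          # split on any run of whitespace, like awk
        if len(fields) < 2:
            continue                   # skip blank or malformed lines
        key, value = fields[0], fields[1]
        groups.setdefault(key, []).append(value)
    return groups

# Keys come out in first-seen order, so the header line stays on top.
for key, values in collapse(sample).items():
    print(key, ",".join(values))
```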

Answer 3 (score: 0)

Using awk:

awk '{ arr[$1] = arr[$1] ? arr[$1] ", " $2 : $2 } \
END { for (var in arr) print var, " ", arr[var] }' file1.tsv

Answer 4 (score: 0)

Another Python solution:

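from collections import defaultdict
from csv import DictReader

d = defaultdict(list)
with open('file1.tsv') as f:
    # Assumes the columns are tab-separated, as the .tsv name suggests.
    x = DictReader(f, delimiter='\t')
    header = x.fieldnames  # column names from the first line
    for l in x:
        d[l['type']].append(l['grocery'])

# Print the header, then one line per type with its groceries joined by commas.
print(" ".join(header))
for k in d:
    print(k, ",".join(d[k]))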