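Joining rows that share the same key

Time: 2015-12-15 01:53:01

Tags: python python-3.x

I am learning Python, and I don't have much programming experience. I am trying to build a routine that imports a CSV file, iterates over the rows that share a particular key, and joins those rows into a single row.

Example

CSV file: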

'0001','key1','name'
'0002','key1','age'
'0001','key2','name'
'0002','key2','age'

The resulting file should be:

['0001','key1','name','0002','key1','age']
['0001','key2','name','0002','key2','age']

How can I do this?

3 answers:

Answer 0 (score: 3)

Read the CSV:

import csv

# In Python 3 the csv module needs a text-mode file opened with newline=''
with open('my_csv.txt', newline='') as f:
    my_list = list(csv.reader(f))

At this point, my_list should be a list of lists, like the following:

[['0001', 'key1', 'name'], ['0002', 'key1', 'age'], ['0001', 'key2', 'name'], ['0002', 'key2', 'age']]
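
If the file really does contain the single quotes shown in the question, csv.reader can strip them while parsing. A minimal sketch, assuming `'` is the quote character; the in-memory string `sample` is only a stand-in for the real file:

```python
import csv
import io

# Stand-in for the real file: the four sample rows from the question
sample = (
    "'0001','key1','name'\n"
    "'0002','key1','age'\n"
    "'0001','key2','name'\n"
    "'0002','key2','age'\n"
)

# quotechar="'" tells the reader that fields are wrapped in single quotes
rows = list(csv.reader(io.StringIO(sample), quotechar="'"))
print(rows[0])  # ['0001', 'key1', 'name']
```

With a real file, the same quotechar argument can be passed to csv.reader(f).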

Build a dict in which each key found in the rows (key1, key2, ...) maps to the concatenation of all rows that share that key:

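dict_of_lists = {}

for item in my_list:
    # The key is the second of the three fields in each row
    _, key, _ = item
    if key in dict_of_lists:
        # Append this row's fields to the ones already collected
        dict_of_lists[key] = dict_of_lists[key] + item
    else:
        # First row for this key: store a copy of the row
        dict_of_lists[key] = list(item)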

If you don't care about the order of the list items:

list(dict_of_lists.values())

Output (the order of the groups may differ):

[['0001', 'key2', 'name', '0002', 'key2', 'age'], ['0001', 'key1', 'name', '0002', 'key1', 'age']]
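
A note on ordering: from Python 3.7 on, a plain dict preserves insertion order, so the groups come out in the order in which each key first appears in the file, which is not necessarily sorted order. A small sketch with made-up rows (`rows` and `groups` are illustrative names) where key2 appears first:

```python
# Illustrative rows in which key2 is seen before key1
rows = [
    ['0001', 'key2', 'name'],
    ['0001', 'key1', 'name'],
    ['0002', 'key2', 'age'],
    ['0002', 'key1', 'age'],
]

groups = {}
for row in rows:
    # setdefault creates the empty list the first time a key is seen
    groups.setdefault(row[1], []).extend(row)

print(list(groups))    # first-seen order: ['key2', 'key1']
print(sorted(groups))  # sorted order: ['key1', 'key2']
```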

If you do care about the order:

from collections import OrderedDict
list(OrderedDict(sorted(dict_of_lists.items())).values())

Output:

[['0001', 'key1', 'name', '0002', 'key1', 'age'], ['0001', 'key2', 'name', '0002', 'key2', 'age']]
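
The question asks for a result file, and the steps above stop at an in-memory list. A minimal sketch of that last step, assuming the grouped rows are already in a dict shaped like dict_of_lists and that the output should keep the single quotes; the helper name `write_groups` and the file name joined.csv are hypothetical:

```python
import csv
import io

def write_groups(groups, out_file):
    """Write one CSV row per key, with the keys in sorted order."""
    # QUOTE_ALL with quotechar="'" wraps every field in single quotes
    writer = csv.writer(out_file, quotechar="'", quoting=csv.QUOTE_ALL)
    for key in sorted(groups):
        writer.writerow(groups[key])

dict_of_lists = {
    'key2': ['0001', 'key2', 'name', '0002', 'key2', 'age'],
    'key1': ['0001', 'key1', 'name', '0002', 'key1', 'age'],
}

# The buffer stands in for open('joined.csv', 'w', newline='')
buffer = io.StringIO()
write_groups(dict_of_lists, buffer)
print(buffer.getvalue())
```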

Answer 1 (score: 1)

If you are able to hold all of the entries in RAM, using a defaultdict to create "buckets" of entries by key is one approach (assuming a file named 'file.csv'):

from collections import defaultdict
import pprint

# This defaultdict acts like a regular Python dictionary, but creates an
# empty list automatically when a key doesn't exist yet
entriesByKey = defaultdict(list)

with open("file.csv") as f:
    for line in f:
        # Strip trailing whitespace and split the line into a list
        # using "," as the separator
        entry = line.rstrip().split(",")
        # The key is the second field in each entry
        key = entry[1]
        # Concatenate the entry onto its key's 'bucket'
        entriesByKey[key] += entry

# Now build a list of concatenated lines by key, sorting the keys
# so that they appear in order
out = [entriesByKey[key] for key in sorted(entriesByKey)]

# Pretty-print the output :-)
pprint.pprint(out)

For this input, the program's output is:

[["'0001'", "'key1'", "'name'", "'0002'", "'key1'", "'age'"],
 ["'0001'", "'key2'", "'name'", "'0002'", "'key2'", "'age'"]]

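All that is missing is removing the single quotes from each entry (and perhaps formatting the output to your liking instead of just using pprint()). If you can guarantee that your input is well formed and that the fields always carry single quotes (or, more precisely, that the first and last character of every field in an entry never matter), you can do that by adding the following line next to the key = entry[1] line, before the entry is appended to its bucket: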

entry = [field[1:-1] for field in entry]

This removes the first and last character of every field.
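
Slicing with [1:-1] drops the first and last character even when they are not quotes. A slightly more defensive sketch, assuming quotes only ever appear at the ends of a field, uses str.strip("'") instead; the helper name `group_lines` and the in-memory `lines` list are illustrative only:

```python
from collections import defaultdict

def group_lines(lines):
    """Group comma-separated lines by their second field, dropping outer single quotes."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # strip("'") only removes quotes, so unquoted fields are left intact
        entry = [field.strip("'") for field in line.split(",")]
        buckets[entry[1]] += entry
    return [buckets[key] for key in sorted(buckets)]

# Mixed input: the key1 rows are quoted, the key2 rows are not
lines = [
    "'0001','key1','name'",
    "'0002','key1','age'",
    "0001,key2,name",
    "0002,key2,age",
]
print(group_lines(lines))
```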

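Answer 2 (score: 0)

Assuming your CSV file does not actually contain the single quotes (and they are only there for presentation here), this should work: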

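import pandas as pd

# Read every column as a string so values such as 0001 keep their leading zeros
Data = pd.read_csv('Test.csv', header=None, dtype=str)

# Group on the second column (label 1) and write each group as one flattened line;
# the with block closes the file even if an error occurs
with open('Result.csv', 'w') as f:
    for _, group in Data.groupby(1):
        f.write(','.join(group.values.ravel()) + '\n')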

The output is stored in Result.csv.