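How do I merge and combine rows that share the same ID (index) in Python?

Asked: 2019-05-26 21:51:27

Tags: python pandas csv

I am new to Python and am working with a CSV file of more than 10,000 rows. Many rows in the file share the same ID, and I would like to merge each such group into a single row while also combining their info.

For example, data.csv looks like this (id and info are the column names):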

id| info

1112| storage is full and needs extra space

1112| there is many problems with space 

1113| pickup cars come and take the garbage

1113| payment requires for the garbage 

The output I want is:

id| info

1112| storage is full and needs extra space there is many problems with space

1113| pickup cars come and take the garbage payment requires for the garbage

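I have already looked at several posts (1, 2, 3), but none of them answered my question.

It would be great if you could express your help as Python code, so that I can run it on my side and learn from it.

First attempt, based on @max's comment: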

import pandas as pd
data = pd.read_csv('data.csv', error_bad_lines=False);
data_text = data[['info']]
data_text['id'] = data_text.index
documents = data_text
print(len(documents))
print(documents[:5])


from collections import defaultdict

by_id = defaultdict(list)

for id, info in your_list:
    by_id[id].append(info)

for key, value in by_id.items()
    print(key, value)

Second attempt, as suggested by @Guaz:

import pandas as pd
data = pd.read_csv('data.csv', error_bad_lines=False);
data_text = data[['info']]
data_text['id'] = data_text.index
documents = data_text
print(len(documents))
print(documents[:5])

some_dict = {}
for idt, txt in id: 
    some_dict[idt] = some_dict.get(idt, "") + txt

Thanks

2 answers:

Answer 0: (score: 1)

Just create a dictionary keyed by id:

from collections import defaultdict

by_id = defaultdict(list)

for id, info in your_list:
    by_id[id].append(info)

for key, value in by_id.items():
    print(key, value)
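
To make the snippet above runnable, here is a minimal sketch that assumes `your_list` is a list of (id, info) pairs; the sample rows come from the question. The final join into one string per id is an addition, not part of the answer, included only to reach the output format the question asks for.

```python
from collections import defaultdict

# Assumed input: (id, info) pairs, here the sample rows from the question.
your_list = [
    (1112, "storage is full and needs extra space"),
    (1112, "there is many problems with space"),
    (1113, "pickup cars come and take the garbage"),
    (1113, "payment requires for the garbage"),
]

# Collect every info string under its id.
by_id = defaultdict(list)
for row_id, info in your_list:
    by_id[row_id].append(info)

# Added step: join each id's list into a single space-separated string.
merged = {row_id: " ".join(infos) for row_id, infos in by_id.items()}

for key, value in merged.items():
    print(key, value)
```

The loop variable is named `row_id` here rather than `id` only to avoid shadowing the built-in `id` function.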

Answer 1: (score: 1)

I thought of something a bit simpler:

some_dict = {}
for idt, txt in line: # `line` is a placeholder: use your own reader of (id, info) pairs.
    some_dict[idt] = some_dict.get(idt, "") + txt

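This should build the structure you want without any imports, and I hope it is the most efficient approach. To explain: get takes a second argument, which is the value to return when the key is not found in the dict. So for a new id it starts from an empty string, for an id already seen it starts from the stored text, and in both cases the new text is appended to it.

@Edit:

Here is a complete example with a reader :). Replace the reader entries with your own variables; it shows how this can be done :)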
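
As a small illustration of the `get` behaviour described above (the key and strings are made up for the example):

```python
some_dict = {}

# First call: key 1112 is missing, so get() returns the default "".
some_dict[1112] = some_dict.get(1112, "") + "storage is full"

# Second call: the key exists, so get() returns the stored text,
# and the new text is appended to it.
some_dict[1112] = some_dict.get(1112, "") + " and needs extra space"

print(some_dict)  # {1112: 'storage is full and needs extra space'}
```

Note that the plain `+` adds no separator, so any space between entries has to be included explicitly.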

import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for idt, info in reader:
        temp = some_dict.get(idt, "")
        # Put a space between entries, but not before the first one.
        some_dict[idt] = temp+" "+info if temp else info
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")

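This is a complete program and should work for you. However, it will not work if the file has more than two columns; in that case, replace idt, info with row and index into it for the first and second elements.

@Next edit:

More than 2 columns: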

import csv
import pandas as pd

some_dict = {}
with open('file.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        temp = some_dict.get(row[0], "")
        some_dict[row[0]] = temp+" "+row[1] if temp else row[1]
        # Here you can handle the other columns as well if you want.
        # Example: another_dict[row[2]] = another_dict.get(row[2], "") + row[3]
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
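
For comparison, and not taken from either answer above, the same grouping can probably be done in pandas alone with `groupby`. This is only a sketch: it assumes the file is comma-separated with columns named id and info, and it uses an in-memory string in place of data.csv so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in for data.csv (assumed layout: comma-separated, columns id and info).
csv_text = """id,info
1112,storage is full and needs extra space
1112,there is many problems with space
1113,pickup cars come and take the garbage
1113,payment requires for the garbage
"""

data = pd.read_csv(io.StringIO(csv_text))

# Group rows by id and join each group's info strings with a space.
merged = data.groupby("id")["info"].agg(" ".join).reset_index()

print(merged)
```

With a real file, `pd.read_csv('data.csv')` would replace the `io.StringIO` stand-in, with a `sep` argument if the file uses a delimiter other than a comma.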