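I'm new to Python and am working with a CSV file of more than 10,000 rows. Many rows in the file share the same ID, and I would like to merge them into a single row, combining their info as well.
For example, data.csv looks like this (id and info are the column names):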
id| info
1112| storage is full and needs extra space
1112| there is many problems with space
1113| pickup cars come and take the garbage
1113| payment requires for the garbage
The output I want is:
id| info
1112| storage is full and needs extra space there is many problems with space
1113| pickup cars come and take the garbage payment requires for the garbage
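I have looked at a few posts, such as 1 2 3, but none of them helped me answer my question.
It would be great if you could illustrate your help with Python code, so that I can run it on my side and learn from it.
First attempt, based on @max's comment: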
import pandas as pd
data = pd.read_csv('data.csv', error_bad_lines=False);
data_text = data[['info']]
data_text['id'] = data_text.index
documents = data_text
print(len(documents))
print(documents[:5])
from collections import defaultdict
by_id = defaultdict(list)
for id, info in your_list:
    by_id[id].append(info)
for key, value in by_id.items():
    print(key, value)
Second attempt, suggested by @Guaz:
import pandas as pd
data = pd.read_csv('data.csv', error_bad_lines=False);
data_text = data[['info']]
data_text['id'] = data_text.index
documents = data_text
print(len(documents))
print(documents[:5])
some_dict = {}
for idt, txt in id:
    some_dict[idt] = some_dict.get(idt, "") + txt
Thanks.
Answer 0 (score: 1)
Just create a dictionary keyed by id:
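from collections import defaultdict

# your_list stands for any iterable of (id, info) pairs, e.g. rows from a CSV reader.
by_id = defaultdict(list)
for row_id, info in your_list:
    by_id[row_id].append(info)  # collect every info string under its id
for key, value in by_id.items():
    print(key, value)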
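The loop above assumes `your_list` already yields `(id, info)` pairs and leaves each id with a list of strings. Below is a minimal sketch of one way to read the pairs and join each list into a single string. It assumes a comma-separated file with a header row; the `SAMPLE` text and the `merge_rows` helper are made up for illustration and are not part of the original answer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for data.csv; assumes comma-separated
# values and a header row with the column names id and info.
SAMPLE = """id,info
1112,storage is full and needs extra space
1112,there is many problems with space
1113,pickup cars come and take the garbage
1113,payment requires for the garbage
"""

def merge_rows(lines):
    """Group the info column by id and join each group with a space."""
    reader = csv.DictReader(lines)
    by_id = defaultdict(list)
    for row in reader:
        by_id[row["id"]].append(row["info"].strip())
    return {key: " ".join(values) for key, values in by_id.items()}

merged = merge_rows(io.StringIO(SAMPLE))
for key, text in merged.items():
    print(key, text)
```

With a real file, something like `merge_rows(open('data.csv', newline=''))` should behave the same way, provided the delimiter and header match these assumptions.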
Answer 1 (score: 1)
I thought of a somewhat simpler approach:
some_dict = {}
for idt, txt in line:  # For line, use your (id, info) reader.
    some_dict[idt] = some_dict.get(idt, "") + txt
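This should build the structure you want without any imports, and I hope it is the most efficient way.
To explain: get takes a second argument, the value to return when the key is not found in the dict. So for a new id it starts from an empty string and appends the text; for an id that already exists, it appends the text to the stored value.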
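To see what the default argument of `get` does in isolation, here is a small sketch with made-up pairs (the `pairs` list is illustrative only, not data from the question):

```python
# Hypothetical (id, text) pairs used only to demonstrate dict.get.
pairs = [("1112", "storage is full"), ("1112", "needs space"), ("1113", "pickup")]

merged = {}
for idt, txt in pairs:
    previous = merged.get(idt, "")  # "" the first time an id is seen
    merged[idt] = previous + txt    # plain concatenation, no separator

missing = merged.get("9999")        # without a default, get returns None
print(merged)
print(missing)
```

Note that plain concatenation glues the texts together without a space, so a separator has to be added explicitly if one is wanted.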
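@Edit:
Here is a complete example with a reader :). Replace the reader entries with your own variables as appropriate; it shows how this can be done :)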
import csv
import pandas as pd

some_dict = {}
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    for idt, info in reader:
        temp = some_dict.get(idt, "")
        # Join with a space once the id already has text.
        some_dict[idt] = temp + " " + info if temp else info
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
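That is a complete program and should work for you.
However, it will not work properly if the file has more than two columns; in that case, replace idt, info with row and use indexes for the first and second elements.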
@Next edit:
More than two columns:
import csv
import pandas as pd

some_dict = {}
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # row[0] is the id, row[1] is the info text.
        temp = some_dict.get(row[0], "")
        some_dict[row[0]] = temp + " " + row[1] if temp else row[1]
        # Here you can do something with the other columns if you want.
        # Example: another_dict[row[2]] = another_dict.get(row[2], "") + row[3]
print(some_dict)
df = pd.Series(some_dict).to_frame("Title of your column")
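The question asks for a merged table rather than a printed dict, so here is a sketch of writing the result back as CSV using only the standard library. It assumes `some_dict` has the shape produced by the loop above; the sample values, the `id`/`info` header, and the `merged.csv` name in the comment are illustrative assumptions, and the in-memory `out` buffer stands in for a real output file.

```python
import csv
import io

# Hypothetical merged result, in the shape some_dict has after the loop above.
some_dict = {
    "1112": "storage is full and needs extra space there is many problems with space",
    "1113": "pickup cars come and take the garbage payment requires for the garbage",
}

out = io.StringIO()  # stands in for open('merged.csv', 'w', newline='')
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["id", "info"])   # header row
for idt, info in some_dict.items():
    writer.writerow([idt, info])  # one line per unique id

print(out.getvalue())
```

If the input file has a header row, it would also need to be skipped while reading (for example with `next(reader)` before the loop), otherwise the header ends up in the dict as an ordinary id.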