I'm using Pandas for a data analysis project I'm working on.
Basically, if I have example CSV 'A':
id | item_num
A123 | 1
A123 | 2
B456 | 1
And I have example CSV 'B':
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I perform a Pandas merge, the result is:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
How can I get this instead:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...
Here is my code:
import pandas as pd
# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# Merge the two DataFrames (pd.merge joins on the shared "id" column by default).
# DataFrame.append was removed in pandas 2.0, so the merge result is used directly.
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
I'd really appreciate any help - I'm really stuck! I'm dealing with 20,000+ rows.
Thanks.
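EDIT: My post was flagged as a potential duplicate. It isn't, because I'm not necessarily trying to add a column - I just want to stop description from being multiplied by the number of item_num values attributed to a particular id.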
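To make the multiplication concrete, here is a minimal sketch. It rebuilds the two sample tables inline instead of reading the CSV files (an assumption made for illustration) and counts the rows per id. A merge on id alone pairs every row of A with every row of B that shares that id, so A123 yields 2 x 2 = 4 rows.

```python
import pandas as pd

# Sample data copied from the tables above, built inline instead of read from CSV.
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# With no "on" argument, merge joins on the shared column(s), here only "id".
merged = pd.merge(first, second)

# Rows per id in each input, and in the merged result.
rows_in_first = first.groupby("id").size()
rows_in_second = second.groupby("id").size()
rows_in_merged = merged.groupby("id").size()

print(rows_in_first * rows_in_second)  # product of the per-id counts
print(rows_in_merged)                  # same counts as the product above
```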
Update, 6/21:
How could I do the merge if the 2 DFs look like this?
I have example CSV 'B':
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
So that I end up with:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
Meaning, the row with 3 and "amet, consectetur" in "other_col" would be ignored.
Answer 0 (score: 1)
Try indexing your df, then dropping the duplicates:
df = df.set_index(['id', 'item_num']).drop_duplicates()
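As a quick check of this suggestion on the sample data, the sketch below applies it to the merged frame. One thing worth knowing: drop_duplicates compares column values only and ignores the index, so once id and item_num are moved into the index, only description is compared. The inline tables and the names first, second, merged and deduped are assumptions for illustration.

```python
import pandas as pd

# Sample tables from the question, rebuilt inline (assumed, not read from CSV).
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})
merged = pd.merge(first, second)  # 5 rows, as in the question

# After set_index, the only remaining column is "description",
# so duplicates are judged on description alone (the index is ignored).
deduped = merged.set_index(["id", "item_num"]).drop_duplicates()

print(deduped)
print(sorted(deduped.index.tolist()))
```

On this data the result has three rows, but item_num 2 for A123 no longer appears, so it is worth comparing the output against the table you want before relying on it.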
Answer 1 (score: 1)
I think you need concat:
result = pd.concat([df1.set_index('id'), df2.set_index('id')],axis = 1).reset_index()
which gives you:
id item_num description
0 A123 1 Mary had a...
1 A123 2 ...little lamb
2 B456 1 ...Its fleece...
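A self-contained sketch of this on the sample data follows; df1 and df2 are rebuilt inline as an assumption. Note that concat with axis=1 aligns rows by index label, so with repeated ids this appears to depend on both frames listing the same ids in the same order, as the sample tables do.

```python
import pandas as pd

# Sample tables from the question, rebuilt inline (assumed, not read from CSV).
df1 = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
df2 = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Both indexes are the identical sequence ['A123', 'A123', 'B456'],
# so the columns are placed side by side, row for row.
result = pd.concat([df1.set_index("id"), df2.set_index("id")], axis=1).reset_index()

print(result)
```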
Answer 2 (score: 1)
I'd do it this way:
In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))
In [136]: result
Out[136]:
id item_num description
0 A123 1 Mary had a...
1 A123 2 ...little lamb.
2 B456 1 ...Its fleece...
Explanation: we can create a "virtual" item_num column in the B DF to join on:
In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
id description item_num
0 A123 Mary had a... 1
1 A123 ...little lamb. 2
2 B456 ...Its fleece... 1
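Putting the steps together, here is a self-contained sketch with the tables rebuilt inline as an assumption. It also tries the shape from the 6/21 update, where A has an extra item_num 3 row for A123; the B table for that case is assumed to be the original description table. Because merge defaults to an inner join on the shared columns (id and item_num once the helper column exists), the unmatched row drops out.

```python
import pandas as pd

# A as in the 6/21 update: A123 has three items, but only two descriptions exist.
A = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})

# Assumed B: the original description table from the question.
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Number each description within its id (1, 2, ...) so it can pair with item_num.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)

# Inner join on the shared columns: id and item_num.
result = A.merge(B_numbered)

# Keep only the columns from the desired output.
result = result[["id", "item_num", "description"]]
print(result)
```

This relies on the rows of B appearing in item_num order within each id; if B is not ordered that way, the numbering from cumcount will not line up with item_num.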