Question

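I need to append to a DataFrame (or a Series, if that is more efficient) frequently, while making sure the additions never create duplicates. Simply calling drop_duplicates after each addition seems increasingly inefficient as the DataFrame grows, because the entire dataset has to be checked for duplicates every time.

The data has only two columns, so I am guessing that turning one of them into the index might speed things up (or putting both columns into a hierarchical index). Does pandas have a way to disallow duplicate index values?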

Here is an example of the problem:

print(accumulating_result)
  c1  c2
0  A  x1
1  B  x2
2  B  x3
3  C  x4

print(new)
  c1  c2
0  B  x3
1  C  x4
2  C  x5

Appending new to accumulating_result should give:

print(accumulating_result)
  c1  c2
0  A  x1
1  B  x2
2  B  x3
3  C  x4
4  C  x5

For what it's worth, every entry in column c2 is unique.

Any ideas?

Answer 1

You can use combine_first():

data1 = """  c1  c2
0  A  x1
1  B  x2
2  B  x3
3  C  x4"""


data2 = """  c1  c2
0  X  x3
1  Y  x4
2  Z  x5"""

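import io
import pandas as pd

# io.StringIO wraps a str; io.BytesIO only accepts bytes on Python 3
df1 = pd.read_csv(io.StringIO(data1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(data2), sep=r"\s+")

# use the unique column c2 as the index so rows are matched by key
df1.set_index("c2", inplace=True)
df2.set_index("c2", inplace=True)

# values already in df1 win; df2 only contributes keys that df1 lacks
df1.combine_first(df2)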

Output:

   c1
c2   
x1  A
x2  B
x3  B
x4  C
x5  Z
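
For the append-only case in the question, one possible alternative is to filter the incoming rows against the existing index before concatenating. This is only a sketch, not part of the original answer: it assumes c2 is used as the index and that c2 is unique within each incoming batch. The variable name unseen is made up for illustration.

```python
import pandas as pd

# Sample data from the question, with the unique column c2 as the index
accumulating_result = pd.DataFrame(
    {"c1": ["A", "B", "B", "C"]},
    index=pd.Index(["x1", "x2", "x3", "x4"], name="c2"),
)
new = pd.DataFrame(
    {"c1": ["B", "C", "C"]},
    index=pd.Index(["x3", "x4", "x5"], name="c2"),
)

# Keep only the rows whose c2 key is not already present in the accumulated index
unseen = new[~new.index.isin(accumulating_result.index)]

# verify_integrity=True makes concat raise ValueError if any index label repeats,
# which acts as a guard against duplicates slipping in
accumulating_result = pd.concat([accumulating_result, unseen], verify_integrity=True)
print(accumulating_result)
```

The membership test runs against the index rather than rescanning every row with drop_duplicates, but pd.concat still builds a new DataFrame on each call.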

Note that combine_first() copies all of the data on every call, though. HDF5 or a database may be a better fit.
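
As a rough illustration of the database suggestion, the sketch below uses Python's built-in sqlite3 module and lets a PRIMARY KEY on c2 reject duplicates. The table name pairs and the helper append_rows are hypothetical, chosen only for this example, and an in-memory database stands in for a real file.

```python
import sqlite3

# In-memory database for the example; a file path would persist the data
conn = sqlite3.connect(":memory:")

# c2 is the primary key, so the database itself refuses duplicate c2 values
conn.execute("CREATE TABLE pairs (c1 TEXT NOT NULL, c2 TEXT PRIMARY KEY)")


def append_rows(conn, rows):
    """Insert (c1, c2) tuples, silently skipping any c2 that already exists."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO pairs (c1, c2) VALUES (?, ?)", rows
        )


# Initial contents, then the batch called "new" in the question
append_rows(conn, [("A", "x1"), ("B", "x2"), ("B", "x3"), ("C", "x4")])
append_rows(conn, [("B", "x3"), ("C", "x4"), ("C", "x5")])

result = conn.execute("SELECT c1, c2 FROM pairs ORDER BY c2").fetchall()
print(result)
```

Each batch is checked against the key index inside the database, so existing rows are not copied when new ones are added. If a DataFrame is needed afterwards, pandas.read_sql_query can load the table back.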
