Question

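I need to replace values in a large DataFrame before saving it (I am actually reading a SAS table of 1M+ rows in 200k-row chunks, formatting the data, and saving it to a castra store). I use Series.map(dict).combine_first(Series) to replace values, and it is fast. But it cannot replace a value with NaN, because combine_first restores the old value in that case. I tried the replace method; it ran for a while and eventually failed with a "Cannot compare types object and str" error.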

Here is a related code example (a 200k-element int Series and a replacement dict with 12k entries):

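import pandas as pd

sl = pd.Series(range(200000))
r = {i: -i for i in range(100000, 112000)}  # 12k replacements

# Approach 1: map, then fill the unmapped entries from the original Series
sl2 = sl.map(r).combine_first(sl)
>> sl2[100001]
>> -100001.0

# Approach 2: replace
sl3 = sl.replace(r)
>> TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'int'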

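The first approach somehow converts int to float (not a problem, since my data is mostly strings), while the second consumed 8% of my 8 GB of RAM before the error occurred.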
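A minimal sketch of the two side effects described above, on small made-up data (the Series contents and the `mapping` dict are illustrative assumptions, not the real SAS data). It suggests why the float conversion happens and why a NaN replacement gets lost: `map` yields NaN both for keys mapped to NaN and for values with no key at all, and `combine_first` then fills every NaN from the original Series.

```python
import numpy as np
import pandas as pd

# Integer case: values without a key become NaN after map,
# which forces the result to a float dtype.
ints = pd.Series(range(5))
out = ints.map({3: -3}).combine_first(ints)

# String case: "b" is supposed to become missing.
s = pd.Series(["a", "b", "c"])
mapping = {"a": "x", "b": np.nan}  # hypothetical replacement dict

mapped = s.map(mapping)            # "x", NaN (explicit), NaN (no key for "c")
result = mapped.combine_first(s)   # every NaN is filled back from s

print(out.dtype)
print(result.tolist())
```

After `map`, the explicit NaN for "b" and the implicit NaN for "c" are indistinguishable, so `combine_first` restores both.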

So how can I replace values and also set some of them to NaN?

Answer 1

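I decided to combine the two approaches: first map the non-null replacement values, then use replace for the entries whose replacement value is null.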

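def replace(s, d):
    # Accept either a dict or a Series as the replacement mapping
    if not isinstance(d, pd.Series):
        d = pd.Series(d)
    # Split off the entries whose replacement value is null
    dn = d[d.isnull()]
    if len(dn):
        d = d[~d.index.isin(dn.index)]
    # Non-null replacements: fast map, unmapped values restored from s
    if len(d):
        s = s.map(d).combine_first(s)
    # Null replacements: done with replace, since combine_first would undo them
    if len(dn):
        s = s.replace(dn)
    return s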
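For comparison, here is a sketch of a different technique, not the method used in this answer: build a boolean mask of the values that have a key in the dict, and take the mapped value only where the mask is set. The function name `replace_with_mask` and the sample data are assumptions made for illustration. Because the mask distinguishes "mapped to NaN" from "no key", NaN replacements should survive without a separate replace pass.

```python
import numpy as np
import pandas as pd

def replace_with_mask(s, d):
    """Replace values of s that appear as keys of d; NaN targets are kept."""
    # True where the value has an entry in the replacement dict
    mask = s.isin(list(d.keys()))
    # Keep the original where there is no key, otherwise take the mapped value
    return s.where(~mask, s.map(d))

s = pd.Series(["a", "b", "c", "d"])
d = {"a": "x", "b": np.nan}  # hypothetical replacement dict

out = replace_with_mask(s, d)
print(out)
```

Whether this is as fast as the map plus combine_first approach on 200k-row chunks has not been measured here.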
