Question

I have a callback method that loops over the incoming data, builds Python dictionaries, and appends them to a pandas DataFrame:

def process_data(self, _data, ec_search, ec_helpers, _log):
    _data_dict = {}
    for single_data in _data:
        _id = single_data.get('id')
        latlon = single_data.get('latlon')
        country_code = single_data.get('country_code')
        _data_dict[_id] = {'latlon': latlon, 'country_code': country_code}

    output = pd.DataFrame() # what to do here?
    output = output.append(_data_dict, ignore_index=True)
    print(output.head())

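The problem is that when I call this callback from another Python function, it creates the pandas DataFrame and appends the dictionary as rows, but on the second and every later call it re-initializes output = pd.DataFrame() before appending the dictionary. I want my existing DataFrame to stay intact as I add dictionaries. I have seen similar solutions that use pd.concat, but I am not sure whether that is the right approach or whether it will cause performance problems, since I have to process about 10 million datasets.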

Answer 1

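Well, you could use the global keyword to get at the DataFrame and build in a check for whether it already exists. Alternatively, create an empty df as a global variable at the start of the program. Either way, if you want to keep state, the DataFrame has to live outside the function.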

Example:

x = 1

def some_function():
    global x
    for i in range(1, 10):
        x += 1

some_function()
print(x)

This prints 10, because the variable x is stored outside the function and declared with global inside it.
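Applied to the question's callback, the same idea might look like the sketch below. It is only an illustration under assumptions: process_data is trimmed down to a single argument, the sample rows are made up, and pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Module-level state: stays None until the first batch arrives.
output = None

def process_data(_data):
    """Hypothetical, trimmed-down version of the question's callback."""
    global output
    rows = [
        {
            'id': d.get('id'),
            'latlon': d.get('latlon'),
            'country_code': d.get('country_code'),
        }
        for d in _data
    ]
    batch = pd.DataFrame(rows)
    # Check whether the frame already exists instead of re-creating it.
    if output is None:
        output = batch
    else:
        output = pd.concat([output, batch], ignore_index=True)

# Made-up sample data: two separate callback invocations.
process_data([{'id': 1, 'latlon': '52.5,13.4', 'country_code': 'DE'}])
process_data([{'id': 2, 'latlon': '48.9,2.35', 'country_code': 'FR'}])
print(output)
```

After the second call the frame holds both rows, because output is never reset inside the function.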

Answer 2

Create a class:

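import pandas as pd

class Output(object):
    def __init__(self):
        self.data = pd.DataFrame()

    def append(self, _data_dict, ignore_index):
        # DataFrame.append needs pandas < 2.0; it was removed in pandas 2.0.
        self.data = self.data.append(_data_dict, ignore_index=ignore_index)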

Now this should work:

output = Output()
output.append(_data_dict, ignore_index=True)  # call it as many times as you want

print(output.data.head())
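Since the question mentions about 10 million datasets, a variation on this class may be worth considering: every append or concat on a DataFrame copies the existing data, so growing a frame call by call gets slower as it grows. The sketch below buffers plain dictionaries in a list and builds the DataFrame once at the end. RowCollector and to_frame are hypothetical names, and the sample values are made up.

```python
import pandas as pd

class RowCollector:
    """Hypothetical helper: buffers rows, builds the DataFrame on demand."""

    def __init__(self):
        self._rows = []

    def append(self, data_dict):
        # data_dict maps id -> {'latlon': ..., 'country_code': ...},
        # the shape built inside the question's callback.
        for _id, fields in data_dict.items():
            self._rows.append({'id': _id, **fields})

    def to_frame(self):
        # One DataFrame construction instead of one copy per callback call.
        return pd.DataFrame(self._rows)

collector = RowCollector()
collector.append({1: {'latlon': '52.5,13.4', 'country_code': 'DE'}})
collector.append({
    2: {'latlon': '48.9,2.35', 'country_code': 'FR'},
    3: {'latlon': '40.4,-3.7', 'country_code': 'ES'},
})

df = collector.to_frame()
print(df.head())
```

Appending to a Python list is cheap, so the cost of building the frame is paid only once, when to_frame() is called.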

Appending dictionaries to a pandas DataFrame in a callback
