Question

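Background

I use pandas to load large datasets (up to a few million rows) from a DB (Redshift). Some of the DB columns were originally saved as JSON strings. After loading the data, I use json.loads to convert the cells that contain dumped JSON objects into dictionaries. These objects are around 2.5MB per cell.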

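Problem

When going over large dataframes, this process can take more than 10GB of memory to complete, and it even crashes on some machines. Execution time is also slow, but that is a minor issue for me. The problem only arises when cells contain large JSON objects.

All operations are done in memory, and I would rather avoid HDFS or other disk-based solutions, because the dataframe itself is easier to manage in memory.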
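To put a rough number on the memory growth described above, here is a minimal sketch using only the standard library. It compares the size of a JSON string with the memory allocated while parsing it. The payload is a hypothetical stand-in for one cell, not the real data, and the exact figures will vary by Python version.

```python
import json
import sys
import tracemalloc

# Hypothetical payload standing in for one large JSON cell.
payload = json.dumps(
    {"key_%d" % i: {"value": i, "tags": ["a", "b", "c"]} for i in range(10000)}
)

# Measure how much memory is allocated while the string is parsed.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
parsed = json.loads(payload)
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

string_bytes = sys.getsizeof(payload)
parsed_bytes = after - before
print("string: %d bytes, parsed: %d bytes" % (string_bytes, parsed_bytes))
```

On this toy payload the parsed structure takes several times the memory of the string it came from, since every nested dict, list and key is a separate Python object. If the real cells behave similarly, that overhead would be one plausible contributor to the usage I am seeing.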

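Failed attempts at a solution

Using another module to parse the dictionaries - I tried ujson, simplejson and ast.literal_eval instead of the json module, with no noticeable difference in performance.

Using apply vs. using map - I tried both, and neither seems to improve the memory usage.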
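For reference, the apply and map variants can be sketched as follows. The toy frame and the column name `payload` are hypothetical; the real column holds much larger strings.

```python
import json

import pandas as pd

# Hypothetical toy frame; the real column holds much larger JSON strings.
df = pd.DataFrame({"payload": ['{"a": 1}', '{"b": [1, 2]}', '{"c": {"d": null}}']})

via_apply = df["payload"].apply(json.loads)             # Series.apply
via_map = df["payload"].map(json.loads)                 # Series.map
via_builtin_map = list(map(json.loads, df["payload"]))  # built-in map

# All three produce the same parsed objects.
print(via_apply.tolist() == via_map.tolist() == via_builtin_map)
```

All three call the parser once per cell and keep every resulting dictionary alive in the column, which may be why switching between them makes no visible difference to memory usage.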

Code

The dataframe is stored on the class as self.df, and this is the code that handles the conversion:

import json

def turnAllJsonColumnsToDict(self):
    """
    Scan the df and parse every column whose first cell looks like a JSON string.
    """
    print('Checking columns for needed type conversions')
    for col in self.df.columns:
        if self.check_if_json(self.df[col].iloc[0], col):
            self.loadColumnAsJson(col)

@staticmethod
def check_if_json(col, col_name, should_print=True):
    # Heuristic: a JSON object or array starts with '{' or '[' and ends with '}' or ']'.
    if isinstance(col, str):
        try:
            if col[0] in ('{', '[') and col[-1] in ('}', ']'):
                if should_print:
                    print('Converting', col_name)
                return True
        except IndexError:
            # Empty string: there is no first or last character to inspect.
            if should_print:
                print('Failed to check', col_name)
    return False

@timeIt  # timing decorator defined elsewhere in the project
def loadColumnAsJson(self, column):
    self.json_load_fail_counter = 0
    self.df[column] = [self.loadJson(value) for value in self.df[column]]
    print('{failed} cells failed to be parsed to json in {column} (out of {rows})'.format(
        failed=self.json_load_fail_counter, column=column,
        rows=len(self.df)))

def loadJson(self, value):
    # Empty cells and already-parsed dicts pass through unchanged.
    if not value or isinstance(value, dict):
        return value

    try:
        return json.loads(self.fix_dumped_json(value))
    except Exception:
        # Count the failure and fall back to an empty dict.
        self.json_load_fail_counter += 1
        return {}

@staticmethod
def fix_dumped_json(value):
    value = value.replace('"None"', 'null')
    # Append any closing braces that are missing, e.g. from a truncated dump.
    missing = value.count('{') - value.count('}')
    if missing > 0:
        value += '}' * missing
    return value
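To show what the brace-repair step does on a concrete input, here is a small worked example. It uses a standalone function with the same behavior as fix_dumped_json above, written for illustration only. The input strings are made up.

```python
import json


def fix_dumped_json(value):
    """Standalone copy of the repair heuristic, for illustration only."""
    value = value.replace('"None"', 'null')
    # Append one '}' for every '{' that has no matching '}'.
    missing = value.count('{') - value.count('}')
    if missing > 0:
        value += '}' * missing
    return value


# A dump that was cut off before its two closing braces.
truncated = '{"a": {"b": "None"'
repaired = fix_dumped_json(truncated)
print(repaired)              # {"a": {"b": null}}
print(json.loads(repaired))  # {'a': {'b': None}}

# Limitation: braces inside string values are counted as well, so this
# input looks balanced to the heuristic and is returned unchanged.
tricky = '{"text": "}"'
print(fix_dumped_json(tricky) == tricky)  # True
```

The heuristic only counts characters, so it cannot tell structural braces from braces inside string values. Inputs like the second one still fail in json.loads and end up in the failure counter.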

Pandas - Improving memory usage when loading dumped JSON as dictionaries in a column

0 answers: