Question

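I have JSON data files in several directories that I want to import into Pandas for some data analysis. The format of the JSON depends on the type defined in the directory name. For example,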

dir1_typeA/
  file1
  file2
  ...
dir1_typeB/
  file1
  file2
  ...
dir2_typeB/
  file1
  ...
dir2_typeA/
  file1
  file2

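Each file contains one complex nested JSON string, which will become one row of a DataFrame. I will have two DataFrames, one each for TypeA and TypeB. I will append them later if needed.

So far I have collected all the file paths I need with os.walk, and I am trying to process them with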

    import json
    import os
    from glob import glob

    import pandas as pd

    PATH = 'dir/filepath'
    files = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], 'file*'))]

    for issuefile in files:
        with open(issuefile, 'r') as f:
            data = f.read()

        data_json = pd.json_normalize(json.loads(data))
        # the parent directory name (e.g. dir1_typeA) carries the type
        dir_type = os.path.basename(os.path.dirname(issuefile))
        data_json['type'] = dir_type
        # append to data frame for typeA and typeB
        if 'typeA' in dir_type:
            pass  # append to typeA dataframe
        else:
            pass  # append to typeB dataframe

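There is one more issue: files within a directory may have slightly different fields. For example, file1 may have a few more fields than file2 in dir1_typeA. So I need to accommodate this dynamic nature in the DataFrame for each type.

How do I create these two DataFrames?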
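
To make the concern concrete, here is a minimal sketch of two records with slightly different fields. The records and their field names are hypothetical stand-ins for two files from dir1_typeA. It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize. When normalizing a list of records, the resulting columns are the union of all fields, and a field that a record lacks shows up as NaN in that row.

```python
import pandas as pd

# Hypothetical records standing in for two files from dir1_typeA.
file1 = {"id": 1, "this": {"first": "thing", "second": "thing"}}
file2 = {"id": 2, "this": {"first": "thing"}, "that": "otherthing"}

# Nested keys are flattened into dotted column names such as "this.first".
df = pd.json_normalize([file1, file2])

# Fields missing from a record are filled with NaN in that record's row.
print(df.isna())
```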

Answer 1

I think you should concatenate the files first and then read them into pandas. Here is how to do it in bash (you could also do it in Python):

    # -type f keeps the directories themselves out of the list passed to cat
    cat $(find *_typeA -type f) > typeA
    cat $(find *_typeB -type f) > typeB
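
If you would rather do the concatenation in Python, something like the following should work. This is only a sketch: concat_json_files is a hypothetical helper name, and it assumes the layout from the question (directories whose names end in _typeA or _typeB, each containing one JSON document per file). Unlike cat, it re-serializes each document, so the output has exactly one document per line even if the input files are pretty-printed.

```python
import json
from pathlib import Path


def concat_json_files(root, type_suffix, out_path):
    """Write every JSON file under root/*<type_suffix>/ to out_path, one per line.

    Returns the number of documents written. (Hypothetical helper.)
    """
    count = 0
    with open(out_path, "w") as out:
        for directory in sorted(Path(root).glob(f"*{type_suffix}")):
            if not directory.is_dir():
                continue
            for path in sorted(directory.iterdir()):
                if path.is_file():
                    # Parse and re-dump so each document occupies a single line.
                    record = json.loads(path.read_text())
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count
```

Calling concat_json_files('.', '_typeA', 'typeA') and concat_json_files('.', '_typeB', 'typeB') would produce the same two files as the bash commands.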

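Assuming each file holds its JSON document on a single line, the concatenated file has one document per line, and you can import it into pandas with io.json.json_normalize (available as pd.json_normalize in pandas 1.0 and later):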

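    import json

    import pandas as pd

    with open('typeA') as f:
        # one JSON document per line; skip blank lines
        data = [json.loads(line) for line in f if line.strip()]

    # pandas 1.0+ exposes this as pd.json_normalize;
    # older versions use pd.io.json.json_normalize
    dfA = pd.json_normalize(data)

    dfA

    #          that this.first this.second
    # 0  otherthing      thing       thing
    # 1  otherthing      thing       thing
    # 2  otherthing      thing       thing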
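
If you want to skip the intermediate files, a rough alternative is to collect the parsed records per type in memory and normalize each list once. This is a sketch under assumptions, not a tested recipe for your data: load_frames and the source_dir column are hypothetical names, the type is taken from the end of the parent directory name, and each file is assumed to hold a single top-level JSON object.

```python
import json
from pathlib import Path

import pandas as pd


def load_frames(root, type_names=("typeA", "typeB")):
    """Return {type_name: DataFrame} with one row per JSON file under root.

    (Hypothetical helper; assumes root/<dir>_<type>/<file> as in the question.)
    """
    records = {name: [] for name in type_names}
    for path in sorted(Path(root).glob("*/*")):
        if not path.is_file():
            continue
        dir_name = path.parent.name  # e.g. "dir1_typeA"
        for name in type_names:
            if dir_name.endswith(name):
                record = json.loads(path.read_text())
                # Hypothetical extra column recording where the row came from.
                record["source_dir"] = dir_name
                records[name].append(record)
                break
    # Normalizing each list once yields the union of fields per type.
    return {name: pd.json_normalize(rows) for name, rows in records.items()}
```

With the layout from the question, frames = load_frames('dir/filepath') would give frames['typeA'] and frames['typeB'], and fields that only some files have appear as NaN in the other rows.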
