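Flattening a pandas DataFrame with a JSON column

Time: 2018-05-17 10:55:13

Tags: python pandas

I have a very large dataset in CSV format in which one column holds JSON strings. I want to read this data into a flat pandas DataFrame. How can I do this efficiently?

Input CSV: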

col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33 

Expected DataFrame:

+---------------------------------------------------------------+
|   col1    |    col2     |   col3_1    |   col3_2  |   col4    |
+---------------------------------------------------------------+
|    1      | Programming |    None     |    Java   |    11     |
|    2      |    Sport    |    None     |   Soccer  |    22     |
|    3      |    Food     |    None     |   Pizza   |    33     |
+---------------------------------------------------------------+

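I can currently get the expected output with the code below. I am just wondering whether there is a more efficient way to achieve the same result.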

import json
import pandas
dataset = pandas.read_csv('/dataset.csv')
dataset['col3'] = dataset['col3'].apply(json.loads)
dataset['col3_1'] = dataset['col3'].apply(lambda row: row['col3_1'])
dataset['col3_2'] = dataset['col3'].apply(lambda row: row['col3_2'])
dataset = dataset.drop(columns=['col3'])

2 answers:

Answer 0 (score: 4)

You can parse the JSON in the pandas column with {{1}} and convert it into pandas columns with {{1}}:

{{1}}

Answer 1 (score: 3)

Use the DataFrame constructor together with pop to extract the column:

df1 = pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index)
df = df.join(df1)
print (df)
   col1         col2  col4 col3_1  col3_2
0     1  Programming    11   None    Java
1     2        Sport    22   None  Soccer
2     3         Food    33   None   Pizza

print (df.pop('col3').apply(pd.io.json.loads))
0      {'col3_1': None, 'col3_2': 'Java'}
1    {'col3_1': None, 'col3_2': 'Soccer'}
2     {'col3_1': None, 'col3_2': 'Pizza'}
Name: col3, dtype: object

print (pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index))
  col3_1  col3_2
0   None    Java
1   None  Soccer
2   None   Pizza
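
The steps above can be run end to end as a minimal sketch. It assumes the sample CSV from the question is held in an in-memory string, and it uses the standard-library json.loads in place of pd.io.json.loads, which may not be exposed under that name in every pandas version. The helper name flatten_json_column is hypothetical.

```python
import io
import json

import pandas as pd

# Sample data copied from the question, held in memory instead of a file.
CSV_TEXT = '''col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33
'''


def flatten_json_column(df, column):
    """Return a copy of df with the JSON string column expanded into columns.

    Hypothetical helper; assumes every row holds a JSON object.
    """
    df = df.copy()
    # pop removes the column from the copy and returns it as a Series.
    parsed = df.pop(column).apply(json.loads)
    # Build one DataFrame from the list of dicts, keeping the original index
    # so that join aligns the new columns row by row.
    expanded = pd.DataFrame(parsed.tolist(), index=df.index)
    return df.join(expanded)


df = pd.read_csv(io.StringIO(CSV_TEXT))
flat = flatten_json_column(df, "col3")
print(flat)
```

Working on a copy keeps the caller's frame intact, at the cost of one extra copy; on a very large dataset it may be preferable to pop from the original frame directly, as the answer does.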

Timing details:

df = pd.concat([df] * 10000, ignore_index=True)

In [204]: %timeit df.join(pd.DataFrame(df['col3'].apply(pd.io.json.loads).values.tolist(), index=df.index))
10 loops, best of 3: 76.4 ms per loop

In [205]: %timeit df.join(df['col3'].apply(lambda x: pd.Series(json.loads(x))))
1 loop, best of 3: 11.3 s per loop
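
Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is only an illustration: the function names via_constructor and via_series are hypothetical, the frame is much smaller than the one timed above so that the slow variant finishes quickly, and the absolute numbers will depend on the machine and the pandas version.

```python
import json
import timeit

import pandas as pd

# Synthetic frame: the three sample JSON strings repeated n_repeats times.
n_repeats = 200
base = pd.DataFrame({
    "col1": [1, 2, 3],
    "col3": [
        '{"col3_1": null, "col3_2": "Java"}',
        '{"col3_1": null, "col3_2": "Soccer"}',
        '{"col3_1": null, "col3_2": "Pizza"}',
    ],
})
df = pd.concat([base] * n_repeats, ignore_index=True)


def via_constructor(frame):
    # Parse every row, then build a single DataFrame from the list of dicts.
    parsed = frame["col3"].apply(json.loads)
    return pd.DataFrame(parsed.tolist(), index=frame.index)


def via_series(frame):
    # Build one pd.Series per row; pandas then assembles them into a frame.
    return frame["col3"].apply(lambda x: pd.Series(json.loads(x)))


# Total wall-clock time for three runs of each variant.
t_constructor = timeit.timeit(lambda: via_constructor(df), number=3)
t_series = timeit.timeit(lambda: via_series(df), number=3)
print(f"constructor: {t_constructor:.4f} s, per-row Series: {t_series:.4f} s")
```

Both functions return the same two extracted columns, so any difference in the printed times comes from how the result is assembled rather than from the JSON parsing itself.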

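The two solutions are similar, but their performance differs: on this 30,000-row frame, building a single DataFrame from the list of parsed dicts took 76.4 ms, while creating a pd.Series for every row took 11.3 s.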
