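How to normalize a JSON string column of a pandas DataFrame?

Time: 2019-02-06 03:36:41

Tags: python pandas

I am trying to split a column that holds a list of JSON objects into multiple columns, producing one row per object. Sample data is below: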

signalid    monthyear   readings
5135         201901    [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135         201901    [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135         201901    [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135         201901    [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]

Details about the data:

  1. The data is fetched from a Cassandra database.

  2. The column's data type in Cassandra is text.

  3. dataframe.readings[0]

    '[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'

  4. dataframe.readings.dtype

    dtype('O')

Normalization code:

normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
                            meta=["signalid", "monthyear"])

Expected output:

signalid    monthyear  v     q         t           g
5135         201901    90    1    1546444800000    0
5135         201901    50    1    1546444900000    0
5135         201901    40    1    1546445800000    0
5135         201901    30    1    1546446800000    0
5135         201901    35    1    1546446900000    0

Actual result:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
     60 #normalizeddataframe = pandas.read_json(dataframe.readings)
     61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
---> 62                             meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
     63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
     64 print("Processig completed");

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
    260                 records.extend(recs)
    261 
--> 262     _recursive_extract(data, record_path, {}, level=0)
    263 
    264     result = DataFrame(records)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
    236         else:
    237             for obj in data:
--> 238                 recs = _pull_field(obj, path[0])
    239 
    240                 # For repeating the metadata later

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
    183                 result = result[field]
    184         else:
--> 185             result = result[spec]
    186 
    187         return result

TypeError: string indices must be integers
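
The traceback can be read as follows: in the pandas version shown, json_normalize loops over data with `for obj in data`, and iterating over a DataFrame yields its column labels rather than its rows, so _pull_field ends up indexing a string such as 'signalid' with 'readings'. A minimal sketch of that step, using a one-row stand-in frame (illustrative, not the asker's full data):

```python
import pandas as pd

df = pd.DataFrame({
    "signalid": [5135],
    "monthyear": [201901],
    "readings": ['[{"v":"90","q":1,"t":1546444800000,"g":0}]'],
})

# Iterating over a DataFrame yields column labels, not rows.
labels = list(df)
print(labels)  # ['signalid', 'monthyear', 'readings']

# Indexing a label (a str) with the record path raises the TypeError.
error = None
try:
    labels[0]["readings"]
except TypeError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

Newer pandas releases may reject a DataFrame argument with a different error, but the underlying point is the same: json_normalize wants a dict or a list of dicts.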

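3 answers:

Answer 0 (score: 1)

I know this question is a few years old, but I'm adding my answer in case someone else runs into the same problem. It focuses on preserving the input and output structure the asker specified. I'm sure there is still room for optimization.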

import pandas as pd
import json

signalid = [5135, 5135, 5135, 5135]
monthyear = [201901, 201901, 201901, 201901]
readings = ['[{"v":"90","q":1,"t":1546444800000,"g":0}]', '[{"v":"50","q":1,"t":1546444900000,"g":0}]', '[{"v":"40","q":1,"t":1546445800000,"g":0}]', '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]']

# Reconstruct sample DataFrame
df = pd.DataFrame({
        "signalid": signalid,
        "monthyear": monthyear,
        "readings": readings
    })

# Convert strings to JSON objects
df['readings'] = df['readings'].map(json.loads)

# Can't use nested lists of JSON objects in pd.json_normalize
df = df.explode(column='readings').reset_index(drop=True)

# pd.json_normalize expects a list of JSON objects not a DataFrame
df = pd.concat([df[['signalid', 'monthyear']], pd.json_normalize(df['readings'])], axis=1) 

print(df)

Output:

   signalid  monthyear   v  q              t  g
0      5135     201901  90  1  1546444800000  0
1      5135     201901  50  1  1546444900000  0
2      5135     201901  40  1  1546445800000  0
3      5135     201901  30  1  1546446800000  0
4      5135     201901  35  1  1546446900000  0
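
An alternative that stays closer to the original call is to keep the record_path/meta form of json_normalize and pass it a list of row dicts instead of the DataFrame. A minimal sketch, assuming pandas 1.0 or later (where the function is available as pd.json_normalize) and a shortened copy of the sample data; the names records and flat are illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},'
        '{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse the JSON strings, then turn each row into a plain dict.
records = df.assign(readings=df["readings"].map(json.loads)).to_dict("records")

# record_path names the nested list; meta lists the fields repeated per reading.
flat = pd.json_normalize(
    records, record_path="readings", meta=["signalid", "monthyear"]
)
print(flat)
```

The meta columns are appended after the record columns, so select them in the desired order (for example flat[["signalid", "monthyear", "v", "q", "t", "g"]]) if the layout matters.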

Answer 1 (score: 0)

Suppose we have a DataFrame named data with a column called readings whose values are:

[{"v":"90","q":1,"t":1546444800000,"g":0}]

[{"v":"50","q":1,"t":1546444900000,"g":0}]

[{"v":"40","q":1,"t":1546445800000,"g":0}]

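import json

# Assumes each cell is a JSON string, as in the question;
# drop json.loads if the cells already hold lists of dicts.
parsed = data['readings'].map(json.loads)
# Copy the fields of the first reading in every row into new columns.
for key in parsed.iloc[0][0]:
    data[key] = [row[0][key] for row in parsed]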

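You did not say what output you expect when a list contains more than one JSON object. I think this may help you. If you describe the desired output, I can adjust my code accordingly.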

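Answer 2 (score: 0)

There may be a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a DataFrame whose index is given by signalid and monthyear.

This function converts a dictionary into a DataFrame and sets the index: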

def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))

The next function triggers the calls to to_df:

def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

We now apply concat to every row. The result is a Series of DataFrames, which we can concatenate to obtain the final data:

s = df.apply(concat, axis=1)
pd.concat(s.tolist())
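
The two functions expect each readings entry to be a list of dicts already, so the JSON strings from the question have to be parsed first. A minimal end-to-end sketch under that assumption, using a shortened copy of the sample data; it loops with iterrows in place of df.apply purely to keep each step explicit, and the names frames and result are illustrative:

```python
import json
import pandas as pd

def to_df(index, x):
    # One dict becomes a one-row DataFrame indexed by (signalid, monthyear).
    idx = pd.MultiIndex.from_arrays(index, names=["signalid", "monthyear"])
    return pd.DataFrame.from_dict(x, orient="index").T.set_index(idx)

def concat(y):
    # Stack the one-row frames for every reading in a single source row.
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},'
        '{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse the JSON text so each cell holds a list of dicts.
df["readings"] = df["readings"].map(json.loads)

frames = [concat(row) for _, row in df.iterrows()]
result = pd.concat(frames)
print(result)
```

Calling result.reset_index() turns signalid and monthyear back into ordinary columns, which matches the layout requested in the question.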