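I am trying to split a column that holds arrays of JSON objects into multiple columns, creating one row per object. Here is the sample data: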
signalid monthyear readings
5135 201901 [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135 201901 [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135 201901 [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135 201901 [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]
Details about the data:
The data is fetched from a Cassandra database.
The column's data type in Cassandra is text.
dataframe.readings[0]
'[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'
dataframe.readings.dtype
dtype('O')
Normalization code:
normalizeddataframe = json_normalize(data=dataframe, record_path='readings',
meta=["signalid", "monthyear"])
Expected output:
signalid monthyear v q t g
5135 201901 90 1 1546444800000 0
5135 201901 50 1 1546444900000 0
5135 201901 40 1 1546445800000 0
5135 201901 30 1 1546446800000 0
5135 201901 35 1 1546446900000 0
Result:
TypeError Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
60 #normalizeddataframe = pandas.read_json(dataframe.readings)
61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings',
---> 62 meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
64 print("Processig completed");
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
260 records.extend(recs)
261
--> 262 _recursive_extract(data, record_path, {}, level=0)
263
264 result = DataFrame(records)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
236 else:
237 for obj in data:
--> 238 recs = _pull_field(obj, path[0])
239
240 # For repeating the metadata later
C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
183 result = result[field]
184 else:
--> 185 result = result[spec]
186
187 return result
TypeError: string indices must be integers
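Answer 0 (score: 1)
I know this question is a few years old, but I thought I would add my answer in case anyone else runs into the same problem. It focuses on preserving the input and output structure the author specified. I'm sure there is still room for optimization.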
import pandas as pd
import json
signalid = [5135, 5135, 5135, 5135]
monthyear = [201901, 201901, 201901, 201901]
readings = ['[{"v":"90","q":1,"t":1546444800000,"g":0}]', '[{"v":"50","q":1,"t":1546444900000,"g":0}]', '[{"v":"40","q":1,"t":1546445800000,"g":0}]', '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
"signalid": signalid,
"monthyear": monthyear,
"readings": readings
})
# Convert strings to JSON objects
df['readings'] = df['readings'].map(json.loads)
# Can't use nested lists of JSON objects in pd.json_normalize
df = df.explode(column='readings').reset_index(drop=True)
# pd.json_normalize expects a list of JSON objects not a DataFrame
df = pd.concat([df[['signalid', 'monthyear']], pd.json_normalize(df['readings'])], axis=1)
print(df)
Output:
signalid monthyear v q t g
0 5135 201901 90 1 1546444800000 0
1 5135 201901 50 1 1546444900000 0
2 5135 201901 40 1 1546445800000 0
3 5135 201901 30 1 1546446800000 0
4 5135 201901 35 1 1546446900000 0
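As an alternative sketch, the question's original json_normalize call can also be made to work once the strings are parsed. The traceback is consistent with json_normalize iterating over the DataFrame itself, which yields column labels (plain strings) rather than rows, so indexing one of them with 'readings' raises the TypeError. Passing a list of row dicts avoids that. The two-row sample and the names records and flat are illustrative assumptions, not part of the original post.

```python
import json

import pandas as pd

# Two sample rows copied from the question; readings are JSON strings (Cassandra text column)
df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse each JSON string into a list of dicts
df["readings"] = df["readings"].map(json.loads)

# Iterating a DataFrame yields column labels, so pass a list of row dicts instead
records = df.to_dict("records")
flat = pd.json_normalize(records, record_path="readings", meta=["signalid", "monthyear"])

# Put the metadata columns first, matching the expected output in the question
flat = flat[["signalid", "monthyear", "v", "q", "t", "g"]]
print(flat)
```

Note that json_normalize typically returns the meta columns with object dtype, so a cast such as astype(int) may be needed afterwards if numeric types matter.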
Answer 1 (score: 0)
Suppose we have a data column named readings with the following values:
[{"v":"90","q":1,"t":1546444800000,"g":0}]
[{"v":"50","q":1,"t":1546444900000,"g":0}]
[{"v":"40","q":1,"t":1546445800000,"g":0}]
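# Assumes each entry in data['readings'] is already a parsed list holding a single dict
for idx, row in data['readings'].items():
    for value_dict in row:
        for key in value_dict:
            # Write to this row only; data[key] = ... would overwrite the whole column
            data.loc[idx, key] = value_dict[key]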
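You did not mention the expected output when a list contains more than one JSON object. I think this may help you. If you specify the desired output, I can adapt my code accordingly.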
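For lists that hold more than one JSON object, one possible extension is to emit a separate row per object, which is what the expected output in the question appears to ask for. This is only a sketch: it assumes the readings column already contains parsed lists of dicts, and the names rows and result are made up for illustration.

```python
import pandas as pd

# Sample frame; readings are assumed to be parsed lists of dicts already
data = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        [{"v": "40", "q": 1, "t": 1546445800000, "g": 0}],
        [{"v": "30", "q": 1, "t": 1546446800000, "g": 0},
         {"v": "35", "q": 1, "t": 1546446900000, "g": 0}],
    ],
})

# Collect one flat record per reading, repeating the row's metadata each time
rows = []
for signalid, monthyear, readings in zip(data["signalid"], data["monthyear"], data["readings"]):
    for value_dict in readings:
        rows.append({"signalid": signalid, "monthyear": monthyear, **value_dict})

result = pd.DataFrame(rows)
print(result)
```

Building a list of plain dicts and constructing the DataFrame once at the end avoids growing the frame cell by cell inside the loop.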
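Answer 2 (score: 0)
There may be a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a DataFrame whose index is given by signalid and monthyear.
This function converts a dictionary into a DataFrame and sets the index: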
def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))
The next function calls to_df for every reading in a row and concatenates the results:
def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))
We now apply concat to each row. The result is a Series of DataFrames, which we can concatenate to obtain the final data:
s = df.apply(concat, axis=1)
pd.concat(s.tolist())
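To try this approach end to end, here is a minimal sketch on two rows of the question's data. It assumes the readings strings are first parsed with json.loads, since the functions expect lists of dicts. It also iterates with itertuples instead of df.apply, a substitution made here so the example does not depend on how apply wraps DataFrame return values. The names frames and final are illustrative.

```python
import json

import pandas as pd


def to_df(index, x):
    # One reading dict -> one-row DataFrame indexed by (signalid, monthyear)
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(
        pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))


def concat(y):
    # All readings of one row -> one DataFrame
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))


df = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# The functions above need parsed lists of dicts, not JSON strings
df["readings"] = df["readings"].map(json.loads)

# Build one DataFrame per input row, then stack them
frames = [concat(row) for row in df.itertuples(index=False)]

# reset_index turns the (signalid, monthyear) index back into ordinary columns
final = pd.concat(frames).reset_index()
print(final)
```

Because each one-row frame is built by transposing a single column, the value columns come out with object dtype, so a later cast may be needed for numeric work.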