Consider the following fragment of a JSON file:
{
"participant_id": 37,
"response_date": "2016-05-19T07:19:32.620Z",
"data": {
"summary": 8,
"q6": [
"1",
"2"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
},
{
"participant_id": 37,
"response_date": "2016-05-26T07:14:24.7130Z",
"data": {
"summary": 8,
"q6": [
"1",
"2",
"4"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
}
This produces the following Pandas DataFrame:
0 q1 q2 q3 q4 q5 q6 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2] 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2, 4] 7.0 37 2016-05-26 07:14:24.713
How can the nested q6 column be expanded into a "wider" format? The q6 attribute can contain up to 4 possible values, so ideally the result would be:
0 q1 q2 q3 q4 q5 q6 q7 q8 q9 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 1.0 1.0 0.0 0.0 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 1.0 1.0 0.0 1.0 7.0 37 2016-05-26 07:14:24.713
So, basically, the numbers in the square brackets encode the positions of the 1s in a 4-element array.
Is there a simple Pandasian solution?
EDIT
Some entries were mistakenly recorded in reversed or random order (rows 1 and 3):
0 q1 q2 q3 q4 q5 q6 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2] 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 [1] 7.0 37 2016-05-20 07:14:24.713
740 NaN 0.0 1.0 1.0 2.0 2.0 [2, 1] 8.0 37 2016-05-21 07:10:17.251
774 NaN 0.0 1.0 1.0 1.0 3.0 [1, 2] 8.0 37 2016-05-22 08:28:14.579
809 NaN 0.0 1.0 1.0 1.0 3.0 [1, 2] 8.0 37 2016-05-23 07:30:27.259
They should be sorted before any further processing.
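The sorting step mentioned in the edit could look roughly like this. It is only a minimal sketch: the small frame is made up from three of the rows shown above, and it assumes every q6 cell holds a Python list of integers or digit strings.

```python
import pandas as pd

# Hypothetical frame mirroring three of the rows above; only q6 matters here.
df = pd.DataFrame({"q6": [[1, 2], [1], [2, 1]]}, index=[672, 711, 740])

# Sort every list numerically so that [2, 1] and [1, 2] compare equal.
# key=int also handles values stored as strings such as "1" and "2".
df["q6"] = df["q6"].map(lambda values: sorted(values, key=int))

print(df)
```

For the 0/1 indicator layout asked for above, the order inside each list does not change the result, so this normalization mainly helps when the lists themselves are compared or grouped.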
Answer 0 (score: 1)
import pandas as pd
import numpy as np
df = pd.DataFrame({"q6": [[1, 2], [1, 2, 4]]})
a = df.q6.values.tolist()
# Convert the 1-based values to 0-based column positions.
e = [np.array(x, dtype=int) - 1 for x in a]
b = np.zeros((len(e), 4))
for i in range(len(e)):
    b[i][e[i]] = 1
# Build the four indicator columns directly from the 2-D array.
df1 = pd.DataFrame(b, columns=['q6', 'q7', 'q8', 'q9'])
Finally, I created another DataFrame with the 4 columns you want.
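To attach indicator columns like these to the original frame, the row labels have to line up: the frame in the question is indexed 672, 711 rather than 0, 1. The following is a minimal sketch of one way to do that, assuming a made-up two-row frame with only a few of the question's columns; the names flags, wide and result are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the non-default index from the question.
df = pd.DataFrame(
    {"q5": [2.0, 2.0], "q6": [[1, 2], [1, 2, 4]], "summary": [8.0, 7.0]},
    index=[672, 711],
)

# One row per response, one column per possible q6 value (1..4).
flags = np.zeros((len(df), 4))
for row, values in enumerate(df["q6"]):
    flags[row, np.asarray(values, dtype=int) - 1] = 1

# Reuse df.index so that join aligns rows 672 and 711 correctly.
wide = pd.DataFrame(flags, columns=["q6", "q7", "q8", "q9"], index=df.index)
result = df.drop(columns="q6").join(wide)
print(result)
```

Passing index=df.index is the design choice that matters here; with a default 0..n-1 index on the indicator frame, the join would find no matching labels and fill the new columns with NaN.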
Answer 1 (score: 1)
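I think this is not easy. Use:
- the DataFrame constructor + get_dummies for the new df
- reindex to add the missing values + rename for the columns
- concat with the original, with the q6 column removed
- startswith + reindex_axis (removed in pandas 1.0; use reindex(columns=...) there) for the new column order
df1 = pd.get_dummies(pd.DataFrame(df['q6'].values.tolist()), prefix_sep='', prefix='')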
df1.columns = df1.columns.astype(int)
df1 =df1.reindex(columns=range(1,5),fill_value=0).rename(columns=lambda x: 'q{}'.format(x+5))
print (df1)
q6 q7 q8 q9
0 1 1 0 0
1 1 1 0 1
df = pd.concat([df.drop('q6', axis=1), df1], axis=1)
mask = df.columns.str.startswith('q', na=False)
cols1 = df.columns[mask].tolist()
cols2 = df.columns[~mask].tolist()
cols = cols2[:1] + cols1 + cols2[1:]
# reindex_axis was removed in pandas 1.0; reindex(columns=...) gives the same column order
df = df.reindex(columns=cols)
print (df)
0 q1 q2 q3 q4 q5 q6 q7 q8 q9 summary participant_id \
0 NaN 0 1 1 2 2 1 1 0 0 8 37
1 NaN 0 1 1 2 2 1 1 0 1 8 37
response_date
0 2016-05-19T07:19:32.620Z
1 2016-05-26T07:14:24.7130Z
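The get_dummies call above creates one block of dummy columns per list position, so a value that appears in different positions in different rows (for example [1, 2] and [2, 1] from the question's edit) can yield duplicate column labels, which reindex does not accept. One possible alternative is sketched below. It is not the approach of this answer; it assumes pandas 0.25 or later (for Series.explode), and the sample frame, including its empty list, is made up for illustration.

```python
import pandas as pd

# Hypothetical data: string values as in the JSON, one reversed list, one empty list.
df = pd.DataFrame(
    {"q6": [["1", "2"], ["2", "1"], ["1", "2", "4"], []]},
    index=[672, 740, 711, 800],
)

# One row per (response, selected value); empty lists become NaN and are dropped.
selected = df["q6"].explode().dropna().astype(int)

# Indicator columns per value, collapsed back to one row per response.
wide = (
    pd.get_dummies(selected, dtype=int)
    .groupby(level=0)
    .max()
    .reindex(index=df.index, columns=range(1, 5), fill_value=0)
    .rename(columns=lambda value: "q{}".format(value + 5))
)
print(wide)
```

Because the values are exploded into a single Series first, the order inside each list no longer matters, and the final reindex restores both the missing value 3 and any response whose list was empty.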
Answer 2 (score: 1)
>>> df1
participant_id q1 q2 q3 q4 q5 q6 response_date summary
0 37 0 1 1 2 2 [1, 2] 2016-05-19T07:19:32.620Z 8
1 37 0 1 1 2 2 [1, 2, 4] 2016-05-26T07:14:24.7130Z 8
>>> def widen(oldQ6):
...     l = [0.0, 0.0, 0.0, 0.0]
...     for i in oldQ6:
...         l[int(i)-1] = 1.0
...     return l
...
>>> df1['q6'], df1['q7'], df1['q8'], df1['q9'] = zip(*df1['q6'].map(widen))
>>> df1
participant_id q1 q2 q3 q4 q5 q6 response_date summary q7 q8 q9
0 37 0 1 1 2 2 1 2016-05-19T07:19:32.620Z 8 1 0 0
1 37 0 1 1 2 2 1 2016-05-26T07:14:24.7130Z 8 1 0 1