I have a DataFrame with a column whose values look like this:
[
{
"OrderID" : "0",
"TimeStamp" : "2019-09-24 10:17:48 +0000",
"Screen" : "Home_Screen",
"StateVars" : "",
"Event" : "A"
},
{
"Event" : "B",
"TimeStamp" : "2019-09-24 10:17:38 +0000",
"Screen" : "Home_Screen",
"StateVars" : "",
"OrderID" : "0"
},
{
"OrderID" : "0",
"TimeStamp" : "2019-09-24 10:17:35 +0000",
"Screen" : "Home_Screen",
"StateVars" : "",
"Event" : "D"
},
{
"Event" : "V",
"TimeStamp" : "2019-09-24 10:17:33 +0000",
"Screen" : "Home_Screen",
"StateVars" : "",
"OrderID" : "0"
},
{
"OrderID" : "0",
"TimeStamp" : "2019-09-24 10:17:32 +0000",
"Screen" : "Home_Screen",
"StateVars" : "",
"Event" : "C"
}
]
I want a separate column for every key. The original DataFrame looks like this:
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+
| | O | v | S | I | EventLog | CustomerID | a | b | c | d | e | f |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+
| 0 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 15 | NaN | NaN | NaN | NaN | NaN | NaN |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+
I am looking for something like this:
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+----------------------------+--------------+------------+------+
| | O | v | S | I | EventLog | CustomerID | OrderID | TimeStamp | Screen | StateVars | Event |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+----------------------------+--------------+------------+------+
| 0 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | 0 | 2019-09-24 10:17:33 +0000 | Home_Screen | NaN | A |
| 1 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | 0 | 2019-09-24 10:17:33 +0000 | Home_Screen | NaN | B |
| 2 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | 0 | 2019-09-24 10:17:33 +0000 | Home_Screen | NaN | C |
| 3 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | 0 | 2019-09-24 10:17:33 +0000 | Home_Screen | NaN | D |
| 4 | 1 | 0.4 | OS | 92D42D7E-68F0-4688-83C5-781920E05335 | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:... | 1 | 0 | 2019-09-24 10:17:33 +0000 | Home_Screen | NaN | E |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+----------------------------+--------------+------------+------+
It is not necessary to drop the columns as shown in the output above.
Answer 0 (score: 3)
First, create a DataFrame with the constructor:
import pandas as pd
# Expand each EventLog dict into one column per key
df1 = pd.DataFrame(df['EventLog'].values.tolist())
print(df1)
OrderID TimeStamp Screen StateVars Event
0 0 2019-09-24 10:17:48 +0000 Home_Screen A
1 0 2019-09-24 10:17:38 +0000 Home_Screen B
2 0 2019-09-24 10:17:35 +0000 Home_Screen D
3 0 2019-09-24 10:17:33 +0000 Home_Screen V
4 0 2019-09-24 10:17:32 +0000 Home_Screen C
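Then join it back to the original:
# join aligns on the index, so df is assumed to have the default 0..n-1 index
df = df.join(df1)
print(df)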
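As a minimal, self-contained sketch of these two steps: the sample rows and the name `out` are made up for illustration, and each `EventLog` cell is assumed to hold a single dict. Passing `index=df.index` to the constructor is an optional extra that keeps the new rows aligned with the original even when `df` does not have a default index.

```python
import pandas as pd

# Hypothetical sample data: each EventLog cell holds one dict.
df = pd.DataFrame({
    "CustomerID": [1, 1, 15],
    "EventLog": [
        {"OrderID": "0", "TimeStamp": "2019-09-24 10:17:48 +0000",
         "Screen": "Home_Screen", "StateVars": "", "Event": "A"},
        {"Event": "B", "TimeStamp": "2019-09-24 10:17:38 +0000",
         "Screen": "Home_Screen", "StateVars": "", "OrderID": "0"},
        {"OrderID": "0", "TimeStamp": "2019-09-24 10:17:35 +0000",
         "Screen": "Home_Screen", "StateVars": "", "Event": "D"},
    ],
})

# One column per key; reusing df.index keeps the rows aligned for the join.
df1 = pd.DataFrame(df["EventLog"].tolist(), index=df.index)

# Keep the original columns and append the expanded ones.
out = df.join(df1)
print(out)
```

The key order inside each dict does not matter, because the constructor matches values to columns by key.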
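EDIT: I think there are some missing values, so the solution is to replace them with empty dictionaries, which end up as rows of missing values: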
print (df)
EventLog
0 {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
1 {'Event': 'B', 'TimeStamp': '2019-09-24 10:17:...
2 {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
3 {'Event': 'V', 'TimeStamp': '2019-09-24 10:17:...
4 {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
5 NaN
# NaN != NaN, so x == x is False only for missing cells
df = pd.DataFrame([x if x == x else {} for x in df['EventLog']])
print(df)
OrderID TimeStamp Screen StateVars Event
0 0 2019-09-24 10:17:48 +0000 Home_Screen A
1 0 2019-09-24 10:17:38 +0000 Home_Screen B
2 0 2019-09-24 10:17:35 +0000 Home_Screen D
3 0 2019-09-24 10:17:33 +0000 Home_Screen V
4 0 2019-09-24 10:17:32 +0000 Home_Screen C
5 NaN NaN NaN NaN NaN
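If the `x == x` trick feels too implicit, here is a small sketch of the same idea that checks the type explicitly instead. The sample data and the names `records` and `expanded` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last EventLog cell is missing.
df = pd.DataFrame({
    "EventLog": [
        {"OrderID": "0", "Event": "A"},
        {"OrderID": "0", "Event": "B"},
        np.nan,
    ]
})

# Keep dicts as they are and replace anything else (NaN, None) with an empty dict.
records = [x if isinstance(x, dict) else {} for x in df["EventLog"]]

# An empty dict becomes a row in which every column is NaN.
expanded = pd.DataFrame(records, index=df.index)
print(expanded)
```

The `isinstance` check also covers `None`, which `x == x` would let through because `None == None` is true.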
Another solution:
a = df['EventLog'].values.tolist()
a = [x for x in a if x == x]  # drop missing cells (NaN != NaN)
records = []
for b in a:  # each cell is a list of dicts
    for c in b:  # each dict is one event
        records.append(c)
# DataFrame.append was removed in pandas 2.0, so build the frame in one call
events_df = pd.DataFrame(records)
df = df.join(events_df)
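As far as I can tell, the loop above flattens every list into one long frame, so the positional join only lines up when each cell holds exactly one dict. If the cells are lists of varying length, one possible alternative is to produce one row per event first with `DataFrame.explode` (available from pandas 0.25). This is a sketch of a different technique, not the method from the answer, and the data and the names `exploded`, `events` and `result` are hypothetical.

```python
import pandas as pd

# Hypothetical data: each EventLog cell is a list of dicts.
df = pd.DataFrame({
    "CustomerID": [1, 15],
    "EventLog": [
        [{"OrderID": "0", "Event": "A"}, {"OrderID": "0", "Event": "B"}],
        [{"OrderID": "0", "Event": "C"}],
    ],
})

# explode() emits one row per list element and repeats the other columns.
exploded = df.explode("EventLog").reset_index(drop=True)

# Expand each dict into columns; non-dict cells (e.g. NaN) become empty rows.
events = pd.DataFrame(
    [x if isinstance(x, dict) else {} for x in exploded["EventLog"]],
    index=exploded.index,
)

# Each event row keeps the CustomerID of the row it came from.
result = exploded.drop(columns="EventLog").join(events)
print(result)
```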