Pandas: create new columns from a list stored as a column value

Time: 2019-10-17 11:44:24

Tags: python pandas dataframe

I have a DataFrame with a column whose values look like this:

[
    {
      "OrderID" : "0",
      "TimeStamp" : "2019-09-24 10:17:48 +0000",
      "Screen" : "Home_Screen",
      "StateVars" : "",
      "Event" : "A"
    },
    {
      "Event" : "B",
      "TimeStamp" : "2019-09-24 10:17:38 +0000",
      "Screen" : "Home_Screen",
      "StateVars" : "",
      "OrderID" : "0"
    },
    {
      "OrderID" : "0",
      "TimeStamp" : "2019-09-24 10:17:35 +0000",
      "Screen" : "Home_Screen",
      "StateVars" : "",
      "Event" : "D"
    },
    {
      "Event" : "V",
      "TimeStamp" : "2019-09-24 10:17:33 +0000",
      "Screen" : "Home_Screen",
      "StateVars" : "",
      "OrderID" : "0"
    },
    {
      "OrderID" : "0",
      "TimeStamp" : "2019-09-24 10:17:32 +0000",
      "Screen" : "Home_Screen",
      "StateVars" : "",
      "Event" : "C"
    }
  ]

I want to create a column for each key. The original DataFrame looks like this:


+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+
|    | O          | v           | S       |               I                       |                     EventLog                       | CustomerID  |  a   |  b   |  c   |  d   |  e   |  f  |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+
| 0  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | NaN  | NaN  | NaN  | NaN  | NaN  | NaN |
| 1  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | NaN  | NaN  | NaN  | NaN  | NaN  | NaN |
| 2  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | NaN  | NaN  | NaN  | NaN  | NaN  | NaN |
| 3  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | NaN  | NaN  | NaN  | NaN  | NaN  | NaN |
| 4  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         15  | NaN  | NaN  | NaN  | NaN  | NaN  | NaN |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+------+------+------+------+------+-----+

I am looking for something like this:


+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+---------+----------------------------+--------------+------------+------+
|    | O          | v           | S       |               I                       |                     EventLog                       | CustomerID  | OrderID |  TimeStamp                 | Screen       | StateVars  |Event |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+---------+----------------------------+--------------+------------+------+
| 0  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | 0       | 2019-09-24 10:17:33 +0000  | Home_Screen  | NaN        | A    |
| 1  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | 0       | 2019-09-24 10:17:33 +0000  | Home_Screen  | NaN        | B    |
| 2  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | 0       | 2019-09-24 10:17:33 +0000  | Home_Screen  | NaN        | C    |
| 3  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | 0       | 2019-09-24 10:17:33 +0000  | Home_Screen  | NaN        | D    |
| 4  |      1     |        0.4  |  OS     | 92D42D7E-68F0-4688-83C5-781920E05335  | [{'OrderID': '0', 'TimeStamp': '2019-09-24 10:...  |         1   | 0       | 2019-09-24 10:17:33 +0000  | Home_Screen  | NaN        | E    |
+----+------------+-------------+---------+---------------------------------------+----------------------------------------------------+-------------+---------+----------------------------+--------------+------------+------+

The columns do not necessarily need to be dropped as shown in the output above.

1 Answer:

Answer 0 (score: 3)

First, create a DataFrame with the constructor:

df1 = pd.DataFrame(df['EventLog'].values.tolist())
print (df1)
  OrderID                  TimeStamp       Screen StateVars Event
0       0  2019-09-24 10:17:48 +0000  Home_Screen               A
1       0  2019-09-24 10:17:38 +0000  Home_Screen               B
2       0  2019-09-24 10:17:35 +0000  Home_Screen               D
3       0  2019-09-24 10:17:33 +0000  Home_Screen               V
4       0  2019-09-24 10:17:32 +0000  Home_Screen               C

Then join it to the original:

df = df.join(df1)
print (df)
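
For reference, the two steps above can be put together in a self-contained sketch. It assumes each `EventLog` cell holds a single dict, as in the printed output above. The sample frame and the `index=df.index` argument are illustrative additions, not part of the original answer; passing the index is a precaution so that `join` aligns row by row even when `df` does not have a default index.

```python
import pandas as pd

# Hypothetical sample data: one event dict per row.
df = pd.DataFrame({
    "CustomerID": [1, 1, 15],
    "EventLog": [
        {"OrderID": "0", "TimeStamp": "2019-09-24 10:17:48 +0000",
         "Screen": "Home_Screen", "StateVars": "", "Event": "A"},
        {"Event": "B", "TimeStamp": "2019-09-24 10:17:38 +0000",
         "Screen": "Home_Screen", "StateVars": "", "OrderID": "0"},
        {"OrderID": "0", "TimeStamp": "2019-09-24 10:17:35 +0000",
         "Screen": "Home_Screen", "StateVars": "", "Event": "D"},
    ],
})

# Expand the dicts into columns, reusing df's index so the join lines up.
df1 = pd.DataFrame(df["EventLog"].tolist(), index=df.index)

# Attach the new columns next to the original ones.
result = df.join(df1)
print(result.drop(columns="EventLog"))
```

The key order inside each dict does not matter here, because the constructor matches values to columns by key.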

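EDIT: I think there are some missing values, so the solution is to replace them with empty dictionaries, which produce rows of missing values in the final result: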

print (df)
                                            EventLog
0  {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
1  {'Event': 'B', 'TimeStamp': '2019-09-24 10:17:...
2  {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
3  {'Event': 'V', 'TimeStamp': '2019-09-24 10:17:...
4  {'OrderID': '0', 'TimeStamp': '2019-09-24 10:1...
5                                                NaN

# NaN is never equal to itself, so `x == x` is False only for missing values
df = pd.DataFrame([x if x == x else {} for x in df['EventLog']])
print(df)
  OrderID                  TimeStamp       Screen StateVars Event
0       0  2019-09-24 10:17:48 +0000  Home_Screen               A
1       0  2019-09-24 10:17:38 +0000  Home_Screen               B
2       0  2019-09-24 10:17:35 +0000  Home_Screen               D
3       0  2019-09-24 10:17:33 +0000  Home_Screen               V
4       0  2019-09-24 10:17:32 +0000  Home_Screen               C
5     NaN                        NaN          NaN       NaN   NaN
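
The snippet above overwrites `df`, so the original columns are lost. A minimal sketch of a variant that keeps them follows. It swaps the `x == x` test for an explicit `isinstance(x, dict)` check, which is a substitution made here for readability and not the answer's own code; the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the middle row has no event log.
df = pd.DataFrame({
    "CustomerID": [1, 1, 15],
    "EventLog": [
        {"OrderID": "0", "Event": "A"},
        np.nan,
        {"OrderID": "0", "Event": "D"},
    ],
})

# Replace anything that is not a dict (for example NaN) with an empty dict;
# an empty dict expands to a row of NaN.
cleaned = [x if isinstance(x, dict) else {} for x in df["EventLog"]]
expanded = pd.DataFrame(cleaned, index=df.index)

# Keep the original columns and add the expanded ones.
result = df.join(expanded)
print(result)
```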

Another solution:

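import pandas as pd

# Keep only non-missing entries (NaN != NaN), each of which is a list of dicts
logs = [x for x in df['EventLog'].values.tolist() if x == x]
# Flatten the lists into a single list of event dicts
records = [event for log in logs for event in log]
# DataFrame.append was removed in pandas 2.0, so build the frame in one call
events_df = pd.DataFrame(records)
df = df.join(events_df)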
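
If each `EventLog` cell holds a list of dicts (as in the JSON in the question), a different approach is to give every event its own row first. The sketch below uses `DataFrame.explode` and `pd.json_normalize`, which are not used in the original answer, and the sample data is hypothetical. It repeats the parent row's other columns for each event, which may or may not be the shape you want.

```python
import pandas as pd

# Hypothetical sample data: each cell holds a list of event dicts.
event_a = {"OrderID": "0", "Screen": "Home_Screen", "Event": "A"}
event_b = {"OrderID": "0", "Screen": "Home_Screen", "Event": "B"}
event_c = {"OrderID": "0", "Screen": "Home_Screen", "Event": "C"}

df = pd.DataFrame({
    "CustomerID": [1, 15],
    "EventLog": [[event_a, event_b], [event_c]],
})

# One row per event; the other columns are repeated for each event.
exploded = df.explode("EventLog").reset_index(drop=True)

# Turn each event dict into columns and attach them row by row.
events = pd.json_normalize(exploded["EventLog"].tolist())
result = exploded.drop(columns="EventLog").join(events)
print(result)
```

Resetting the index after `explode` matters: the exploded frame otherwise keeps duplicate index labels, and the join would no longer pair each event with its own row.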