Flattening a pandas object into columns

Time: 2016-12-27 05:18:07

Tags: python pandas dataframe

I am trying to flatten lists from a DataFrame. My existing DataFrame looks like this:

CreationDate
2013-12-22 15:25:02                    <ubuntu><mac-osx><syslinux>
2009-12-14 14:29:32    <ubuntu><mod-rewrite><laconica><apache-2.2>
2013-12-22 15:42:00                 <ubuntu><nat><squid><mikrotik>
Name: Tags, dtype: object

Then I clean up the tag strings in the Tags column:

def tag_cleaner(s):
    # Drop every "<", split on ">", then discard the empty trailing piece.
    s0 = "".join(s.split("<")).split(">")
    return [i for i in s0 if i != ""]

df["Tags"] = df["Tags"].apply(tag_cleaner)
df["NumTags"] = df["Tags"].apply(len)
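As a hedged alternative to tag_cleaner, the same lists can probably be obtained with the vectorized .str.findall() accessor. The sketch below rebuilds the sample data by hand (the construction of df is an assumption, since the question does not show it) and stores the results in two illustrative variables, tags and num_tags.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (assumed construction).
df = pd.DataFrame(
    {
        "Tags": [
            "<ubuntu><mac-osx><syslinux>",
            "<ubuntu><mod-rewrite><laconica><apache-2.2>",
            "<ubuntu><nat><squid><mikrotik>",
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)
df.index.name = "CreationDate"

# Capture whatever sits between each "<" and ">" as one list per row.
tags = df["Tags"].str.findall(r"<([^<>]+)>")

# .str.len() also works on list-valued elements, so no lambda is needed.
num_tags = tags.str.len()
```

Keeping the results in separate variables leaves the original strings in df["Tags"] untouched, which helps if a string-based method is tried later.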

The result looks like this:

CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]        3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]        4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]        4

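Now I create a new column for each tag position:

tag_df = pd.DataFrame(index=df.index, data=df["Tags"])
# The longest tag list determines how many placeholder columns are needed.
max_cols = tag_df["Tags"].map(len).max()
for col in range(max_cols):
    # Fill each placeholder column with NaN; the float dtype is stated explicitly.
    tag_df[col] = pd.Series(index=tag_df.index, dtype="float64")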

That gives me this:

CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux] NaN NaN NaN NaN NaN
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2] NaN NaN NaN NaN NaN
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik] NaN NaN NaN NaN NaN

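For each tag in the Tags column, I want to insert the tag into the column that matches its position in the list. The final result should look like this: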

CreationDate
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux] ubuntu     mac-osx syslinux        NaN NaN
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2] ubuntu mod-rewrite laconica apache-2.2 NaN
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik] ubuntu         nat    squid   mikrotik NaN

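I have tried pd.DataFrame.insert() as well as various ways of creating new DataFrames and merging them together, but I cannot find the right combination. How can I flatten each object in the Tags column into the corresponding columns of the same row?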

2 answers:

Answer 0: (score: 2)

I would use the .str.extractall() method in this case:

In [57]: df
Out[57]:
         CreationDate                                         Tags
0 2013-12-22 15:25:02                  <ubuntu><mac-osx><syslinux>
1 2009-12-14 14:29:32  <ubuntu><mod-rewrite><laconica><apache-2.2>
2 2013-12-22 15:42:00               <ubuntu><nat><squid><mikrotik>

In [58]: x = df.pop('Tags').str.extractall(r'\<(.*?)\>').unstack()

In [59]: x.columns = x.columns.droplevel(0)

In [60]: df.join(x)
Out[60]:
         CreationDate       0            1         2           3
0 2013-12-22 15:25:02  ubuntu      mac-osx  syslinux        None
1 2009-12-14 14:29:32  ubuntu  mod-rewrite  laconica  apache-2.2
2 2013-12-22 15:42:00  ubuntu          nat     squid    mikrotik
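For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The DataFrame construction is an assumption based on the output shown above. Selecting column 0 of the extractall() result before unstacking is a small variation that avoids the droplevel step, and this version keeps the Tags column instead of popping it.

```python
import pandas as pd

# Sample frame rebuilt by hand from the session above (assumed construction).
df = pd.DataFrame(
    {
        "CreationDate": pd.to_datetime(
            ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
        ),
        "Tags": [
            "<ubuntu><mac-osx><syslinux>",
            "<ubuntu><mod-rewrite><laconica><apache-2.2>",
            "<ubuntu><nat><squid><mikrotik>",
        ],
    }
)

# extractall() returns one row per match, indexed by (original row, match number).
matches = df["Tags"].str.extractall(r"<(.*?)>")

# Column 0 holds the single capture group; unstacking the "match" level
# turns the match numbers into columns 0, 1, 2, ...
wide = matches[0].unstack("match")

# Rows with fewer tags get missing values in the trailing columns.
result = df.join(wide)
```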

Update: assuming the data is a Series rather than a DataFrame:

In [14]: s
Out[14]:
CreationDate
2013-12-22 15:25:02                    <ubuntu><mac-osx><syslinux>
2009-12-14 14:29:32    <ubuntu><mod-rewrite><laconica><apache-2.2>
2013-12-22 15:42:00                 <ubuntu><nat><squid><mikrotik>
Name: Tags, dtype: object

In [15]: type(s)
Out[15]: pandas.core.series.Series

In [16]: x = s.str.extractall(r'\<(.*?)\>').unstack().rename_axis(None)

In [17]: x.columns = x.columns.droplevel(0)

In [18]: x
Out[18]:
match                     0            1         2           3
2009-12-14 14:29:32  ubuntu  mod-rewrite  laconica  apache-2.2
2013-12-22 15:25:02  ubuntu      mac-osx  syslinux        None
2013-12-22 15:42:00  ubuntu          nat     squid    mikrotik
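One thing visible in the output above is that unstack() returns the rows sorted by index, so the 2009 row moves to the top. If the original order matters, a reindex against the source Series should restore it. A minimal sketch, assuming the index values are unique and with the sample Series rebuilt by hand:

```python
import pandas as pd

# Sample Series rebuilt by hand (assumed construction), in the original row order.
s = pd.Series(
    [
        "<ubuntu><mac-osx><syslinux>",
        "<ubuntu><mod-rewrite><laconica><apache-2.2>",
        "<ubuntu><nat><squid><mikrotik>",
    ],
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
    name="Tags",
)
s.index.name = "CreationDate"

# One column per match; the row order may come back sorted by index.
wide = s.str.extractall(r"<(.*?)>")[0].unstack("match")

# Put the rows back in the order of the source Series (needs a unique index).
wide = wide.reindex(s.index)
```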

Answer 1: (score: 1)

A partial solution that converts the strings to lists and gets their lengths.

df.Tags = df.Tags.str.strip('<>')
df.Tags = df.Tags.str.split('><')
df['NumTags'] = df.Tags.apply(lambda x: len(x))
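Once the column holds lists, one possible way to finish the job is to pass those lists to the DataFrame constructor, which pads shorter rows with missing values. This is a sketch under assumptions (the sample frame is rebuilt by hand, and lists, wide and result are illustrative names), not part of the original answer:

```python
import pandas as pd

# Sample frame rebuilt by hand from the question (assumed construction).
df = pd.DataFrame(
    {
        "Tags": [
            "<ubuntu><mac-osx><syslinux>",
            "<ubuntu><mod-rewrite><laconica><apache-2.2>",
            "<ubuntu><nat><squid><mikrotik>",
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)
df.index.name = "CreationDate"

# Same cleaning as above: trim the outer brackets, then split on "><".
lists = df["Tags"].str.strip("<>").str.split("><")

# Build one column per list position, reusing the original index
# so the new columns line up with the source rows.
wide = pd.DataFrame(lists.tolist(), index=df.index)

result = df.join(wide)
```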

Working solution
Uncomment the commented lines below, copy them to the clipboard, then comment them out again. Then run the code.

import pandas as pd

# CreationDate
# 2013-12-22 15:25:02                    <ubuntu><mac-osx><syslinux>
# 2009-12-14 14:29:32    <ubuntu><mod-rewrite><laconica><apache-2.2>
# 2013-12-22 15:42:00                 <ubuntu><nat><squid><mikrotik>
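# After read_clipboard(), the tag strings end up in the column named CreationDate.
df = pd.read_clipboard()
df2 = df.copy()
df2.CreationDate = df2.CreationDate.str.strip('<>')
df2.CreationDate = df2.CreationDate.str.split('><')
df2['Length'] = df2.CreationDate.apply(len)

for a in range(df2.Length.max()):
    # Pad short rows with a real missing value (None) rather than the string 'NaN'.
    df2[a] = df2.CreationDate.apply(lambda x: x[a] if a < len(x) else None)
df2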
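The explicit loop can likely be avoided: .str.split() accepts expand=True, which returns one column per piece directly. A minimal sketch, using a hand-built Series instead of the clipboard (an assumption made so the example is reproducible):

```python
import pandas as pd

# Hand-built stand-in for the clipboard data (assumed construction).
s = pd.Series(
    [
        "<ubuntu><mac-osx><syslinux>",
        "<ubuntu><mod-rewrite><laconica><apache-2.2>",
        "<ubuntu><nat><squid><mikrotik>",
    ],
    name="Tags",
)

# expand=True spreads the split pieces across columns 0, 1, 2, ...
# and leaves missing values where a row has fewer tags.
wide = s.str.strip("<>").str.split("><", expand=True)

# Count the tags per row from the non-missing cells.
wide["Length"] = wide.notna().sum(axis=1)
```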
