pandas.Series.str.get_dummies

Question

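I have a Google Spreadsheet that I use to collect survey data (for this question I will use an example form). Some of its questions allow multiple answers, selected with a set of checkboxes.

When I take the data from the form and import it into pandas, I get this: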

             Timestamp    What sweets do you like?
0  23/11/2013 13:22:30  Chocolate, Toffee, Popcorn
1  23/11/2013 13:22:34                   Chocolate
2  23/11/2013 13:22:39      Toffee, Popcorn, Fruit
3  23/11/2013 13:22:45               Fudge, Toffee
4  23/11/2013 13:22:48                     Popcorn

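I would like to compute statistics on the results of the question (how many people like Chocolate, how many like Toffee, and so on). The problem is that all the answers sit in a single column, so grouping by that column and asking for counts does not work.

Is there a simple way in pandas to convert this kind of DataFrame into one with several columns named Chocolate, Toffee, Popcorn, Fudge and Fruit, each holding a boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure it would really help (it might make the aggregations I want harder).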

Answer 1

Read the data in as a fixed-width table and drop the first column:

In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)

In [31]: df
Out[31]: 
             Timestamp What sweets do you like0
0  23/11/2013 13:22:34                Chocolate
1  23/11/2013 13:22:39   Toffee, Popcorn, Fruit
2  23/11/2013 13:22:45            Fudge, Toffee
3  23/11/2013 13:22:48                  Popcorn

Convert the timestamp to a proper datetime64 dtype (not needed for this exercise, but almost always what you want):

In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
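The timestamps in this data are day-first (23/11/2013). A minimal sketch, assuming every value follows the DD/MM/YYYY HH:MM:SS layout shown above, passes an explicit format so the parse does not rely on inference:

```python
import pandas as pd

# Two sample values copied from the frame above.
raw = pd.Series(["23/11/2013 13:22:34", "23/11/2013 13:22:39"])

# An explicit format removes any day/month ambiguity.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")

print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```

If the real export uses a different layout, the format string has to be adjusted to match.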

Give the columns new names:

In [33]: df.columns = ['date','sweets']

In [34]: df
Out[34]: 
                 date                  sweets
0 2013-11-23 13:22:34               Chocolate
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45           Fudge, Toffee
3 2013-11-23 13:22:48                 Popcorn

In [35]: df.dtypes
Out[35]: 
date      datetime64[ns]
sweets            object
dtype: object

Split the sweets column from a string into a list:

In [37]: df['sweets'].str.split(',\s*')
Out[37]: 
0                 [Chocolate]
1    [Toffee, Popcorn, Fruit]
2             [Fudge, Toffee]
3                   [Popcorn]
Name: sweets, dtype: object
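The split step can be reproduced on its own. This sketch rebuilds the four sample answers by hand (an assumption, since the original `data` string is not shown) and uses a raw string for the pattern so that `\s` is not treated as a string escape:

```python
import pandas as pd

# Sample answers copied from the frame above.
sweets = pd.Series(
    ["Chocolate", "Toffee, Popcorn, Fruit", "Fudge, Toffee", "Popcorn"],
    name="sweets",
)

# A pattern longer than one character is interpreted as a regular expression,
# so this splits on a comma followed by any amount of whitespace.
lists = sweets.str.split(r",\s*")

print(lists.tolist())
```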

The key step: this creates a dummy matrix marking where each value is present:

In [38]: df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x))
Out[38]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1    NaN    NaN      NaN     NaN
1        NaN      1    NaN        1       1
2        NaN    NaN      1      NaN       1
3        NaN    NaN    NaN        1     NaN

For the final result we fill the NaNs with 0, then astype to bool so the values are True / False, and then concat the result to the original frame:

In [40]: pd.concat([df,df['sweets'].str.split(',\s*').apply(lambda x: Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]: 
                 date                  sweets Chocolate  Fruit  Fudge Popcorn Toffee
0 2013-11-23 13:22:34               Chocolate      True  False  False   False  False
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit     False   True  False    True   True
2 2013-11-23 13:22:45           Fudge, Toffee     False  False   True   False   True
3 2013-11-23 13:22:48                 Popcorn     False  False  False    True  False
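Put together, the steps of this answer can be sketched as a standalone script. The frame is rebuilt by hand here (an assumption, since the original `data` string is not shown), and the last line adds the per-sweet tally the question asks for by summing the boolean columns:

```python
import pandas as pd

# Hand-built copy of the renamed frame from the steps above.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/11/2013 13:22:34", "23/11/2013 13:22:39",
         "23/11/2013 13:22:45", "23/11/2013 13:22:48"],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "sweets": ["Chocolate", "Toffee, Popcorn, Fruit", "Fudge, Toffee", "Popcorn"],
})

# One row per respondent, one column per sweet; absent sweets come out as NaN.
dummies = df["sweets"].str.split(r",\s*").apply(lambda x: pd.Series(1, index=x))

# NaN -> 0, then to True/False.
flags = dummies.fillna(0).astype(bool)

# Attach the indicator columns to the original frame.
result = pd.concat([df, flags], axis=1)

# Summing booleans counts how many respondents chose each sweet.
counts = flags.sum()
print(counts)
```

Column order in `dummies` may differ between pandas versions, so the sketch looks values up by name rather than by position.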

Answer 2

A few days ago I ran into the same problem, and after some searching I found the str.get_dummies function in the pandas documentation. Let's see how it works:

pandas.Series.str.get_dummies

As the documentation states, str.get_dummies splits each string in the Series by sep and returns a DataFrame of dummy/indicator variables.

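Here is a simplified version of the DataFrame above:

In [27]: df
Out[27]: 
     What sweets do you like?
0  Chocolate, Toffee, Popcorn
1                   Chocolate
2      Toffee, Popcorn, Fruit
3               Fudge, Toffee
4                     Popcorn

The only argument we need to specify in str.get_dummies is sep, which in our case is a comma followed by a space:

In [28]: df['What sweets do you like?'].str.get_dummies(sep=', ')
Out[28]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1      0      0        1       1
1          1      0      0        0       0
2          0      1      0        1       1
3          0      0      1        0       1
4          0      0      0        1       0

Note:

There is a space after the comma in the sep argument, because a space is a character in its own right. If we leave it out of sep, every answer after the first keeps its leading space, so ' Toffee' and 'Toffee' end up as separate columns, which is clearly wrong.

As a rule of thumb, always take care to write the separator exactly!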
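The effect of the separator can be checked directly. This sketch rebuilds the five answers from the question as a plain Series (the variable names are illustrative) and compares `sep=', '` with `sep=','`:

```python
import pandas as pd

# The five answers from the question.
answers = pd.Series([
    "Chocolate, Toffee, Popcorn",
    "Chocolate",
    "Toffee, Popcorn, Fruit",
    "Fudge, Toffee",
    "Popcorn",
])

# Correct separator: comma followed by a space.
good = answers.str.get_dummies(sep=", ")

# Separator without the space: later items keep a leading space.
bad = answers.str.get_dummies(sep=",")

print(sorted(good.columns))  # five clean labels
print(sorted(bad.columns))   # seven labels, some with a leading space
print(good.sum())            # how many respondents like each sweet
```

Summing the indicator columns of `good` gives the per-sweet counts the question asks for.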

Answer 3

How about something like this:

# Create some data
import pandas as pd
Foods = ['Chocolate, Toffee, Popcorn', 'Chocolate', 'Toffee, Popcorn, Fruit', 'Fudge, Toffee', 'Popcorn']
Dates = ['23/11/2013 13:22:30', '23/11/2013 13:22:34', '23/11/2013 13:22:39', '23/11/2013 13:22:45', '23/11/2013 13:22:48']
DF = pd.DataFrame(Foods, index=Dates, columns=['Sweets'])

# Create the unique list of foods
UniqueFoods = ['Chocolate', 'Toffee', 'Popcorn', 'Fudge', 'Fruit']

# Create a new data frame with a column for each food type and an identical index
DFTransformed = pd.DataFrame({Food: 0 for Food in UniqueFoods}, index=DF.index)

# Iterate through the data and set a 1 wherever the food name appears in the answer
for row in DF.index:
    for Food in UniqueFoods:
        # .loc avoids chained assignment, which may silently fail to update the frame
        if Food in DF.loc[row, 'Sweets']:
            DFTransformed.loc[row, Food] = 1
DFTransformed
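The loop above matches by substring, so a name that is contained in another (say a hypothetical 'Chocolate Fudge' option) would be counted twice. A loop-free sketch that matches whole items instead, assuming a pandas version that provides `Series.explode` (0.25 or later):

```python
import pandas as pd

foods = ["Chocolate, Toffee, Popcorn", "Chocolate", "Toffee, Popcorn, Fruit",
         "Fudge, Toffee", "Popcorn"]
sweets = pd.Series(foods, name="Sweets")

# One row per (respondent, sweet) pair; the original row label is repeated.
exploded = sweets.str.split(r",\s*").explode()

# Indicator columns per sweet, collapsed back to one row per respondent.
table = pd.get_dummies(exploded).groupby(level=0).sum()

print(table)
print(table.sum())
```

The list of unique foods does not have to be maintained by hand here, because the columns are derived from the data.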

Handling multiple-answer questionnaires (from Google Forms) with pandas
