Filtration based on a bag of words in a pandas DataFrame

Time: 2017-10-09 05:13:13

Tags: python pandas function multi-index

I have several mapping rules. Here they are:

Type A: Chicken, Beef, Goat
Type B: Fish, Shrimp
Type C: Chicken, Pork

I want to apply them to this table:

 id   Menu
   1    Fried Chicken
   2    Shrimp Chips
   3    Pork with Cheese
   4    Fish Spaghetti
   5    Goat Sate
   6    Beef Soup

I want to produce labels like this:

 id     Menu                 Type A   Type B   Type C
   1    Fried Chicken        1        0        1
   2    Shrimp Chips         0        1        0
   3    Pork with Cheese     0        0        1
   4    Fish Spaghetti       0        1        0
   5    Goat Sate            1        0        0
   6    Beef Soup            1        0        0

2 answers:

Answer 0 (score: 4)

I convert your mapping rules into a pd.MultiIndex

import pandas as pd
from numpy.core.defchararray import find  # also exposed as np.char.find

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

Option 0
The simplest approach, using pd.Series.str.get_dummies

df.join(
    df.Menu.str.get_dummies(sep=' ') \
      .reindex(columns=mux, level=1).max(axis=1, level=0)
)

   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0
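
To see why Option 0 works, it may help to look at the intermediate frame that `str.get_dummies(sep=' ')` produces before it is reindexed. The sketch below rebuilds the question's table so it runs on its own. The "Chickenwing Soup" menu is a made-up example, added only to show that this option matches whole space-separated tokens rather than substrings.

```python
import pandas as pd

# The question's table, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Menu": ["Fried Chicken", "Shrimp Chips", "Pork with Cheese",
             "Fish Spaghetti", "Goat Sate", "Beef Soup"],
})

# One 0/1 indicator column per distinct space-separated token.
tokens = df["Menu"].str.get_dummies(sep=" ")
print(tokens.columns.tolist())
print(tokens[["Chicken", "Pork", "Shrimp"]])

# Hypothetical menu: the token is "Chickenwing", so no "Chicken" column appears.
odd = pd.Series(["Chickenwing Soup"]).str.get_dummies(sep=" ")
print("Chicken" in odd.columns)
```

Options 1 and 2 search for substrings instead, so they would flag the made-up "Chickenwing Soup" row as Chicken. Which behaviour is wanted depends on the data.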

Option 1
Using numpy.core.defchararray.find

menu = df.Menu.values.astype(str)

d1 = pd.DataFrame(
    (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
    columns = mux.levels[1]
)

df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))

   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0

Option 2
Using pd.Series.str.extractall

d1 = pd.get_dummies(
    df.Menu.str.extractall(
        '({})'.format('|'.join(mux.levels[1]))
    )[0]
).sum(level=0)

df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))

   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0
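
A small sketch of the intermediate steps in Option 2, assuming a hypothetical two-row frame in which one menu contains two keywords. It shows that `extractall` returns one row per match, with an extra index level, and how those rows are collapsed back to one row per menu. The collapse is written with `groupby(level=0).sum()` rather than `.sum(level=0)`, since I believe the `level=` form is no longer available on recent pandas releases.

```python
import pandas as pd

# Hypothetical data: the second menu contains two of the keywords.
df = pd.DataFrame({"Menu": ["Fried Chicken", "Pork and Beef Stew"]})
words = ["Beef", "Chicken", "Pork"]

# One row per regex match; the index gains a second level named "match".
hits = df["Menu"].str.extractall("({})".format("|".join(words)))[0]
print(hits)

# Indicator columns per matched word, then one row per original menu.
per_row = pd.get_dummies(hits, dtype=int).groupby(level=0).sum()
print(per_row)
```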

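Explanation of Option 1
By using a pd.MultiIndex, I can run np.core.defchararray.find over only the unique values of all the words I am looking for, and still map the results back to possibly multiple keys.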

mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

mux looks like this:

 Type A           Type B         Type C     
Chicken Beef Goat   Fish Shrimp Chicken Pork

However, the unique values of mux are in mux.levels[1]. I use that to look up my values.
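
The difference between the full second level and its unique values can be checked directly. This sketch only uses the `m` and `mux` definitions from the answer:

```python
import pandas as pd

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

# Every (type, word) pair in order; Chicken is listed twice.
print(mux.get_level_values(1).tolist())

# The de-duplicated, sorted words stored for the level.
print(mux.levels[1].tolist())
```

Searching over `mux.levels[1]` therefore looks for each word only once, even when it belongs to several types.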

d1 = pd.DataFrame(
    (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
    columns = mux.levels[1]
)

d1

   Beef  Chicken  Fish  Goat  Pork  Shrimp
0     0        1     0     0     0       0
1     0        0     0     0     0       1
2     0        0     0     0     1       0
3     0        0     1     0     0       0
4     0        0     0     1     0       0
5     1        0     0     0     0       0
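
The `menu[:, None]` indexing is what makes `find` compare every menu against every word. A reduced sketch of that broadcasting step, using `np.char.find` (the public alias of the same function) and plain string arrays that are shortened versions of the question's data:

```python
import numpy as np

menu = np.array(["Fried Chicken", "Shrimp Chips", "Beef Soup"])
words = np.array(["Beef", "Chicken"])

# menu[:, None] has shape (3, 1) and words has shape (2,), so the call
# broadcasts to a (3, 2) grid: the position of each word in each menu,
# or -1 where the word does not occur.
pos = np.char.find(menu[:, None], words)
print(pos)

# Turn positions into 0/1 flags, as the answer does.
found = (pos >= 0).astype(int)
print(found)
```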

Now I use pd.DataFrame.reindex on the columns with level=1

d1.reindex(columns=mux, level=1)

   Type A           Type B         Type C     
  Chicken Beef Goat   Fish Shrimp Chicken Pork
0       1    0    0      0      0       1    0
1       0    0    0      0      1       0    0
2       0    0    0      0      0       0    1
3       0    0    0      1      0       0    0
4       0    0    1      0      0       0    0
5       0    1    0      0      0       0    0

I then take the max with axis=1 and level=0 and join the result back to df... which is what I showed above.
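
The `level=` argument to reductions such as `max` was deprecated and, as far as I know, removed in pandas 2.0, so this last step may fail on a recent install. Below is a sketch of an equivalent collapse that avoids it. It swaps in `pd.concat` over a dict and a transposed `groupby`, neither of which is used in the answer, and it builds on the get_dummies tokens from Option 0.

```python
import pandas as pd

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Menu": ["Fried Chicken", "Shrimp Chips", "Pork with Cheese",
             "Fish Spaghetti", "Goat Sate", "Beef Soup"],
})

tokens = df["Menu"].str.get_dummies(sep=" ")

# Build the (type, word) column layout explicitly: one block per type,
# with words that never occur in any menu filled with 0.
wide = pd.concat(
    {k: tokens.reindex(columns=words, fill_value=0) for k, words in m.items()},
    axis=1,
)

# Collapse the word level: a type is flagged if any of its words matched.
labels = wide.T.groupby(level=0).max().T

result = df.join(labels)
print(result)
```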

Timing

def pir0(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])

    return df.join(
        df.Menu.str.get_dummies(sep=' ') \
          .reindex(columns=mux, level=1).max(axis=1, level=0)
    )

def pir1(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])

    menu = df.Menu.values.astype(str)

    d1 = pd.DataFrame(
        (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
        columns = mux.levels[1]
    )

    return df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))

def pir2(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])

    d1 = pd.get_dummies(
        df.Menu.str.extractall(
            '({})'.format('|'.join(mux.levels[1]))
        )[0]
    ).sum(level=0)

    return df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))


def keiku(df, m):
    return df.assign(**{k: df.Menu.str.contains('|'.join(m[k])).astype(int) for k in m})


from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000],
    columns='pir0 pir1 pir2 keiku'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d, m)'.format(j)
        setp = 'from __main__ import d, m, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=20)

res.plot(loglog=True)

Answer 1 (score: 3)

The following code is a simple solution.

import pandas as pd
from io import StringIO

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

csv = StringIO("""id,Menu
1,Fried Chicken
2,Shrimp Chips
3,Pork with Cheese
4,Fish Spaghetti
5,Goat Sate
6,Beef Soup""")
df = pd.read_csv(csv)

for key in m:
    df[key] = df["Menu"].str.contains('|'.join(m[key])).astype(int)

df
# Out[3]: 
#    id              Menu  Type A  Type B  Type C
# 0   1     Fried Chicken       1       0       1
# 1   2      Shrimp Chips       0       1       0
# 2   3  Pork with Cheese       0       0       1
# 3   4    Fish Spaghetti       0       1       0
# 4   5         Goat Sate       1       0       0
# 5   6         Beef Soup       1       0       0
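
One caveat with joining keywords by '|': `str.contains` treats the result as a regular expression and matches substrings. If keywords can contain regex metacharacters, or if partial-word hits are unwanted, a variant along these lines might help. The rules and menus here are hypothetical, chosen only to show the two cases, and the `\b` word boundaries are an addition that the answer does not use.

```python
import re
import pandas as pd

# Hypothetical rules and menus chosen to show both pitfalls.
m = {"Type A": ["Chicken"], "Type D": ["Fish+Chips"]}
df = pd.DataFrame({"Menu": ["Chickenwing", "Fish+Chips Basket", "Fried Chicken"]})

for key, words in m.items():
    # re.escape makes "+" literal; \b requires a word boundary on each
    # side, so "Chicken" does not match inside "Chickenwing".
    pattern = "|".join(r"\b{}\b".format(re.escape(w)) for w in words)
    df[key] = df["Menu"].str.contains(pattern).astype(int)

print(df)
```

With the unescaped pattern from the loop above, "Fish+Chips" would be read as "Fis", one or more "h", then "Chips", and would not match the literal menu text.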