I have several mapping rules. Here they are:
Type A: Chicken, Beef, Goat
Type B: Fish, Shrimp
Type C: Chicken, Pork
I want to apply them to this data:
id  Menu
1   Fried Chicken
2   Shrimp Chips
3   Pork with Cheese
4   Fish Spaghetti
5   Goat Sate
6   Beef Soup
I want to create labels like this:
id  Menu              Type A  Type B  Type C
1   Fried Chicken     1       0       1
2   Shrimp Chips      0       1       0
3   Pork with Cheese  0       0       1
4   Fish Spaghetti    0       1       0
5   Goat Sate         1       0       0
6   Beef Soup         1       0       0
Answer 0 (score: 4)
I convert your mapping rules into a pd.MultiIndex:
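import pandas as pd
# Same function as numpy.core.defchararray.find, imported via the public namespace
from numpy.char import find

# Mapping rules: label -> keywords that trigger it
m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# One (label, keyword) column per rule entry; 'Chicken' appears under two labels
mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])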
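Option 0
Use pd.Series.str.get_dummies:
# pandas 2.0 removed the level= argument of max(); on newer versions,
# replace .max(axis=1, level=0) with .T.groupby(level=0).max().T
df.join(
    df.Menu.str.get_dummies(sep=' ')
      .reindex(columns=mux, level=1).max(axis=1, level=0)
)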
   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0
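For readers on a recent pandas release, the same idea can be sketched without level-aware reindex or max(level=...). This is only a minimal sketch, not the answer's original code: the names tokens and labels are made up for illustration, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data and mapping rules copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Menu": ["Fried Chicken", "Shrimp Chips", "Pork with Cheese",
             "Fish Spaghetti", "Goat Sate", "Beef Soup"],
})
m = {
    "Type A": ["Chicken", "Beef", "Goat"],
    "Type B": ["Fish", "Shrimp"],
    "Type C": ["Chicken", "Pork"],
}

# One 0/1 column per whitespace-separated token in Menu
tokens = df["Menu"].str.get_dummies(sep=" ")

# For each label, keep only its keyword columns (absent keywords become 0)
# and flag the row if any of them is set
labels = pd.DataFrame({
    label: tokens.reindex(columns=words, fill_value=0).max(axis=1)
    for label, words in m.items()
})

result = df.join(labels)
print(result)
```

Like Option 0 itself, this matches whole space-separated tokens only, so a keyword glued to punctuation or embedded in a longer word would not be flagged.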
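Option 1
Use numpy.core.defchararray.find (also available as numpy.char.find):
menu = df.Menu.values.astype(str)
# Broadcast every menu string against every unique keyword;
# find() returns -1 when the keyword is absent
d1 = pd.DataFrame(
    (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
    columns=mux.levels[1]
)
df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))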
   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0
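Option 2
Use pd.Series.str.extractall:
# extractall yields one row per regex match; sum(level=0) collapses the
# matches back to one row per menu item (on pandas >= 2.0, where level=
# was removed, use .groupby(level=0).sum() instead)
d1 = pd.get_dummies(
    df.Menu.str.extractall(
        '({})'.format('|'.join(mux.levels[1]))
    )[0]
).sum(level=0)
df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))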
   id              Menu  Type A  Type B  Type C
0   1     Fried Chicken       1       0       1
1   2      Shrimp Chips       0       1       0
2   3  Pork with Cheese       0       0       1
3   4    Fish Spaghetti       0       1       0
4   5         Goat Sate       1       0       0
5   6         Beef Soup       1       0       0
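The extractall approach can also be sketched for current pandas versions. The following is a hedged variant rather than the answer's code: it swaps sum(level=0) for groupby(level=0).max(), escapes the keywords with re.escape, and reindexes so that rows without any match still appear. The names words, pattern, found, and labels are illustrative only.

```python
import re
import pandas as pd

# Sample data and mapping rules copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "Menu": ["Fried Chicken", "Shrimp Chips", "Pork with Cheese",
             "Fish Spaghetti", "Goat Sate", "Beef Soup"],
})
m = {
    "Type A": ["Chicken", "Beef", "Goat"],
    "Type B": ["Fish", "Shrimp"],
    "Type C": ["Chicken", "Pork"],
}

# Unique keywords joined into one capturing alternation
words = sorted({w for ws in m.values() for w in ws})
pattern = "({})".format("|".join(re.escape(w) for w in words))

# One row per match, indexed by (original row, match number)
matches = df["Menu"].str.extractall(pattern)[0]

# 0/1 per keyword and original row; rows with no match are restored as zeros
found = (
    pd.get_dummies(matches, dtype=int)
    .groupby(level=0).max()
    .reindex(df.index, fill_value=0)
)

# A label is set when any of its keywords was found
labels = pd.DataFrame({
    label: found.reindex(columns=ws, fill_value=0).max(axis=1)
    for label, ws in m.items()
})

result = df.join(labels)
print(result)
```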
Explanation of Option 1
With a pd.MultiIndex, I can run np.core.defchararray.find over only the unique words I am looking for, and still map each word back to possibly several keys.
mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])
mux looks like this:
Type A               Type B        Type C
Chicken  Beef  Goat  Fish  Shrimp  Chicken  Pork
However, the unique values of mux are in mux.levels[1], and that is what I use to search for my values.
d1 = pd.DataFrame(
    (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
    columns=mux.levels[1]
)
d1
   Beef  Chicken  Fish  Goat  Pork  Shrimp
0     0        1     0     0     0       0
1     0        0     0     0     0       1
2     0        0     0     0     1       0
3     0        0     1     0     0       0
4     0        0     0     1     0       0
5     1        0     0     0     0       0
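The broadcasting step behind d1 can be seen in isolation with a small sketch. It uses numpy.char.find on plain string arrays; the three menu entries come from the question, while the variable names positions and hits are made up for illustration.

```python
import numpy as np

menu = np.array(["Fried Chicken", "Shrimp Chips", "Beef Soup"])
words = np.array(["Beef", "Chicken", "Pork", "Shrimp"])

# menu[:, None] has shape (3, 1) and words has shape (4,), so find()
# broadcasts to a (3, 4) array of start positions (-1 means "not found")
positions = np.char.find(menu[:, None], words)

# Turn positions into 0/1 flags, one row per menu item, one column per word
hits = (positions >= 0).astype(int)
print(hits)
```

Because find() does substring search, a keyword also counts as found when it sits inside a longer word.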
Now I use pd.DataFrame.reindex on the columns with level=1:
d1.reindex(columns=mux, level=1)
   Type A               Type B        Type C
   Chicken  Beef  Goat  Fish  Shrimp  Chicken  Pork
0        1     0     0     0       0        1     0
1        0     0     0     0       1        0     0
2        0     0     0     0       0        0     1
3        0     0     0     1       0        0     0
4        0     0     1     0       0        0     0
5        0     1     0     0       0        0     0
I take the max over axis=1 at level=0 and join the result back to df, which gives the output I showed above.
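The expand-then-collapse step can be sketched explicitly, in a form that should also run on pandas versions where max(level=...) is no longer available. This is an illustrative alternative, not the answer's exact calls: it selects the keyword columns by name instead of using reindex with level=1, and the names wide and labels are assumptions.

```python
import pandas as pd

m = {
    "Type A": ["Chicken", "Beef", "Goat"],
    "Type B": ["Fish", "Shrimp"],
    "Type C": ["Chicken", "Pork"],
}
mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

# Keyword hits per menu row, values copied from the d1 table above
d1 = pd.DataFrame(
    [[0, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [1, 0, 0, 0, 0, 0]],
    columns=["Beef", "Chicken", "Fish", "Goat", "Pork", "Shrimp"],
)

# Repeat each keyword column once per (label, keyword) pair, so 'Chicken'
# shows up under both Type A and Type C, then attach the two-level header
wide = d1.loc[:, list(mux.get_level_values(1))]
wide.columns = mux

# Collapse the keyword level: a label is 1 if any of its keywords is 1
labels = wide.T.groupby(level=0).max().T
print(labels)
```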
Timing
# Option 0: get_dummies on space-separated tokens
def pir0(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])
    return df.join(
        df.Menu.str.get_dummies(sep=' ')
          .reindex(columns=mux, level=1).max(axis=1, level=0)
    )

# Option 1: vectorized substring search with find
def pir1(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])
    menu = df.Menu.values.astype(str)
    d1 = pd.DataFrame(
        (find(menu[:, None], mux.levels[1]) >= 0).astype(int),
        columns=mux.levels[1]
    )
    return df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))

# Option 2: regex extractall
def pir2(df, m):
    mux = pd.MultiIndex.from_tuples(
        [(k, v) for k, values in m.items() for v in values])
    d1 = pd.get_dummies(
        df.Menu.str.extractall(
            '({})'.format('|'.join(mux.levels[1]))
        )[0]
    ).sum(level=0)
    return df.join(d1.reindex(columns=mux, level=1).max(axis=1, level=0))

# Answer 1: one str.contains call per label
def keiku(df, m):
    return df.assign(**{k: df.Menu.str.contains('|'.join(m[k])).astype(int) for k in m})
from timeit import timeit

# Rows: number of stacked copies of df; columns: function under test
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000],
    columns='pir0 pir1 pir2 keiku'.split(),
    dtype=float
)

for i in res.index:
    # Stack i copies of the sample frame to get a larger input
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d, m)'.format(j)
        setp = 'from __main__ import d, m, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=20)

res.plot(loglog=True)
Answer 1 (score: 3)
The following code is a simple solution.
import pandas as pd
from io import StringIO
m = {
'Type A': ['Chicken', 'Beef', 'Goat'],
'Type B': ['Fish', 'Shrimp'],
'Type C': ['Chicken', 'Pork']
}
csv = StringIO("""id,Menu
1,Fried Chicken
2,Shrimp Chips
3,Pork with Cheese
4,Fish Spaghetti
5,Goat Sate
6,Beef Soup""")
df = pd.read_csv(csv)
# Flag each row whose Menu contains any keyword of the label (regex alternation)
for key in m:
    df[key] = df["Menu"].str.contains('|'.join(m[key])).astype(int)
df
# Out[3]:
#    id              Menu  Type A  Type B  Type C
# 0   1     Fried Chicken       1       0       1
# 1   2      Shrimp Chips       0       1       0
# 2   3  Pork with Cheese       0       0       1
# 3   4    Fish Spaghetti       0       1       0
# 4   5         Goat Sate       1       0       0
# 5   6         Beef Soup       1       0       0
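Since str.contains does substring matching, a keyword such as Fish would also match inside a longer word. If that matters for your data, a hedged variant with word boundaries might look like the sketch below. The helper name add_labels and the row "Fishcake" are hypothetical and not part of the original answer.

```python
import re
import pandas as pd

def add_labels(df, m, column="Menu"):
    """Return a copy of df with one 0/1 column per key of m (hypothetical helper)."""
    out = df.copy()
    for label, words in m.items():
        # \b restricts matches to whole words; re.escape guards against
        # keywords that contain regex metacharacters
        pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))
        out[label] = out[column].str.contains(pattern).astype(int)
    return out

m = {
    "Type A": ["Chicken", "Beef", "Goat"],
    "Type B": ["Fish", "Shrimp"],
    "Type C": ["Chicken", "Pork"],
}
# 'Fishcake' is a made-up row to show the effect of the word boundary
df = pd.DataFrame({"id": [1, 2, 3],
                   "Menu": ["Fried Chicken", "Shrimp Chips", "Fishcake"]})

labeled = add_labels(df, m)
print(labeled)
```

With the plain alternation from the answer, the made-up "Fishcake" row would be flagged as Type B. With the word boundaries it is not, so pick whichever behavior fits your menu data.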