I have a CSV file that looks like this:
index,labels
1,created the tower
2,destroyed the tower
3,created the swimming pool
4,destroyed the swimming pool
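Now, if I pass in a list of the keywords I want, to stand in for the labels column (the list does not contain every word that appears in that column):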
['created','tower','destroyed','swimming pool']
I want to get this DataFrame:
index,created,destroyed,tower,swimming pool
1,1,0,1,0
2,0,1,1,0
3,1,0,0,1
4,0,1,0,1
I looked into get_dummies, but that did not help.
Answer 0 (score: 9)
import re
import pandas as pd
df = pd.DataFrame({'index': [1, 2, 3, 4], 'labels': ['created the tower', 'destroyed the tower', 'created the swimming pool', 'destroyed the swimming pool']})
columns = ['created','destroyed','tower','swimming pool']
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])
result = (df['labels'].str.extractall(pat)).groupby(level=0).count()
result.columns = columns
print(result)
yields
created destroyed tower swimming pool
0 1 0 1 0
1 0 1 1 0
2 1 0 0 1
3 0 1 0 1
Most of the work is done by str.extractall:
In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
Out[808]:
               0          1      2              3
  match
0 0      created        NaN    NaN            NaN
  1          NaN        NaN  tower            NaN
1 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN  tower            NaN
2 0      created        NaN    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
3 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
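Since each match is placed on its own row, the desired result can be obtained with a groupby/count operation, grouping by the first level of the index (the original index).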
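Two edge cases follow from the row-per-match layout described above: a label with no match produces no rows at all, so it drops out of the grouped result, and a keyword that occurs twice in one label is counted twice. The following is a minimal sketch, not part of the original answer, that reindexes against the original index and clips the counts to 0/1. The function name keyword_indicators and the sample Series are made up for illustration.

```python
import re
import pandas as pd


def keyword_indicators(ser, keywords):
    """Return a 0/1 DataFrame with one column per keyword (illustrative helper)."""
    # One capturing group per keyword, escaped so it is matched literally.
    pat = '|'.join('({})'.format(re.escape(k)) for k in keywords)
    # extractall yields one row per match; count the non-NaN cells per original row.
    counts = ser.str.extractall(pat).groupby(level=0).count()
    counts.columns = keywords
    # Rows without any match are absent from `counts`, so restore them as zeros,
    # then cap repeated matches at 1 to get indicator values.
    return counts.reindex(ser.index, fill_value=0).clip(upper=1)


sample = pd.Series(['created the tower', 'tower next to tower', 'nothing here'])
indicators = keyword_indicators(sample, ['created', 'tower'])
print(indicators)
```

Without the reindex step the third row would be missing, and without the clip step the second row would report 2 for tower.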
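Note that Python's re module has a hard-coded limit on the number of groups allowed (the message says "named groups", but the check, p.pattern.groups > 100, also counts the unnamed capturing groups used here):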
/usr/lib/python3.4/sre_compile.py in compile(p, flags)
577 if p.pattern.groups > 100:
578 raise AssertionError(
--> 579 "sorry, but this version only supports 100 named groups"
580 )
581
AssertionError: sorry, but this version only supports 100 named groups
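On Python versions that enforce this check, it limits the extractall approach used above to roughly 100 keywords (the benchmark below uses 99). The traceback is from Python 3.4; Python 3.5 removed the limit.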
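If the group limit is a concern, one possible workaround is to split the keyword list into batches and run extractall once per batch. This is a rough sketch under stated assumptions, not something from the original answer: the function name extractall_in_chunks and the default chunk_size of 99 are illustrative choices. Note that batching can change results when keywords overlap, because alternatives inside a single pattern compete for the same characters while separate patterns do not.

```python
import re
import pandas as pd


def extractall_in_chunks(ser, keywords, chunk_size=99):
    """Count keyword matches per row, using at most chunk_size groups per regex."""
    pieces = []
    for start in range(0, len(keywords), chunk_size):
        chunk = list(keywords[start:start + chunk_size])
        pat = '|'.join('({})'.format(re.escape(k)) for k in chunk)
        counts = ser.str.extractall(pat).groupby(level=0).count()
        counts.columns = chunk
        # Rows with no match in this batch are missing; add them back as zeros.
        pieces.append(counts.reindex(ser.index, fill_value=0))
    # Place the per-batch columns side by side.
    return pd.concat(pieces, axis=1)


many_keywords = ['k{:03d}'.format(i) for i in range(150)]
labels = pd.Series(['k000 k149', 'k100'])
chunked = extractall_in_chunks(labels, many_keywords)
print(chunked.shape)
```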
Here is a benchmark suggesting that cᴏʟᴅsᴘᴇᴇᴅ's solution may be the fastest, at least for a certain range of use cases:
In [76]: %timeit using_contains(ser, keywords)
10 loops, best of 3: 63.4 ms per loop
In [77]: %timeit using_defchararray(ser, keywords)
10 loops, best of 3: 90.6 ms per loop
In [78]: %timeit using_extractall(ser, keywords)
10 loops, best of 3: 126 ms per loop
Here is the setup I used:
import re
import string
import numpy as np
import pandas as pd

def using_defchararray(ser, keywords):
    """
    https://stackoverflow.com/a/46046558/190597 (piRSquared)
    """
    v = ser.values.astype(str)
    # >>> (np.core.defchararray.find(v[:, None], keywords) >= 0)
    # array([[ True, False,  True, False],
    #        [False,  True,  True, False],
    #        [ True, False, False,  True],
    #        [False,  True, False,  True]], dtype=bool)
    result = pd.DataFrame(
        (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
        index=ser.index, columns=keywords)
    return result

def using_extractall(ser, keywords):
    """
    https://stackoverflow.com/a/46046417/190597 (unutbu)
    """
    pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords])
    result = (ser.str.extractall(pat)).groupby(level=0).count()
    result.columns = keywords
    return result

def using_contains(ser, keywords):
    """
    https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ)
    """
    return (pd.concat([ser.str.contains(x) for x in keywords],
                      axis=1, keys=keywords).astype(int))

def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100):
    # Draw size*strlen single letters, then reinterpret them as `size` strings.
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

keywords = make_random_str_array(size=99)
arr = np.random.choice(keywords, size=(1000, 5), replace=True)
ser = pd.Series([' '.join(row) for row in arr])
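Be sure to run the benchmark on your own machine, with a setup that resembles your use case. Results can vary with many factors, such as the size of the Series ser, the length of keywords, the hardware, the OS, the versions of NumPy, Pandas, and Python, and how they were compiled.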
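The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The example below is only an illustration: the data sizes (20 keywords, 200 rows), the seed, and the repeat count are arbitrary assumptions, and regex=False is added so the keywords are treated as plain substrings.

```python
import string
import timeit

import numpy as np
import pandas as pd


def using_contains(ser, keywords):
    # One boolean column per keyword, combined side by side and cast to 0/1.
    return pd.concat([ser.str.contains(k, regex=False) for k in keywords],
                     axis=1, keys=keywords).astype(int)


# Small synthetic data set; adjust the sizes to resemble the real workload.
rng = np.random.default_rng(0)
letters = list(string.ascii_letters)
keywords = [''.join(rng.choice(letters, size=10)) for _ in range(20)]
rows = rng.choice(keywords, size=(200, 5))
ser = pd.Series([' '.join(row) for row in rows])

n_calls = 5
elapsed = timeit.timeit(lambda: using_contains(ser, keywords), number=n_calls)
print('using_contains: {:.1f} ms per call'.format(1000 * elapsed / n_calls))
```

Swapping in the other candidate functions and the real Series gives numbers that reflect the actual environment rather than the figures quoted above.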
Answer 1 (score: 8)
You can call str.contains in a loop.
print(df)
labels
0 created the tower
1 destroyed the tower
2 created the swimming pool
3 destroyed the swimming pool
req = ['created', 'destroyed', 'tower', 'swimming pool']
out = pd.concat([df['labels'].str.contains(x) for x in req], axis=1, keys=req).astype(int)
print(out)
created destroyed tower swimming pool
0 1 0 1 0
1 0 1 1 0
2 1 0 0 1
3 0 1 0 1
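One caveat worth illustrating: str.contains interprets its argument as a regular expression by default, so a keyword such as c++ or (python) would either raise an error or match something unintended, and missing labels produce NaN, which cannot be cast to int. The sketch below is an assumption-labeled variant, not the answerer's code. It passes regex=False and na=False, both existing parameters of str.contains; the sample data is invented.

```python
import pandas as pd

# Invented sample: keywords with regex metacharacters and one missing label.
labels = pd.Series(['uses c++ daily', 'uses (python)', None])
req = ['c++', '(python)']

out = pd.concat(
    # regex=False: treat each keyword as a literal substring.
    # na=False: report a missing label as "not found" instead of NaN.
    [labels.str.contains(x, regex=False, na=False) for x in req],
    axis=1, keys=req,
).astype(int)
print(out)
```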
Answer 2 (score: 7)
Use numpy.core.defchararray.find and NumPy broadcasting:
from numpy.core.defchararray import find
v = df['labels'].values.astype(str)
l = ['created','tower','destroyed','swimming pool']
pd.DataFrame(
(find(v[:, None], l) >= 0).astype(int),
df.index, l
)
created tower destroyed swimming pool
index
1 1 1 0 0
2 0 1 1 0
3 1 0 0 1
4 0 0 1 1
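find broadcasts the str.find function across the dimensions of the string arrays we provide. find returns the position in the first string at which the second string is first found, or -1 if it is not found. So we can tell whether a string was found by checking whether the return value of find is greater than or equal to 0.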
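To make the broadcasting step concrete, here is a small sketch that prints the raw positions before they are compared with 0. It uses np.char.find, the public alias of the same function, which is a substitution on my part rather than what the answer wrote; the variable names are illustrative.

```python
import numpy as np

labels = np.array(['created the tower',
                   'destroyed the tower',
                   'created the swimming pool',
                   'destroyed the swimming pool'])
keywords = np.array(['created', 'tower', 'destroyed', 'swimming pool'])

# labels[:, None] has shape (4, 1) and keywords has shape (4,), so the two
# broadcast to a (4, 4) grid: one row per label, one column per keyword.
positions = np.char.find(labels[:, None], keywords)
print(positions)
# [[ 0 12 -1 -1]
#  [-1 14  0 -1]
#  [ 0 -1 -1 12]
#  [-1 -1  0 14]]

# Any position >= 0 means the keyword occurs somewhere in the label.
found = (positions >= 0).astype(int)
print(found)
```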
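Answer 3 (score: 4)
In your case, if the word that separates the parts is the, you can use the following to achieve it. (PS: this relies on the labels being split by the word the.)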
pd.get_dummies(df['labels'].str.split('the').apply(pd.Series))
Out[424]:
0_created 0_destroyed 1_ swimming pool 1_ tower
0 1 0 0 1
1 0 1 0 1
2 1 0 1 0
3 0 1 1 0
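The column names in the output above carry a positional prefix and leftover spaces (for example "1_ tower"). A possible cleanup, sketched here as an assumption rather than as the answerer's method, is to split on ' the ' including the surrounding spaces and to pass an empty prefix and prefix_sep to get_dummies. This only works if no value appears in both split positions, which holds for the sample data.

```python
import pandas as pd

df = pd.DataFrame({'labels': ['created the tower',
                              'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})

# Split on ' the ' (with spaces) so the parts carry no stray whitespace;
# expand=True returns the parts as DataFrame columns 0 and 1.
parts = df['labels'].str.split(' the ', expand=True)

# Empty prefix and separator leave just the keyword as the column name.
dummies = pd.get_dummies(parts, prefix='', prefix_sep='').astype(int)
print(dummies)
```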