I have a pandas DataFrame like this:
id comment
1 its not proper
2 improvement needed
3 organization is proper
4 registration not done
5 timelines not proper
For the words ['proper','organization','done'], I want to count the number of ids (rows) in which each word appears. So the output should look like this:
proper 3
organization 1
done 1
I tried this using a for loop:
word_list = ['proper','organization','done']
final_list = {'proper':0,'organization':0,'done':0}
for index,row in data.iterrows():
    for word in word_list:
        if word in row['comment'].split(' '):
            final_list[word] += 1
Is there a way to do this without using any for loop?
Answer 0 (score: 3)
You can sum the bool values returned by str.contains in a list comprehension over words:
In [23]: words = ['proper','organization','done']
In [24]: pd.DataFrame([[wrd, df['comment'].str.contains(wrd).sum()] for wrd in words])
Out[24]:
              0  1
0        proper  3
1  organization  1
2          done  1
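One thing worth checking with this approach: str.contains matches substrings (and treats the pattern as a regular expression by default), so a short word can match inside a longer one. Here is a minimal sketch of the difference, assuming the sample data from the question. The probe word "time" is my own addition for illustration and is not part of the original word list.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
words = ["proper", "organization", "done"]

# Plain substring match (regex=False): a row counts if the text appears anywhere.
substring_counts = pd.Series(
    {w: int(df["comment"].str.contains(w, regex=False).sum()) for w in words}
)

# Whole-word match: \b anchors the pattern at word boundaries,
# and re.escape protects words that contain regex metacharacters.
whole_word_counts = pd.Series(
    {
        w: int(df["comment"].str.contains(r"\b" + re.escape(w) + r"\b").sum())
        for w in words
    }
)

# Hypothetical probe word: "time" is a substring of "timelines".
substring_hits = int(df["comment"].str.contains("time", regex=False).sum())   # 1
whole_word_hits = int(df["comment"].str.contains(r"\btime\b").sum())          # 0
```

For the three words in the question both variants give the same counts, so the difference only matters when a search word can occur inside another word.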
Answer 1 (score: 1)
You can use the .str accessor and then the .str.split() function to split the strings in comment. Using df['comment'].str.split().values gives you an array of word lists. Example -
In [35]: df['comment'].str.split().values
Out[35]:
array([['its', 'not', 'proper'], ['improvement', 'needed'],
['organization', 'is', 'proper'], ['registration', 'not', 'done'],
['timelines', 'not', 'proper']], dtype=object)
You can then use collections.Counter to count the required strings. Example -
from collections import Counter
word_set = {'proper','organization','done'}
result = Counter(x for lst in df['comment'].str.split().values
                 for x in lst if x in word_set)
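This does not remove the for loop; it uses a generator expression instead, which can be a little faster than a traditional for loop.
Using word_set also makes it faster, because a lookup in a set takes constant time on average, whereas searching a list is O(n).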
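Note that the Counter above counts every occurrence, so a word repeated within one comment is counted more than once. If the goal is strictly the number of rows (ids), a possible variation is to deduplicate each row's words first. This is only a sketch; count_rows is a hypothetical helper name, not something from pandas or from the answer.

```python
from collections import Counter

import pandas as pd


def count_rows(comments, word_set):
    """Count, for each word in word_set, the number of comments containing it."""
    # set(lst) & word_set keeps each target word at most once per comment,
    # so the Counter ends up holding row counts rather than occurrence counts.
    return Counter(
        word
        for lst in comments.str.split()
        for word in set(lst) & word_set
    )


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
word_set = {"proper", "organization", "done"}
result = count_rows(df["comment"], word_set)
```

On the sample data this gives the same numbers as the original Counter, since no comment repeats a word.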
Demo -
In [34]: df
Out[34]:
id comment
0 1 its not proper
1 2 improvement needed
2 3 organization is proper
3 4 registration not done
4 5 timelines not proper
In [35]: df['comment'].str.split().values
Out[35]:
array([['its', 'not', 'proper'], ['improvement', 'needed'],
['organization', 'is', 'proper'], ['registration', 'not', 'done'],
['timelines', 'not', 'proper']], dtype=object)
In [36]: word_set = {'proper','organization','done'}
In [37]: result = Counter(x for lst in df['comment'].str.split().values
....: for x in lst if x in word_set)
In [38]: result
Out[38]: Counter({'proper': 3, 'done': 1, 'organization': 1})
Answer 2 (score: 1)
Edit: get_dummies can also do this without any for loop:
df['comment'].str.get_dummies(' ').sum()[['proper','organization','done']]
Out[151]:
proper 3
organization 1
done 1
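Note: filtering after the sum handles missing words neatly, at least on the pandas version used here; recent versions raise a KeyError when a requested label is absent.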
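A sketch of making the missing-word case explicit with reindex, which fills absent labels instead of depending on how label lookup behaves in a given pandas version. The extra word "missing" is a made-up example and is not in the question's list.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
# "missing" is a hypothetical word that appears in no comment.
words = ["proper", "organization", "done", "missing"]

counts = (
    df["comment"]
    .str.get_dummies(sep=" ")        # one 0/1 indicator column per distinct word
    .sum()                           # number of rows containing each word
    .reindex(words, fill_value=0)    # keep only the requested words; absent ones become 0
)
```

Because get_dummies produces 0/1 indicators per row, a word repeated inside one comment is still counted once for that row.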
Original answer: all the answers so far clearly use a for loop. One way to avoid it is to use pd.value_counts:
df['comment'].str.split().apply(pd.value_counts)[['proper','organization','done']]
Out[149]:
proper organization done
0 1 NaN NaN
1 NaN NaN NaN
2 1 1 NaN
3 NaN NaN 1
4 1 NaN NaN
All you have to do then is sum the resulting DataFrame:
_.sum()
Out[150]:
proper 3
organization 1
done 1
The code only needs adjusting if a word in the list does not appear anywhere in the text.
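One possible adjustment for words that never appear, sketched with Series.explode (available from pandas 0.25, as far as I know) instead of apply(pd.value_counts). This is a swapped-in technique, not the code from the answer, and "absent" is a made-up word used only to show the zero count.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
# "absent" is a hypothetical word that appears in no comment.
words = ["proper", "organization", "done", "absent"]

# One row per (original row, word); the original row label is kept in the index.
tokens = df["comment"].str.split().explode()

# Drop repeated words within the same original row so each row counts once.
unique_per_row = tokens.reset_index().drop_duplicates()

# Count rows per word, then keep the requested words with 0 for absent ones.
counts = unique_per_row["comment"].value_counts().reindex(words, fill_value=0)
```

The reindex call with fill_value=0 is what handles the missing word; without it, selecting "absent" from the counts would fail or give NaN depending on the pandas version.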
Answer 3 (score: 1)
In [105]:
words = ['proper','organization','done']
for word in words:
    df[word] = df.comment.str.contains('\\b' + word + '\\b', case=True, regex=True)
Out[105]:
comment proper organization done
its not proper True False False
improvement needed False False False
organization is proper True True False
registration not done False False True
timelines not proper True False False
In [103]:
df.iloc[: , 1:].sum()
Out[103]:
proper 3
organization 1
done 1
dtype: int64
Answer 4 (score: 0)
You can use the .str.contains() method in pandas:
import pandas as pd
cols = ['id', 'comment']
data = [[1, 'its not proper'],
        [2, 'improvement needed'],
        [3, 'organization is proper'],
        [4, 'registration not done'],
        [5, 'timelines not proper']]
df = pd.DataFrame(data, columns=cols)
word_list = ['proper','organization','done']
# Count the rows whose comment contains each word (substring match)
row_counts = {word: df[df.comment.str.contains(word)].shape[0]
              for word in word_list}
print(row_counts)
# output is:
# {'proper': 3, 'organization': 1, 'done': 1}