Pandas DataFrame: count the number of ids based on the occurrence of words in a text column

Date: 2015-10-29 06:21:06

Tags: python string pandas dataframe

I have a pandas DataFrame like this:

id   comment

1    its not proper
2    improvement needed
3    organization is proper
4    registration not done
5    timelines not proper

For the words ['proper','organization','done'], I want to count the number of ids whose comment contains each word. So the output should look like this:

proper         3
organization   1
done           1

I tried this using for loops:

word_list = ['proper','organization','done']
final_list = {'proper':0,'organization':0,'done':0}
for index,row in data.iterrows():
    for word in word_list:
        if word in row['comment'].split(' '):
            final_list[word] += 1

Is there a way to do this without using any for loops?

5 Answers:

Answer 0 (score: 3)

You can sum the boolean values returned by str.contains in a list comprehension over words:

In [23]: words = ['proper','organization','done']

In [24]: pd.DataFrame([[wrd, df['comment'].str.contains(wrd).sum()] for wrd in words])
Out[24]:
              0  1
0        proper  3
1  organization  1
2          done  1

Answer 1 (score: 1)

You can use the .str accessor and its .str.split() method to split the strings in comment. df['comment'].str.split().values gives you an array of lists of words. Example -

In [35]: df['comment'].str.split().values
Out[35]:
array([['its', 'not', 'proper'], ['improvement', 'needed'],
       ['organization', 'is', 'proper'], ['registration', 'not', 'done'],
       ['timelines', 'not', 'proper']], dtype=object)

You can then use collections.Counter to count the strings you are interested in. Example -

from collections import Counter

word_set = {'proper','organization','done'}
result = Counter(x for lst in df['comment'].str.split().values
                   for x in lst if x in word_set)

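This does not eliminate the for loop; it uses a generator expression instead, which can be a bit faster than a conventional for loop.

Using word_set also helps with speed, because a membership test on a set takes constant time on average, whereas searching a list is O(n).

Demo -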

In [34]: df
Out[34]:
   id                 comment
0   1          its not proper
1   2      improvement needed
2   3  organization is proper
3   4   registration not done
4   5    timelines not proper

In [35]: df['comment'].str.split().values
Out[35]:
array([['its', 'not', 'proper'], ['improvement', 'needed'],
       ['organization', 'is', 'proper'], ['registration', 'not', 'done'],
       ['timelines', 'not', 'proper']], dtype=object)

In [36]: word_set = {'proper','organization','done'}

In [37]: result = Counter(x for lst in df['comment'].str.split().values
   ....:                    for x in lst if x in word_set)

In [38]: result
Out[38]: Counter({'proper': 3, 'done': 1, 'organization': 1})
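
One thing to watch with the Counter approach: it counts every occurrence of a word, whereas the question asks for the number of ids. The two only differ when a word repeats within a single comment. Below is a minimal sketch of both counts; the last row of the sample is a hypothetical modification (not from the question) so that the difference shows up.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "proper timelines are proper",  # hypothetical row: 'proper' appears twice
    ],
})
word_set = {"proper", "organization", "done"}

# Counts every occurrence, so the last row contributes 2 to 'proper'.
occurrences = Counter(
    w for words in df["comment"].str.split() for w in words if w in word_set
)

# Counts each row at most once per word by intersecting the row's
# distinct words with word_set.
rows_with_word = Counter(
    w for words in df["comment"].str.split() for w in set(words) & word_set
)

print(occurrences["proper"])     # 4
print(rows_with_word["proper"])  # 3
```

The set intersection keeps the constant-time membership test and drops repeats within a row in the same step.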

Answer 2 (score: 1)

Edit: get_dummies can also do this without any for loop:

df['comment'].str.get_dummies(' ').sum()[['proper','organization','done']]

Out[151]: 
proper          3
organization    1
done            1

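Note: filtering after the sum was meant to handle missing words neatly. That relies on the pandas behavior of the time, where selecting an absent label returned NaN; current pandas versions raise a KeyError instead, so .reindex is the safer way to select words that may not occur.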
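
As a rough sketch of how a word that never occurs could be handled with the get_dummies approach: reindex the summed result instead of indexing it with a list. The word 'missing' is a hypothetical addition used only to show the behavior.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
words = ["proper", "organization", "done", "missing"]  # 'missing' is hypothetical

# One indicator column per distinct token; each cell is 0 or 1 for that row.
indicators = df["comment"].str.get_dummies(sep=" ")

# reindex fills labels that are absent from the result with 0
# instead of raising a KeyError.
counts = indicators.sum().reindex(words, fill_value=0)
print(counts)
```

Because the indicator is 0 or 1 per row, the column sums are row counts, which is what the question asks for.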

Original answer: All the answers so far apparently use a for loop. One way to avoid it is to use pd.value_counts:

df['comment'].str.split().apply(pd.value_counts)[['proper','organization','done']]

Out[149]: 
   proper  organization  done
0       1           NaN   NaN
1     NaN           NaN   NaN
2       1             1   NaN
3     NaN           NaN     1
4       1           NaN   NaN

All you have to do then is sum the resulting DataFrame:

_.sum()

Out[150]: 
proper          3
organization    1
done            1

The code only needs adjusting if one of the words in the list does not appear anywhere in the text.
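
For newer pandas versions, a similar result can be sketched with Series.explode (available from pandas 0.25) rather than apply(pd.value_counts). This is an alternative under that assumption, not the method used above; the word 'missing' is hypothetical and only illustrates the adjustment for absent words.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "comment": [
        "its not proper",
        "improvement needed",
        "organization is proper",
        "registration not done",
        "timelines not proper",
    ],
})
words = ["proper", "organization", "done", "missing"]  # 'missing' is hypothetical

# One token per row; the index repeats the label of the originating row.
tokens = df["comment"].str.split().explode()

# Keep each (row, token) pair once so a row counts at most once per word.
pairs = tokens.reset_index().drop_duplicates()

# Count rows per token, then select the wanted words, filling absent ones with 0.
counts = pairs["comment"].value_counts().reindex(words, fill_value=0)
print(counts)
```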

Answer 3 (score: 1)

In [105]:
words = ['proper','organization','done']
for word in words:
    df[word] = df.comment.str.contains('\\b' + word + '\\b' , case = True , regex = True)

Out[105]:
comment                         proper  organization    done
its not proper                   True   False          False
improvement needed               False  False          False
organization is proper           True   True           False
registration not done            False  False          True
timelines not proper             True   False          False

In [103]:    
df.iloc[: , 1:].sum()
Out[103]:
proper          3
organization    1
done            1
dtype: int64
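
The word boundaries matter because a plain str.contains test also matches substrings. Here is a small sketch of the difference that builds the counts without adding columns to df. The sample comments are hypothetical (one contains 'improper'), and re.escape is an extra precaution in case a search word contains regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical sample: 'improper' contains 'proper' as a substring.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "comment": ["its not proper", "this is improper", "organization is proper"],
})
words = ["proper", "organization", "done"]

# Plain substring test: 'improper' also counts as a hit for 'proper'.
substring_counts = pd.Series(
    {w: df["comment"].str.contains(w, regex=False).sum() for w in words}
)

# Whole-word test: \b anchors the match at word boundaries.
whole_word_counts = pd.Series(
    {
        w: df["comment"].str.contains(rf"\b{re.escape(w)}\b", regex=True).sum()
        for w in words
    }
)

print(substring_counts["proper"])   # 3
print(whole_word_counts["proper"])  # 2
```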

Answer 4 (score: 0)

You can use the .str.contains() method in pandas:

import pandas as pd

cols = ['id', 'comment']
data = [[1, 'its not proper'],
        [2, 'improvement needed'],
        [3, 'organization is proper'],
        [4, 'registration not done'],
        [5, 'timelines not proper']]
df = pd.DataFrame(data, columns=cols)
word_list = ['proper','organization','done']
row_counts = {word: df[df.comment.str.contains(word)].shape[0]
              for word in word_list}
print(row_counts)
# output is:
# {'proper': 3, 'organization': 1, 'done': 1}