Counting the number of words between punctuation marks in Python

Time: 2014-01-18 19:49:41

Tags: python parsing text package text-analysis

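I want to use Python to count the number of words that occur between certain punctuation characters in a block of text input. For example, such an analysis of everything written up to this point might be represented as: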

[23,2,14]

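...because the first sentence, which has no punctuation other than the period at its end, has 23 words; the "For example" phrase that follows has two; and the remainder, which ends with a colon, has 14.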

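This probably wouldn't be too hard to write by hand, but (in keeping with the "don't reinvent the wheel" philosophy, which seems particularly Pythonic) is there anything already out there that is especially well suited to this task?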

2 answers:

Answer 0: (score: 3)

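import re

punctuation_i_care_about = "?.!"
# some_big_block_of_text is the input string; split it at each listed mark.
split_by_punc = re.split("[%s]" % punctuation_i_care_about, some_big_block_of_text)
words_by_puct = [len(x.split()) for x in split_by_punc]  # words per segment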
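
One thing to watch for with the split above: if the text ends with one of the delimiters, `re.split` returns an empty final segment, which shows up as a trailing 0 in the counts. Here is a minimal sketch that wraps the same idea in a function and drops the empty segments. The name `count_words_between` and its `marks` parameter are made up for illustration, and `re.escape` is added only as a precaution so that marks such as `]` or `^` stay literal inside the character class.

```python
import re


def count_words_between(text, marks="?.!"):
    """Return the word count of each non-empty segment delimited by `marks`.

    Hypothetical helper, not part of any library.
    """
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    pattern = "[" + re.escape(marks) + "]"
    segments = re.split(pattern, text)
    counts = [len(segment.split()) for segment in segments]
    # A trailing delimiter leaves an empty final segment; drop zero counts.
    return [n for n in counts if n]


print(count_words_between("One two three. Four five! Six?"))  # [3, 2, 1]
```

Whether to drop the zeros depends on the use case: keeping them preserves a one-to-one mapping between delimiters and counts, so adjacent marks such as "?!" remain visible in the result.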

Answer 1: (score: 3)

Joran beat me to it, but I'll add my approach:

from string import punctuation
import re

s = 'I want to use Python to count the numbers of words that occur between certain punctuation characters in a block of text input. For example, such an analysis of everything written up to this point might be represented as'

gen = (x.split() for x in re.split('[' + punctuation + ']',s))

print(list(map(len, gen)))
# [23, 2, 14]
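
A caveat when splitting on all of `string.punctuation`: it includes the apostrophe and the hyphen, so contractions and hyphenated words are cut into separate segments. The sketch below compares that with a narrower set of marks. The sample sentence and the `clause_marks` set are assumptions chosen for illustration, and `re.escape` is used as a precaution so every character is taken literally inside the character class.

```python
import re
from string import punctuation

# Illustrative sample sentence (not from the question).
text = "Well, it's a state-of-the-art parser: don't you think?"

# Splitting on every character in string.punctuation also breaks at ' and -.
all_parts = re.split("[" + re.escape(punctuation) + "]", text)
counts_all = [len(part.split()) for part in all_parts if part.split()]

# Hypothetical narrower set: only marks that separate clauses.
clause_marks = ".,:;!?"
clause_parts = re.split("[" + re.escape(clause_marks) + "]", text)
counts_clause = [len(part.split()) for part in clause_parts if part.split()]

print(counts_all)     # [1, 1, 3, 1, 1, 2, 1, 3]
print(counts_clause)  # [1, 4, 3]
```

Which set is right depends on what "certain punctuation characters" means for the text being analyzed.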

(I like map