Question

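Update

Despite rigorous cleaning, some words with periods are still being tokenized with the period intact, including strings where the period is separated from a quotation mark by a space. I've put an example of the problem in a Jupyter notebook at this public link: https://drive.google.com/file/d/0B90qb2J7ZLYrZmItME5RRlhsVWM/view?usp=sharing

Or a shorter example: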

word_tokenize('This is a test. "')
['This', 'is', 'a', 'test.', '``']

But the problem disappears when a different type of double quote is used:

word_tokenize('This is a test. ”')
['This', 'is', 'a', 'test', '.', '”']

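Original

I'm processing a large amount of text and building a counter to see the count of each word, then transferring that counter to a dataframe for easier handling. Each piece of text is a large string of between 100 and 5000 words. The dataframe with the word counts looks like this, taking only the words with a count of 11 as an example: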

allwordsdf[(allwordsdf['count'] == 11)]


        words          count
551     throughlin     11
1921    rampd          11
1956    pinhol         11
2476    reckhow        11

What I've noticed is that a lot of words are not fully stemmed and still have a period attached to the end. For example:

4233    activist.   11
9243    storyline.  11

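I'm not sure what is causing this. I know that periods are usually split off into their own token, because the row for the period reads: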

23  .   5702880

Also, it doesn't seem to happen for every instance of "activist.":

len(articles[articles['content'].str.contains('activist.')])
9600
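
One caveat about the check above: `str.contains` treats its pattern as a regular expression by default, so the unescaped `.` in 'activist.' matches any character, and the 9600 may overcount (for example, rows that only contain "activists"). The sketch below illustrates the difference with the standard `re` module on made-up sample strings; in pandas, passing an escaped pattern or `regex=False` should give the literal match.

```python
import re

# Made-up sample strings, for illustration only.
samples = [
    "She is an activist.",
    "Several activists spoke.",
    "The activist, a teacher, left.",
]

# Unescaped: '.' matches any character, so 'activists' and 'activist,' count too.
loose = [s for s in samples if re.search('activist.', s)]

# Escaped: only a literal period directly after 'activist' matches.
strict = [s for s in samples if re.search(re.escape('activist.'), s)]

print(len(loose), len(strict))
```
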

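I'm not sure whether I'm overlooking something. Yesterday I ran into a problem with the NLTK stemmer that was a bug, and I don't know whether this is another one or something I'm doing wrong (always more likely).

Thanks for any guidance.

Edit:

Here is the function I'm using: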

import sys
import time

from nltk import word_tokenize

# `stemmer` (an NLTK stemmer instance) and `articles` (the dataframe) are defined elsewhere.
progress = 0
start = time.time()

def stem(x):
    """Tokenize and stem one article, printing rough progress as a side effect."""
    global start
    global progress
    end = time.time()
    tokens = word_tokenize(x)
    progress += 1
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        progress / len(articles), progress, 1 / (end - start)))

    stems = [stemmer.stem(e) for e in tokens]
    start = time.time()
    return stems


articles['stems'] = articles.content.apply(stem)

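Edit 2:

Here is a JSON of some of the data: all the strings, tokens and stems.

And this is what I get when I look for all the words that still contain a period after tokenizing and stemming: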

allwordsdf[allwordsdf['words'].str.contains(r'\.')]  # dataframe made from the counter dict

      words       count
23    .           5702875
63    years.      1231
497   was.        281
798   lost.       157
817   jie.        1
819   teacher.    24
858   domains.    1
875   fallout.    3
884   net.        23
889   option.     89
895   step.       67
927   pool.       30
936   that.       4245
954   compute.    2
1001  dr.         11007
1010  decisions.  159
The length of this slice is about 49,000.
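
The same check can be reproduced without the dataframe by counting tokens directly and keeping those that end with a period but are longer than one character. This is a minimal sketch using `collections.Counter`; `token_lists` is hypothetical stand-in data, not the actual corpus.

```python
from collections import Counter

# Hypothetical stand-in for the per-article token lists.
token_lists = [
    ["the", "lost.", "”", "step", "."],
    ["dr.", "said", "the", "lost.", "”"],
]

# Count every token across all articles.
counts = Counter(tok for tokens in token_lists for tok in tokens)

# Tokens that still carry a trailing period (excluding the bare '.').
stuck = {tok: n for tok, n in counts.items() if len(tok) > 1 and tok.endswith(".")}
print(stuck)
```
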

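Edit 3:

Alvas's answer cut the number of words with periods roughly in half, to about 24,000 unique words with a total count of 518,980, which is still a lot. The problem, as I discovered, is that it happens every time a period is followed by a quotation mark. For example, take the string 'sickened.', which appears once in the tokenized words.

If I search the corpus: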

articles[articles['content'].str.contains(r'sickened\.[^\s]')]

The only place it appears in the entire corpus is:

...said he was “sickened.” Trump's running mate...

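This is not an isolated incident; I see it over and over when searching for these terms. Every time, there is a quotation mark involved. The tokenizer fails not only on character-period-quotation-character sequences, but also on character-period-quotation-whitespace.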
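
To see how widespread the pattern is, one option is to pull out every word that is followed by a period and then a quotation mark. This is a minimal sketch with the standard `re` module; the sample string is adapted from the excerpt above, and the set of quote characters in the pattern is an assumption that may need extending.

```python
import re

# A word, a literal period, then a straight or curly double quote.
PERIOD_QUOTE = re.compile(r'(\w+)\.(["“”])')

sample = 'said he was “sickened.” Trump\'s running mate said "fine." Then he left.'

# Each hit is a (word, quote character) pair.
hits = PERIOD_QUOTE.findall(sample)
print(hits)
```
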

Answer 1

You need to tokenize the string before stemming:

>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> text = 'This is a foo bar sentence, that contains punctuations.'
>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in text.split()]
[u'thi', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', u'contain', 'punctuations.']
>>> [porter.stem(word) for word in word_tokenize(text)]
[u'thi', 'is', 'a', 'foo', 'bar', u'sentenc', ',', 'that', u'contain', u'punctuat', '.']

In a dataframe:

porter = PorterStemmer()
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

>>> import pandas as pd
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
>>> df = pd.DataFrame(sents, columns=['content'])
>>> df
                        content
0  This is a foo bar, sentence.
1         Yet another, foo bar!

# Apply tokenizer.
>>> df['tokens'] = df['content'].apply(word_tokenize)
>>> df
                        content                                   tokens
0  This is a foo bar, sentence.  [This, is, a, foo, bar, ,, sentence, .]
1         Yet another, foo bar!           [Yet, another, ,, foo, bar, !]

# Without DataFrame.apply
>>> df['tokens'][0]
['This', 'is', 'a', 'foo', 'bar', ',', 'sentence', '.']
>>> [porter.stem(word) for word in df['tokens'][0]]
[u'thi', 'is', 'a', 'foo', 'bar', ',', u'sentenc', '.']

# With DataFrame.apply
>>> df['tokens'].apply(lambda row: [porter.stem(word) for word in row])
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]

# Or if you like nested lambdas (list() is needed on Python 3, where map is lazy).
>>> df['tokens'].apply(lambda x: list(map(lambda y: porter.stem(y), x)))
0    [thi, is, a, foo, bar, ,, sentenc, .]
1             [yet, anoth, ,, foo, bar, !]

Answer 2

The code from the answer above works for clean text:

porter = PorterStemmer()
sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
articles = pd.DataFrame(sents, columns=['content'])
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

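Looking at the JSON file, your data is very dirty. Most likely, when you scraped the text from the website, you didn't put a space between the <p>...</p> tags or between the sections you were extracting, which leads to chunks of text such as:

“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”

Note that in many places a quotation mark comes directly after a full stop with no space before the next word, e.g. domains.”AlphaGo.

If you use the default NLTK word_tokenize function on this, you get domains., ”, and AlphaGo; i.e.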

>>> from nltk import word_tokenize

>>> text = u"""“So [now] AlphaGo actually learns from its own searches to improve its neural networks, both the policy network and the value network, and this makes it learn in a much more general way. One of the things we’re most excited about is not just that it can play Go better but we hope that this’ll actually lead to technologies that are more generally applicable to other challenging domains.”AlphaGo is comprised of two networks: a policy network that selects the next move to play, and a value network that analyzes the probability of winning. The policy network was initially based on millions of historical moves from actual games played by Go professionals. But AlphaGo Master goes much further by searching through the possible moves that could occur if a particular move is played, increasing its understanding of the potential fallout.“The original system played against itself millions of times, but it didn’t have this component of using the search,” Hassabis tells The Verge. “[AlphaGo Master is] using its own strength to improve its own predictions. So whereas in the previous version it was mostly about generating data, in this version it’s actually using the power of its own search function and its own abilities to improve one part of itself, the policy net.”"""

>>> word_tokenize(text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', u'version', 
u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']

>>> 'domains.' in word_tokenize(text)
True

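So there are a few ways to resolve this. Here are a couple:

Try cleaning up your data before feeding it to the word_tokenize function, e.g. by padding the punctuation with spaces first
Try another tokenizer, e.g. MosesTokenizer

Padding the punctuation with spaces first: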

>>> import re
>>> clean_text = re.sub('([.,!?()])', r' \1 ', text)
>>> word_tokenize(clean_text)
[u'\u201c', u'So', u'[', u'now', u']', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'[', u'AlphaGo', u'Master', u'is', u']', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', u'previous', 
u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in word_tokenize(clean_text)
False
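
The character class above does not include quotation marks or square brackets. If those also need separating, the same idea extends naturally. The following is a sketch, not part of the original answer: `pad_punctuation` is a hypothetical helper, and like the regex above it will also split abbreviations such as "Dr." and decimals such as "3.5", so it may be too aggressive for some corpora.

```python
import re

# Punctuation to isolate: the original set plus brackets and straight/curly double quotes.
PUNCT = re.compile(r'([.,!?()\[\]"“”])')

def pad_punctuation(text):
    """Surround each punctuation mark with spaces, then collapse repeated whitespace."""
    padded = PUNCT.sub(r' \1 ', text)
    return re.sub(r'\s+', ' ', padded).strip()

print(pad_punctuation('challenging domains.”AlphaGo is'))
```

The padded string can then be passed to word_tokenize in place of the raw text.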

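Using MosesTokenizer (note that recent NLTK releases no longer ship the nltk.tokenize.moses module; the tokenizer is available in the separate sacremoses package):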

>>> from nltk.tokenize.moses import MosesTokenizer
>>> mo = MosesTokenizer()
>>> mo.tokenize(text)
[u'\u201c', u'So', u'&#91;', u'now', u'&#93;', u'AlphaGo', u'actually', u'learns', u'from', u'its', u'own', u'searches', u'to', u'improve', u'its', u'neural', u'networks', u',', u'both', u'the', u'policy', u'network', u'and', u'the', u'value', u'network', u',', u'and', u'this', u'makes', u'it', u'learn', u'in', u'a', u'much', u'more', u'general', u'way', u'.', u'One', u'of', u'the', u'things', u'we', u'\u2019', u're', u'most', u'excited', u'about', u'is', u'not', u'just', u'that', u'it', u'can', u'play', u'Go', u'better', u'but', u'we', u'hope', u'that', u'this', u'\u2019', u'll', u'actually', u'lead', u'to', u'technologies', u'that', u'are', u'more', u'generally', u'applicable', u'to', u'other', u'challenging', u'domains', u'.', u'\u201d', u'AlphaGo', u'is', u'comprised', u'of', u'two', u'networks', u':', u'a', u'policy', u'network', u'that', u'selects', u'the', u'next', u'move', u'to', u'play', u',', u'and', u'a', u'value', u'network', u'that', u'analyzes', u'the', u'probability', u'of', u'winning', u'.', u'The', u'policy', u'network', u'was', u'initially', u'based', u'on', u'millions', u'of', u'historical', u'moves', u'from', u'actual', u'games', u'played', u'by', u'Go', u'professionals', u'.', u'But', u'AlphaGo', u'Master', u'goes', u'much', u'further', u'by', u'searching', u'through', u'the', u'possible', u'moves', u'that', u'could', u'occur', u'if', u'a', u'particular', u'move', u'is', u'played', u',', u'increasing', u'its', u'understanding', u'of', u'the', u'potential', u'fallout', u'.', u'\u201c', u'The', u'original', u'system', u'played', u'against', u'itself', u'millions', u'of', u'times', u',', u'but', u'it', u'didn', u'\u2019', u't', u'have', u'this', u'component', u'of', u'using', u'the', u'search', u',', u'\u201d', u'Hassabis', u'tells', u'The', u'Verge', u'.', u'\u201c', u'&#91;', u'AlphaGo', u'Master', u'is', u'&#93;', u'using', u'its', u'own', u'strength', u'to', u'improve', u'its', u'own', u'predictions', u'.', u'So', u'whereas', u'in', u'the', 
u'previous', u'version', u'it', u'was', u'mostly', u'about', u'generating', u'data', u',', u'in', u'this', u'version', u'it', u'\u2019', u's', u'actually', u'using', u'the', u'power', u'of', u'its', u'own', u'search', u'function', u'and', u'its', u'own', u'abilities', u'to', u'improve', u'one', u'part', u'of', u'itself', u',', u'the', u'policy', u'net', u'.', u'\u201d']
>>> 'domains.' in mo.tokenize(text)
False

TL;DR

Use:

from nltk.tokenize.moses import MosesTokenizer
mo = MosesTokenizer()
articles['tokens'] = articles['content'].apply(mo.tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])

Or:

articles['clean'] = articles['content'].apply(lambda x: re.sub('([.,!?()])', r' \1 ', x))
articles['tokens'] = articles['clean'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
