I have a similar pipeline:
The problem is that multi-word stop words (phrases) are not removed (possibly because tokenization happens first?).
Full example:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as ESW, CountVectorizer
# Make sure we have the corpora used by nltk's lemmatizer
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
# "Naive" token pattern similar to the one used by sklearn
TOKEN = re.compile(r'\b\w{2,}\b')
# Tokenize, then lemmatize these tokens
# Modified from:
# http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return (self.wnl.lemmatize(t) for t in TOKEN.findall(doc))
# Add 1 more phrase to sklearn's stop word list
sw = ESW.union(frozenset(['sinclair broadcast group']))
vect = CountVectorizer(stop_words=sw, ngram_range=(1, 4),
                       tokenizer=LemmaTokenizer())
# Nonsense example documents
docs = ["""And you ask Why You Are Sinclair Broadcast Group is Asking It""",
"""Why are you asking what Sinclair Broadcast Group and you"""]
tf = vect.fit_transform(docs)
To reiterate: the single-word stop words are correctly removed, but the phrase remains:
vect.get_feature_names()
# ['ask',
# 'ask sinclair',
# 'ask sinclair broadcast',
# 'ask sinclair broadcast group',
# 'asking',
# 'asking sinclair',
# 'asking sinclair broadcast',
# 'asking sinclair broadcast group',
# 'broadcast',
# 'broadcast group',
# 'broadcast group asking',
# 'group',
# 'group asking',
# 'sinclair',
# 'sinclair broadcast',
# 'sinclair broadcast group',
# 'sinclair broadcast group asking']
How can I fix this?
Answer 0: (score: 1)
From the documentation of CountVectorizer:
stop_words : string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
Looking further at the parameter token_pattern:
token_pattern : string
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
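As a minimal illustration (a standalone sketch using the regexp quoted above, outside of sklearn), applying the default token pattern to the phrase shows that it can only ever produce three separate tokens, so the stop word lookup never sees the phrase as a whole:

```python
import re

# The default token_pattern documented by sklearn
token_pattern = re.compile(r'(?u)\b\w\w+\b')

tokens = token_pattern.findall('sinclair broadcast group')
print(tokens)  # ['sinclair', 'broadcast', 'group']

# Stop word removal compares each token individually, so the
# three-word phrase is never matched against the stop word list
print('sinclair broadcast group' in tokens)  # False
```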
So a stop word is removed only if the result of analyzer(token) equals 'sinclair broadcast group'. But the default analyzer is 'word', which means stop word detection applies only to individual words, since tokens are defined by the default token_pattern described above.
Tokens are not n-grams (rather, n-grams are composed of tokens, and stop word removal happens at the token level, before n-grams are constructed).
As a quick check, you can change your custom stop word to 'sinclair' for the experiment: the word is then correctly removed, because it is being treated as an isolated word.
In other words, you would need to pass your own callable as analyzer in order to apply the analyzer logic to n-grams, checking for the phrase manually. The default behavior assumes that stop word detection applies only to individual words, not to n-grams.
Here is an example of a custom analyzer function for your case. It is based on this answer ... Note that I haven't tested it, so there may be bugs.
def trigram_match(i, trigram, words):
    """Check whether words[i] is part of an occurrence of the trigram."""
    if i < len(words) - 2 and words[i:i + 3] == trigram:
        return True
    if (i > 0 and i < len(words) - 1) and words[i - 1:i + 2] == trigram:
        return True
    if i > 1 and words[i - 2:i + 1] == trigram:
        return True
    return False

def custom_analyzer(text):
    # The phrase to remove is 'sinclair broadcast group', so the
    # trigram must use 'broadcast', not 'broadcasting'
    bad_trigram = ['sinclair', 'broadcast', 'group']
    words = [str.lower(w) for w in re.findall(r'\w{2,}', text)]
    for i, w in enumerate(words):
        if w in sw or trigram_match(i, bad_trigram, words):
            continue
        yield w
Answer 1: (score: 0)
Here is a custom analyzer that works for me. It is a bit clunky, but it does all the text processing in one place and is reasonably fast: