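I have a list of strings like this:
df_train = ["Hello John-Smith it is nine o'clock", "This is a completely random-sequence"]
I want sklearn's TfidfVectorizer to treat hyphenated words as single tokens. When I apply the following code, words joined by a hyphen (or other punctuation) are split into separate words: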
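from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\b\w\w+\b")
vectorizer_train.fit_transform(df_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vectorizer_train.get_feature_names()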
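I tried changing the token_pattern parameter, but without success. Any idea how to solve this? Also, is it possible to treat words separated by any punctuation as a single entity? (e.g. 'Hi.there How_are you:doing')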
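Answer 0 (score: 2)
It looks like you only need to split on whitespace. Try switching the pattern to (?u)\S\S+, which captures each run of two or more consecutive non-whitespace characters as a single token: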
df_train = ["Hello John-Smith it is nine o'clock",
            "This is a completely random-sequence",
            "Hi.there How_are you:doing"]
vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\S\S+")
vectorizer_train.fit_transform(df_train)
vectorizer_train.get_feature_names()
This gives:
['completely',
'hello',
'hi.there',
'how_are',
'is',
'it',
'john-smith',
'nine',
"o'clock",
'random-sequence',
'this',
'you:doing']
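To see why this pattern behaves differently from the default, it can help to try both with plain re.findall, which is essentially what the word analyzer does with token_pattern after lowercasing. The following is only a rough sketch: the tokenize helper and the sample sentence (with an extra comma added) are made up for illustration and are not part of scikit-learn.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
NON_SPACE_PATTERN = r"(?u)\S\S+"    # whitespace-only splitting, as suggested above


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, then return every match of pattern."""
    return re.findall(pattern, text.lower())


# Sample sentence with a comma added to show how punctuation is handled
sample = "Hello John-Smith, it is nine o'clock"

default_tokens = tokenize(sample, DEFAULT_PATTERN)
non_space_tokens = tokenize(sample, NON_SPACE_PATTERN)

print(default_tokens)    # hyphen and apostrophe act as word boundaries
print(non_space_tokens)  # only whitespace separates tokens
```

One thing to keep in mind: because \S matches any non-whitespace character, punctuation that trails a word (the comma after "John-Smith" in this sample) stays attached to the token.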
To keep only hyphenated compounds together, you can use (?u)\b\w[\w-]*\w\b instead, which gives:
['clock',
'completely',
'doing',
'hello',
'hi',
'how_are',
'is',
'it',
'john-smith',
'nine',
'random-sequence',
'there',
'this',
'you']
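If you want something in between (some punctuation joins words, the rest splits them), the same idea can be generalized by putting the joining characters into the character class. This is just a sketch under that assumption: make_token_pattern is a hypothetical helper, not a scikit-learn function, and the sample text is invented.

```python
import re


def make_token_pattern(joiners="-"):
    """Hypothetical helper: build a pattern that keeps words joined by `joiners`.

    A token must start and end with a word character, so punctuation at the
    edges of a token is dropped, and tokens are at least two characters long.
    """
    # re.escape protects characters such as '-' that are special inside [...]
    inner = r"[\w" + re.escape(joiners) + r"]"
    return r"(?u)\b\w" + inner + r"*\w\b"


sample = "Hi.there How_are you:doing John-Smith.".lower()

hyphen_only = re.findall(make_token_pattern("-"), sample)
several = re.findall(make_token_pattern("-.:"), sample)

print(hyphen_only)
print(several)
```

The resulting string can be passed as token_pattern=make_token_pattern("-.:") when constructing the TfidfVectorizer. Note that the underscore is always kept, because \w already matches it.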