TfidfVectorizer to respect hyphenated compounds (words joined with a hyphen)

Time: 2017-08-21 21:13:00

Tags: python regex scikit-learn token tf-idf

I have a list of strings that looks like this:

df_train = ["Hello John-Smith it is nine o'clock", "This is a completely random-sequence"]

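I would like sklearn's TfidfVectorizer to treat words joined with a hyphen as a single word. When I apply the following code, words separated by a hyphen (or other punctuation) are treated as separate words: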

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\b\w\w+\b")

vectorizer_train.fit_transform(df_train)
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
vectorizer_train.get_feature_names()

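I tried changing the token_pattern parameter, without success. Any idea how to solve this? Also, is it possible to treat words separated by any punctuation as a single entity (for example, 'Hi.there How_are you:doing')?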

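1 Answer:

Answer 0: (score: 2)

It seems you only need to split on whitespace. Try switching the pattern to (?u)\S\S+, which captures each run of two or more consecutive non-whitespace characters as one word: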

from sklearn.feature_extraction.text import TfidfVectorizer

df_train = ["Hello John-Smith it is nine o'clock",
            "This is a completely random-sequence",
            "Hi.there How_are you:doing"]

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\S\S+")

vectorizer_train.fit_transform(df_train)
# On scikit-learn 1.0 and later, use get_feature_names_out() instead.
vectorizer_train.get_feature_names()

This gives:

['completely',
 'hello',
 'hi.there',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 "o'clock",
 'random-sequence',
 'this',
 'you:doing']
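
To see what each pattern matches without fitting a vectorizer, here is a minimal sketch using only the standard re module. It assumes the word analyzer behaves roughly like "lowercase, then findall with token_pattern"; the tokenize helper is hypothetical and not part of scikit-learn.

```python
import re

def tokenize(text, token_pattern):
    # Hypothetical helper, not part of scikit-learn: lowercase the text,
    # then collect every regex match, roughly like the default word analyzer.
    return re.findall(token_pattern, text.lower())

sample = "Hi.there How_are you:doing"

default_tokens = tokenize(sample, r"(?u)\b\w\w+\b")
whitespace_tokens = tokenize(sample, r"(?u)\S\S+")
# Punctuation next to a word stays attached under the whitespace pattern.
trailing_tokens = tokenize("Hello, John-Smith.", r"(?u)\S\S+")

print(default_tokens)     # ['hi', 'there', 'how_are', 'you', 'doing']
print(whitespace_tokens)  # ['hi.there', 'how_are', 'you:doing']
print(trailing_tokens)    # ['hello,', 'john-smith.']
```

Note the last line: because \S matches any non-whitespace character, commas and periods adjacent to a word become part of the token, which may or may not be what you want.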

To respect only hyphenated compounds, you can use (?u)\b\w[\w-]*\w\b, which gives:

['clock',
 'completely',
 'doing',
 'hello',
 'hi',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 'random-sequence',
 'there',
 'this',
 'you']
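
The hyphen-aware pattern can be checked the same way. This is a sketch under the same assumption as before (lowercasing followed by findall); tokenize and HYPHEN_PATTERN are illustrative names, not scikit-learn API.

```python
import re

HYPHEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"

def tokenize(text, token_pattern=HYPHEN_PATTERN):
    # Hypothetical helper, not part of scikit-learn: lowercase the text,
    # then collect every regex match.
    return re.findall(token_pattern, text.lower())

compound_tokens = tokenize("Hello John-Smith it is nine o'clock")
# Stray, leading, and trailing hyphens are not kept, and neither are
# single-character words, since a token must start and end with \w.
edge_tokens = tokenize("well-known end- -start a - b")

print(compound_tokens)  # ['hello', 'john-smith', 'it', 'is', 'nine', 'clock']
print(edge_tokens)      # ['well-known', 'end', 'start']
```

Each token must begin and end with a word character and be at least two characters long, so a hyphen only survives when it sits inside a word. That is also why "o'clock" yields just 'clock' here.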