Question

I have a list of strings that looks like this:

df_train = ["Hello John-Smith it is nine o'clock", "This is a completely random-sequence"]

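I would like sklearn's TfidfVectorizer to treat words joined with a hyphen as a single word. When I apply the following code, words separated by a hyphen (or other punctuation) are treated as separate words: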

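from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\b\w\w+\b")  # the default pattern

vectorizer_train.fit_transform(df_train)
# scikit-learn >= 1.0: use get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer_train.get_feature_names()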

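I tried changing the token_pattern parameter, but without success. Any idea how to solve this? Also, is it possible to treat words separated by any punctuation as a single entity (e.g. 'Hi.there How_are you:doing')?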

Answer 1

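It seems you only need to split on whitespace. Try switching the pattern to (?u)\S\S+, which captures any run of two or more consecutive non-whitespace characters as a single word: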

from sklearn.feature_extraction.text import TfidfVectorizer

df_train = ["Hello John-Smith it is nine o'clock",
            "This is a completely random-sequence",
            "Hi.there How_are you:doing"]

vectorizer_train = TfidfVectorizer(analyzer='word',
                                   min_df=0.0,
                                   max_df=1.0,
                                   strip_accents=None,
                                   encoding='utf-8',
                                   preprocessor=None,
                                   token_pattern=r"(?u)\S\S+")  # split on whitespace only

vectorizer_train.fit_transform(df_train)
# scikit-learn >= 1.0: use get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer_train.get_feature_names()

This gives:

['completely',
 'hello',
 'hi.there',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 "o'clock",
 'random-sequence',
 'this',
 'you:doing']
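
To see why the two patterns behave differently without fitting a vectorizer, here is a minimal sketch using only the standard re module. It assumes the word analyzer essentially lowercases the text and applies token_pattern with findall; the tokenize helper is a hypothetical name introduced for illustration and is not part of scikit-learn.

```python
import re

# Patterns taken from the question (default) and the answer (whitespace split).
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
WHITESPACE_PATTERN = r"(?u)\S\S+"


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, then extract tokens with the pattern.

    This only approximates what analyzer='word' does with token_pattern.
    """
    return re.findall(pattern, text.lower())


sample = "Hi.there How_are you:doing"

# The default pattern breaks tokens at every non-word character.
print(tokenize(sample, DEFAULT_PATTERN))
# ['hi', 'there', 'how_are', 'you', 'doing']

# The whitespace pattern keeps punctuation-joined words together.
print(tokenize(sample, WHITESPACE_PATTERN))
# ['hi.there', 'how_are', 'you:doing']

# Caveat: trailing punctuation stays attached to the token as well.
print(tokenize("It is nine o'clock, John-Smith.", WHITESPACE_PATTERN))
# ['it', 'is', 'nine', "o'clock,", 'john-smith.']
```

The last call shows a side effect worth keeping in mind: with (?u)\S\S+, a comma or full stop at the end of a word becomes part of the token, so 'john-smith' and 'john-smith.' would end up as different features.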

To respect only hyphenated compounds, you can use (?u)\b\w[\w-]*\w\b instead, which gives:

['clock',
 'completely',
 'doing',
 'hello',
 'hi',
 'how_are',
 'is',
 'it',
 'john-smith',
 'nine',
 'random-sequence',
 'there',
 'this',
 'you']
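
The vocabulary above can be reproduced with plain re as a quick check. This is a sketch, assuming the vectorizer lowercases each document and collects the sorted set of tokens matched by token_pattern; the names HYPHEN_PATTERN, docs, and vocab are illustrative only.

```python
import re

# Hyphen-aware pattern from the answer: starts and ends with a word character,
# with word characters or hyphens in between (so at least two characters long).
HYPHEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"

docs = ["Hello John-Smith it is nine o'clock",
        "This is a completely random-sequence",
        "Hi.there How_are you:doing"]

# Lowercase each document, extract tokens, and build a sorted vocabulary.
vocab = sorted({token
                for doc in docs
                for token in re.findall(HYPHEN_PATTERN, doc.lower())})
print(vocab)
# ['clock', 'completely', 'doing', 'hello', 'hi', 'how_are', 'is', 'it',
#  'john-smith', 'nine', 'random-sequence', 'there', 'this', 'you']
```

To use it with the vectorizer, pass token_pattern=HYPHEN_PATTERN in place of the pattern in the code above. Note that [\w-]* also accepts repeated hyphens, so a string like 'well--known' would be kept as one token; tighten the pattern if that matters for your data.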

TfidfVectorizer: respect hyphenated compounds (words joined with a hyphen)
