Question

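I am trying to compute a bag of words. I have a pandas DataFrame with a text column, which I tokenize, strip of stop words, and stem. In the end, for each document, I have a list of strings.

My ultimate goal is to compute a bag of words for this column. I have seen that scikit-learn has a function that does this, but it works on strings, not on lists of strings.

I am doing the preprocessing with NLTK and would like to keep it that way...

Is there a way to compute a bag of words from a list of tokens? For example, something like: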

["hello", "world"]
["hello", "stackoverflow", "hello"]

should be converted to

[1, 1, 0]
[2, 0, 1]

with the vocabulary:

["hello", "world", "stackoverflow"]

Answer 1

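You can build a Counter for each row, keeping only the tokens that are in the vocabulary, then create a DataFrame from the counts and convert it to lists: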

from collections import Counter

import pandas as pd

df = pd.DataFrame({'text': [["hello", "world"],
                            ["hello", "stackoverflow", "hello"]]})

# Vocabulary, in the desired column order
L = ["hello", "world", "stackoverflow"]

# Count only the tokens that belong to the vocabulary
f = lambda x: Counter([y for y in x if y in L])
df['new'] = (pd.DataFrame(df['text'].apply(f).values.tolist())
               .reindex(columns=L)  # vocabulary order; unseen words become NaN
               .fillna(0)
               .astype(int)
               .values
               .tolist())
print(df)

                            text        new
0                 [hello, world]  [1, 1, 0]
1  [hello, stackoverflow, hello]  [2, 0, 1]
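
If the vocabulary is not known ahead of time, the same Counter idea also works without pandas. Here is a minimal sketch, assuming the vocabulary should follow the order in which tokens first appear. The helper name bag_of_words is hypothetical, not part of any library:

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    # Hypothetical helper: vocabulary in order of first appearance.
    vocab = list(dict.fromkeys(tok for doc in docs for tok in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # A Counter returns 0 for tokens missing from this document.
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

This keeps everything as plain Python lists, which is probably only reasonable for small vocabularies. For larger corpora a sparse representation is usually a better fit.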

Answer 2

sklearn.feature_extraction.text.CountVectorizer can help a lot here. This is the example from the official documentation:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X.toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

You can get the feature names with vectorizer.get_feature_names_out() (vectorizer.get_feature_names() in scikit-learn versions before 1.0).
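
Since the question starts from token lists rather than raw strings, CountVectorizer can also be given a callable analyzer, which scikit-learn then uses in place of its own preprocessing and tokenization. Here is a sketch, assuming the NLTK output looks like the lists in the question. The fixed vocabulary is passed only to reproduce the column order asked for; without it, the columns are sorted alphabetically:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed input: documents already tokenized (e.g. by NLTK).
docs = [["hello", "world"], ["hello", "stackoverflow", "hello"]]

# A callable analyzer receives each raw document and returns its features,
# so returning the token list unchanged skips the built-in tokenization.
vectorizer = CountVectorizer(
    analyzer=lambda tokens: tokens,
    vocabulary=["hello", "world", "stackoverflow"],
)
X = vectorizer.fit_transform(docs)
counts = X.toarray().tolist()
print(counts)
```

With a DataFrame like the one in the question, the column of token lists (for example df['text']) can presumably be passed in place of docs, which keeps the NLTK preprocessing untouched.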

Answer 3

Use sklearn.feature_extraction.text.CountVectorizer


Python - From a list of tokens to a bag of words
