I'm trying to write a function called "words_in_texts" that produces a result like this:
words_in_texts(['hello', 'bye', 'world'],
               pd.Series(['hello', 'hello world hello']))
array([[1, 0, 0],
       [1, 0, 1]])
I believe the function should take two arguments: a list containing all the words, and a Series of texts.
def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in

    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    indicator_array = texts.str.contains(words)
    return indicator_array
I'm confused about how to build the 2D array result. Can anyone help me with this? Thanks in advance!
Answer 0 (score: 2)
Use sklearn.feature_extraction.text.CountVectorizer:
In [52]: from sklearn.feature_extraction.text import CountVectorizer
In [53]: vect = CountVectorizer(vocabulary=['hello', 'bye', 'world'], binary=True)
In [54]: X = vect.fit_transform(pd.Series(['hello', 'hello world hello']))
The result is a sparse matrix:
In [55]: X
Out[55]:
<2x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
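You can convert it to a dense array with toarray() (the X.A shorthand does the same thing on older SciPy releases):
In [56]: X.toarray()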
Out[56]:
array([[1, 0, 0],
       [1, 0, 1]], dtype=int64)
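The steps up to this point can be wrapped into the function from the question. This is a minimal sketch rather than part of the original answer. It assumes scikit-learn is installed and that CountVectorizer's default behaviour (lowercasing, whole-token matching) is what is wanted:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def words_in_texts(words, texts):
    """Return an array of 0s and 1s with shape (n_texts, n_words)."""
    # A fixed vocabulary keeps the columns in the same order as `words`.
    # binary=True records presence (1) instead of the number of occurrences.
    # Note: the default tokenizer lowercases text and skips one-character tokens.
    vect = CountVectorizer(vocabulary=list(words), binary=True)
    # fit_transform returns a SciPy sparse matrix; toarray() makes it dense.
    return vect.fit_transform(texts).toarray()


result = words_in_texts(['hello', 'bye', 'world'],
                        pd.Series(['hello', 'hello world hello']))
print(result)
```

The second row has a 1 in the "hello" column even though "hello" appears twice, because binary=True caps every count at 1.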
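Feature names (the column labels). On scikit-learn 1.0 and later use get_feature_names_out(); the older get_feature_names() returned a plain list and was removed in 1.2:
In [57]: vect.get_feature_names_out()
Out[57]: array(['hello', 'bye', 'world'], dtype=object)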
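If you would rather avoid the scikit-learn dependency, a pandas/NumPy-only version is also possible. The sketch below is an alternative to the CountVectorizer approach above, not the same method. It relies on Series.str.contains expecting a single pattern rather than a list, so it calls it once per word and stacks the resulting columns. It assumes plain substring matching is acceptable, so 'hell' would also match 'hello':

```python
import numpy as np
import pandas as pd


def words_in_texts(words, texts):
    """Return an array of 0s and 1s with shape (n_texts, n_words)."""
    # One boolean column per word. regex=False treats the word as a literal
    # substring; na=False makes missing texts count as "not found".
    columns = [texts.str.contains(word, regex=False, na=False) for word in words]
    # Place the columns side by side, then cast True/False to 1/0.
    return np.column_stack(columns).astype(int)


result = words_in_texts(['hello', 'bye', 'world'],
                        pd.Series(['hello', 'hello world hello']))
print(result)
```

Unlike CountVectorizer, this version is case-sensitive and does not respect word boundaries, so pick whichever matching rule fits your data.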