[PYTHON 3.x] Hi everyone, I'm working on a natural language processing project and could use some help. I have already built a vocabulary (a list of the distinct words) from all of the documents. Now I want to create a vector for each document based on this vocabulary list. (Doc_POS_words contains 100 documents, where Doc_POS_words[0] is the first document, Doc_POS_words[1] is the second, and so on.)
Example:
# Doc_POS_words = [contains all the words of each document, as below]
Doc_POS_words = [
['war', 'life', 'travel', 'live', 'night'],
['books', 'student', 'travel', 'study', 'yellow'],
]
# myVoc = [distinct words from all the documents as below]
myVoc = [
'war',
'life',
'travel',
'live',
'night',
'books',
'student',
'study',
'yellow'
]
# myVoc_vector = [ need this as well ]
# Doc_POS_words_BoW = [need this for each document]
PS: I am not using NLTK, because the language I am working with is not one that NLTK supports.
Thanks.
Answer 0 (score: 0)
Check out TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

# each entry of the corpus is one document as a single string
corpus = ["Doc 1 words",
          "Doc 2 words"]
vectorizer = TfidfVectorizer(min_df=1)
# rows = documents, columns = vocabulary terms (TF-IDF weights)
vectors = vectorizer.fit_transform(corpus)
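If you want plain count (bag-of-words) vectors over your own vocabulary rather than TF-IDF weights, here is a minimal sketch. It assumes your documents are already tokenized as in Doc_POS_words and uses CountVectorizer's vocabulary parameter so the columns line up with myVoc; the variable names are just illustrative:

from sklearn.feature_extraction.text import CountVectorizer

Doc_POS_words = [
    ['war', 'life', 'travel', 'live', 'night'],
    ['books', 'student', 'travel', 'study', 'yellow'],
]
myVoc = ['war', 'life', 'travel', 'live', 'night',
         'books', 'student', 'study', 'yellow']

# join the pre-tokenized words back into one string per document,
# and fix the vocabulary so the columns match myVoc exactly
corpus = [' '.join(words) for words in Doc_POS_words]
vectorizer = CountVectorizer(vocabulary=myVoc)
Doc_POS_words_BoW = vectorizer.fit_transform(corpus).toarray()

# each row is one document; each column is the count of the matching myVoc word
print(Doc_POS_words_BoW)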
Answer 1 (score: 0)
I'm still not sure exactly what you're asking, so I'll offer some pointers. I think what you need here are Python sets.
https://docs.python.org/3/tutorial/datastructures.html#sets
Here are some examples using the data from your question:
# create a set of the whole word list
myVocSet = set(myVoc)

for doc_words in Doc_POS_words:
    # convert from list to set
    doc_words = set(doc_words)
    # want to find words in the doc also in the vocabulary list?
    print(myVocSet.intersection(doc_words))
    # want to find words in your doc not in the vocabulary list?
    print(doc_words.difference(myVocSet))
    # want to find words in the vocab list not used in your doc?
    print(myVocSet.difference(doc_words))
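Since what you ultimately seem to want is a bag-of-words vector per document aligned with myVoc, here is a minimal pure-Python sketch (no NLTK or scikit-learn needed); the names make_bow and Doc_POS_words_BoW are just illustrative:

# one count per vocabulary word, in the same order as myVoc
def make_bow(doc_words, vocabulary):
    return [doc_words.count(word) for word in vocabulary]

Doc_POS_words_BoW = [make_bow(doc, myVoc) for doc in Doc_POS_words]

print(Doc_POS_words_BoW[0])
# e.g. [1, 1, 1, 1, 1, 0, 0, 0, 0] for the first document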
Here is some more help:
>>> x = set(('a', 'b', 'c', 'd'))
>>> y = set(('c', 'd', 'e', 'f'))
>>>
>>> x.difference(y)
{'a', 'b'}
>>> y.difference(x)
{'f', 'e'}
>>> x.intersection(y)
{'c', 'd'}
>>> y.intersection(x)
{'c', 'd'}
>>> x.union(y)
{'a', 'b', 'd', 'f', 'e', 'c'}
>>> x.symmetric_difference(y)
{'a', 'b', 'f', 'e'}