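I have searched various forums but could not find an answer that fits my problem.
I am using pymongo's bson.code to run a map-reduce over a collection of documents named MediaSummary in MongoDB. The resulting collection should be a list of unique (non-stop) words, each with its associated document IDs.
For example, the collection data looks like this: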
{_id: "abc", VideoData: "This video contains documentary info about Universe."}
{_id: "def", VideoData: "This video is about Milkyway."}
{_id: "ghi", VideoData: "This video is about Earth and its importance."}
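To make the goal concrete, here is a rough pure-Python sketch of the result I am after for the three sample documents above. It does not touch MongoDB or NLTK: `STOP_WORDS` is a small hand-picked stand-in for NLTK's English stop-word list, and `build_index` is a hypothetical helper name used only for illustration.

```python
import re

# Assumption: a tiny hand-picked stop-word set, standing in for NLTK's English list.
STOP_WORDS = {"this", "is", "about", "and", "its"}

# The three sample documents from the MediaSummary collection.
docs = [
    {"_id": "abc", "VideoData": "This video contains documentary info about Universe."},
    {"_id": "def", "VideoData": "This video is about Milkyway."},
    {"_id": "ghi", "VideoData": "This video is about Earth and its importance."},
]

def build_index(documents):
    """Map each unique non-stop word to the list of document IDs containing it."""
    index = {}
    for doc in documents:
        # Lowercase the text and keep alphabetic runs only (drops punctuation).
        words = re.findall(r"[a-z]+", doc["VideoData"].lower())
        for word in words:
            if word in STOP_WORDS:
                continue
            ids = index.setdefault(word, [])
            if doc["_id"] not in ids:
                ids.append(doc["_id"])
    return index

index = build_index(docs)
```

With this sketch, a word such as "video" would map to all three document IDs, while "universe" would map only to "abc". What I want is the same shape of output, but produced by map-reduce inside MongoDB.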
Python code:
import pymongo
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from pymongo import MongoClient
from bson.code import Code
client = MongoClient('mongodb://myuser:abc123@my-server/DataSummary')
db = client['DataSummary']
def getFilteredSentence(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)
    words = [word for word in word_tokens if word.isalpha()]  # remove punctuation
    filteredSentence = []
    for w in words:
        if (w and w.strip()):
            if w.lower() not in stop_words:
                filteredSentence.append(w.strip().lower())
    return filteredSentence
Mapper function:
mapper = Code("""
    function map() {
        if (!this.VideoData) return;
        var key = this._id.toString();
        var value = getFS(this.VideoData); // getFS is a Python function passed to it via scope. It filters out the stop words from sentence.
        emit(key, value);
    };
"""
)
Reducer function:
reducer = Code("""
    function reduce(key, values) {
        var uniqueWordsWithDocIds = [];
        // Some further processing on values
        return uniqueWordsWithDocIds;
    };
    """
)
Calling map_reduce on the MediaSummary collection:
output_collection = db.MediaSummary.map_reduce(
    mapper, reducer, 'SearchWords',
    scope = {
        'getFS': somePyThonFuncForFilteringOutStopWords  # This must be a Python function that I can make use of in my mapper function
    })
Any help would be appreciated. Please advise.