I want to do "text clustering" with k-means and Spark on a large dataset. As you know, before running k-means I have to apply preprocessing steps such as TF-IDF and NLTK to my big dataset. Below is my code in Python:
import csv
import re
import string
import sys
from collections import defaultdict

import nltk
from nltk.stem.porter import PorterStemmer

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    k = 6
    vocab = {}      # word -> feature index
    xs = []         # sparse bag-of-words counts, one per document
    ns = []         # raw news texts
    cat = []        # categories
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    # Regexes to remove special characters and numbers
    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
    remove_num = re.compile(r'\d+')
    # nltk.download('stopwords')  # run once if the stopword corpus is missing
    stop_words = nltk.corpus.stopwords.words('english')
    stemmer = PorterStemmer()

    for a in ns:
        x = defaultdict(float)
        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)   # Remove special characters
        a3 = remove_num.sub("", a2)               # Remove numbers
        # Remove stop words, then stem
        words = a3.split()
        filter_stop_words = [w for w in words if w not in stop_words]
        stemmed = [stemmer.stem(w) for w in filter_stop_words]
        ws = sorted(stemmed)
        # Count term occurrences against a growing vocabulary
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(list(x.items()))
Can someone explain how to perform these preprocessing steps in Spark before running k-means?
Answer 0 (score: 1)
This is in response to user3789843's comment.
Yes. Each stop word goes on a separate line, with no quotes. Sorry, I don't have enough reputation to comment.
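For illustration, here is a minimal sketch (not the answerer's code) of how the same preprocessing and clustering might look with Spark's RDD API and MLlib. It assumes a stop-word file named stopwords.txt with one word per line (as described above), the same '2013-01.csv' layout with the text in column index 3, and k=6 from the question; NLTK stemming could be added inside tokenize() if NLTK is installed on the worker nodes. With the hashing trick there is no need to build a global vocabulary dictionary on the driver.

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans
import csv
import re
import string

sc = SparkContext(appName="TextClustering")

# Broadcast the stop-word set so every executor can filter tokens locally.
stop_words = set(sc.textFile("stopwords.txt").collect())
stop_bc = sc.broadcast(stop_words)

punct_re = re.compile('[%s]' % re.escape(string.punctuation))
num_re = re.compile(r'\d+')

def tokenize(line):
    # Naive per-line CSV parsing (assumes no embedded newlines);
    # keep only the news text in column index 3.
    text = next(csv.reader([line]))[3]
    text = punct_re.sub(" ", text.strip().lower())
    text = num_re.sub("", text)
    return [w for w in text.split() if w not in stop_bc.value]

tokens = sc.textFile("2013-01.csv").map(tokenize)

# TF-IDF with the hashing trick
tf = HashingTF().transform(tokens)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Cluster the TF-IDF vectors with MLlib's k-means.
model = KMeans.train(tfidf, k=6, maxIterations=10)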