Trying to find the word or bigram with the highest tf-idf from a SparseVector output in PySpark

Asked: 2017-03-28 06:37:07

Tags: python pyspark tf-idf

I have been implementing the TF-IDF method in Python/PySpark with the ml feature package. I have a set of six text documents, and I use the code below to get the tf-idf of each bigram, but the output is a SparseVector, so I cannot find the bigram with the highest tf-idf in each book. In other words, I want to find the maximum tf-idf value and then use it to look up the corresponding word or bigram. Any helpful suggestions?

from pyspark import SparkConf,SparkContext
from operator import add
from pyspark.sql import SparkSession
import re
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, NGram
from pyspark.sql.functions import *

conf = SparkConf()
conf.setAppName("wordCount")
conf.set("spark.executor.memory","1g")

def removePunctuation(text):
    # keep only lowercase letters and spaces
    return re.sub('[^a-z ]', '', text.strip().lower())

def wholeFile(x):
    # maps a (path, contents) pair to (word, book-name) pairs;
    # not actually used in the pipeline below
    name = x[0]
    name = name.split('_')[1].split('/')[2]
    words = re.sub('[^a-z0-9]+', ' ', x[1].lower()).split()
    return [(word, name) for word in words]


sc=SparkContext(conf = conf)
text=sc.wholeTextFiles("/cosc6339_s17/books-shortlist/*")
text = text.map(lambda x: (x[0].split('_')[1].split('/')[2],
                           removePunctuation(x[1])))

spark = SparkSession(sc)      # creating the session attaches toDF to RDDs
hasattr(text, "toDF")         # sanity check that the RDD can be converted
wordDataFrame = text.toDF(["title", "book"])
tokenizer = Tokenizer(inputCol="book", outputCol="words")
wordsData = tokenizer.transform(wordDataFrame)
ngram = NGram(n=2,inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordsData)

# note: this hashes the unigram "words" column; use inputCol="ngrams" to score the bigrams
hashingTF = HashingTF(inputCol="words", outputCol="tf")
featurizedData = hashingTF.transform(ngramDataFrame)


idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

Part of my output looks like this:

 (u'30240', SparseVector(262144, {14: 0.3365, 509: 0.8473, 619: 0.5596, 1889: 
 0.8473, 2325: 0.1542, 2624: 0.8473, 2710: 0.5596, 2937: 1.2528, 3091: 1.2528, 
 3193: 1.2528, 3483: 1.2528, 3575: 1.2528, 3910: 1.2528, 3924: 0.6729, 4081: 
 0.6729, 4200: 0.0, 4378: 1.2528, 4774: 1.2528, 4783: 1.2528, 4868: 1.2528, 
 4869: 2.5055, 5213: 1.2528, 5232: 1.1192, 5381: 0.0, 5595: 0.8473, 5758: 
 1.2528, 5823: 1.2528, 6183: 5.5962, 6267: 1.2528, 6355: 0.8473, 6383: 1.2528, 
 6981: 0.3365, 7289: 1.2528, 8023: 1.2528, 8073: 0.8473, 8449: 0.0, 8733: 
 5.0111, 8804: 0.5596, 8854: 1.2528, 9001: 1.2528, 9129: 0.0, 9287: 1.2528, 
 9639: 0.0, 9988: 1.6946, 10409: 0.8473, 11104: 1.0094, 11501: 1.2528, 11951: 
 0.5596, 12247: 0.8473, 12312: 1.2528, 12399: 0.0, 12526: 1.2528, 12888: 
 1.2528, 12925: 0.8473, 13142: 0.6729, 

1 Answer:

Answer 0 (score: 0)

When you use the HashingTF transformer, your text input is hashed with a hash function. The problem with hashing is that the original input cannot be recovered from the hashed indices.


It suffers from potential hash collisions, where different raw features may become the same term after hashing. See the Spark documentation.
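As a minimal illustration of this point (my own sketch, not Spark code), here is the idea of hashing terms into a fixed number of buckets; Python's built-in hash stands in for the MurmurHash that Spark's transformer actually uses:

num_features = 16      # deliberately tiny so that collisions are likely

def bucket(term):
    # stand-in for HashingTF's internal hash; illustrative only
    return hash(term) % num_features

for term in ["spark", "hadoop", "tfidf", "bigram", "vector", "python"]:
    print(term, "->", bucket(term))
# several terms can land in the same bucket, and a bucket index alone
# never tells you which term produced it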

Therefore, you are better off using CountVectorizer instead of HashingTF. CountVectorizer counts term occurrences (term frequencies) without hashing the terms, so the original vocabulary is kept and can be extracted like this:

from pyspark.ml.feature import CountVectorizer

countVect = CountVectorizer(inputCol="words", outputCol="tf", minDF=2.0)   # use inputCol="ngrams" to score bigrams
model = countVect.fit(wordsData)
result = model.transform(wordsData)
model.vocabulary
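For reference, a short sketch (my addition, assuming the model and result above) of how the vocabulary lines up with the vector positions: vocabulary[i] is the term counted at index i of the "tf" SparseVector, so unlike HashingTF the indices can be mapped back to terms.

vocab = model.vocabulary                       # terms, most frequent first
first_tf = result.select("tf").first()["tf"]   # SparseVector for one book
for idx, count in zip(first_tf.indices, first_tf.values):
    print(vocab[idx], count)                   # term and its raw count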

Then you can compute the idf on top of the CountVectorizer output:

idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)
rescaledData.select("title", "idf")    # the column names defined above

I am not sure this is the best way, but it works :) Convert the DataFrame to pandas, then take the tf-idf vector of one row and join it with the model's vocabulary.

rescaled_pd = rescaledData.toPandas()
rescaled_pd

Now select the top 100 by tf-idf value (or by count):

import pandas as pd

inputrow = rescaled_pd.iloc[0]      # one book, i.e. one row of the DataFrame
tf_idf_per_word = pd.DataFrame({'tf_idf': inputrow['idf'].toArray(), 'vocabulary': model.vocabulary}).sort_values('tf_idf', ascending=False)
tf_idf_per_word[tf_idf_per_word.tf_idf > 0.1]
tf_idf_per_word = tf_idf_per_word[0:100]
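
As an alternative (my own sketch, not part of the original answer, and assuming the CountVectorizer-based pipeline above), the word or bigram with the highest tf-idf per book can also be picked straight from each SparseVector, without going through pandas:

import numpy as np

vocab = model.vocabulary
for row in rescaledData.select("title", "idf").collect():
    vec = row["idf"]                         # SparseVector of tf-idf scores
    if len(vec.values) == 0:                 # skip books with no surviving terms
        continue
    top = int(np.argmax(vec.values))         # position of the largest score
    print(row["title"], vocab[vec.indices[top]], vec.values[top])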