I have been using the ml package to implement TF-IDF in Python/PySpark. I have a set of 6 text documents, and the code below is meant to produce the tf-idf of each bigram, but the output is a SparseVector and I cannot find the bigram with the highest tf-idf in each book. In other words, what I want is to find the maximum tf-idf value and then use it to look up the corresponding term. Any helpful suggestions?
from pyspark import SparkConf,SparkContext
from operator import add
from pyspark.sql import SparkSession
import re
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, NGram
from pyspark.sql.functions import *
conf = SparkConf()
conf.setAppName("wordCount")
conf.set("spark.executor.memory","1g")
def removePunctuation(text):
    return re.sub('[^a-z| ]', '', text.strip().lower())
def wholeFile(x):
    name = x[0]
    name = name.split('_')[1].split('/')[2]
    words = re.sub('[^a-z0-9]+', ' ', x[1].lower()).split()
    return [(word, name) for word in list(words)]
sc=SparkContext(conf = conf)
text=sc.wholeTextFiles("/cosc6339_s17/books-shortlist/*")
text = text.map(lambda x: (x[0].split('_')[1].split('/')[2],
                           removePunctuation(x[1])))
spark = SparkSession(sc)
hasattr(text, "toDF")
wordDataFrame=text.toDF(["title","book"])
tokenizer = Tokenizer(inputCol="book", outputCol="words")
wordsData = tokenizer.transform(wordDataFrame)
ngram = NGram(n=2,inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordsData)
# note: inputCol="words" hashes the unigram tokens; use inputCol="ngrams" to score the bigrams built above
hashingTF = HashingTF(inputCol="words", outputCol="tf")
featurizedData = hashingTF.transform(ngramDataFrame)
idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
Part of my output looks like this:
(u'30240', SparseVector(262144, {14: 0.3365, 509: 0.8473, 619: 0.5596, 1889:
0.8473, 2325: 0.1542, 2624: 0.8473, 2710: 0.5596, 2937: 1.2528, 3091: 1.2528,
3193: 1.2528, 3483: 1.2528, 3575: 1.2528, 3910: 1.2528, 3924: 0.6729, 4081:
0.6729, 4200: 0.0, 4378: 1.2528, 4774: 1.2528, 4783: 1.2528, 4868: 1.2528,
4869: 2.5055, 5213: 1.2528, 5232: 1.1192, 5381: 0.0, 5595: 0.8473, 5758:
1.2528, 5823: 1.2528, 6183: 5.5962, 6267: 1.2528, 6355: 0.8473, 6383: 1.2528,
6981: 0.3365, 7289: 1.2528, 8023: 1.2528, 8073: 0.8473, 8449: 0.0, 8733:
5.0111, 8804: 0.5596, 8854: 1.2528, 9001: 1.2528, 9129: 0.0, 9287: 1.2528,
9639: 0.0, 9988: 1.6946, 10409: 0.8473, 11104: 1.0094, 11501: 1.2528, 11951:
0.5596, 12247: 0.8473, 12312: 1.2528, 12399: 0.0, 12526: 1.2528, 12888:
1.2528, 12925: 0.8473, 13142: 0.6729,
Answer 0 (score: 0)
When you use the HashingTF transformer, your text input is hashed with a hash function. The problem with hashing is that the original input cannot be recovered.
It also suffers from potential hash collisions, where different raw features can end up as the same term after hashing; see the Spark documentation.
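For example, a minimal sketch (reusing the SparkSession created above) makes the point: all you get back are bucket indices, which cannot be mapped back to the original terms, and with few buckets distinct terms may even share an index.
from pyspark.ml.feature import HashingTF
demo = spark.createDataFrame([(["the", "quick", "brown", "fox"],)], ["words"])  # one tokenized toy document
demo_tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=16)  # tiny bucket count to make collisions likely
demo_tf.transform(demo).show(truncate=False)  # shows only bucket indices and counts, no terms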
Because of this, you are better off using CountVectorizer instead of HashingTF. CountVectorizer counts term occurrences (term frequencies) without hashing the terms, so the original vocabulary is kept and can be retrieved like this:
from pyspark.ml.feature import CountVectorizer

# use inputCol="ngrams" instead if you want bigram counts rather than unigram counts
countVect = CountVectorizer(inputCol="words", outputCol="tf", minDF=2.0)
model = countVect.fit(wordsData)
result = model.transform(wordsData)
model.vocabulary
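Each position in the resulting count (and later tf-idf) vectors corresponds to an entry in model.vocabulary, so a vector index can be translated back into a term. A quick sketch:
vocab = model.vocabulary   # vector index i corresponds to vocab[i]
print(len(vocab))          # vocabulary size
print(vocab[:10])          # first ten terms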
Then you can compute the IDF on top of the count vectors:
idf = IDF(inputCol="tf", outputCol="idf")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)
rescaledData.select("title", "idf").show()
I'm not sure whether this is the best approach, but it works :) Convert the DataFrame to pandas, then take a row's feature vector and combine it with the model vocabulary:
rescaled_pd = rescaledData.toPandas()
rescaled_pd
Now select the top 100 terms by tf-idf value (or by count):
import pandas as pd

inputrow = rescaled_pd.iloc[0]  # one book (row); its "idf" column holds the tf-idf SparseVector
tf_idf_per_word = pd.DataFrame({'tf_idf': inputrow['idf'].toArray(), 'vocabulary': model.vocabulary}).sort_values('tf_idf', ascending=False)
tf_idf_per_word[tf_idf_per_word.tf_idf > 0.1]
tf_idf_per_word = tf_idf_per_word[0:100]
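To get back to the original question, the same lookup can be repeated for every book. Here is a sketch continuing from the snippet above (the names top_terms and per_word are just illustrative); it yields the highest-weighted term per book, which will be a bigram if the CountVectorizer was fitted on the "ngrams" column:
top_terms = {}
for _, row in rescaled_pd.iterrows():
    per_word = pd.DataFrame({'tf_idf': row['idf'].toArray(), 'vocabulary': model.vocabulary})
    top_terms[row['title']] = per_word.nlargest(1, 'tf_idf')['vocabulary'].iloc[0]
print(top_terms)   # book title -> term with the highest tf-idf weight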