I am new to PySpark. I am playing around with tf-idf and just wanted to check whether PySpark and scikit-learn give the same results. They do not. Here is what I did.
First, I computed tf-idf on the PySpark side:
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# create the PySpark dataframe
sentenceData = sqlContext.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
]).toDF("label", "sentence")

# tokenize
tokenizer = Tokenizer().setInputCol("sentence").setOutputCol("words")
wordsData = tokenizer.transform(sentenceData)

# vectorize (term counts)
vectorizer = CountVectorizer(inputCol='words', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)

# calculate tf-idf scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)

# UDF to convert the sparse tf-idf vectors to dense vectors
def to_dense(in_vec):
    return DenseVector(in_vec.toArray())

to_dense_udf = udf(lambda x: to_dense(x), VectorUDT())

# add a dense-vector column
wordsData = wordsData.withColumn("tfidf_features_dense", to_dense_udf('tfidf_features'))
I then converted the PySpark DataFrame to pandas:
wordsData_pandas = wordsData.toPandas()
and computed tf-idf with sklearn on the same tokenized words:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

# identity tokenizer/preprocessor: the words column is already tokenized
def dummy_fun(doc):
    return doc

# create sklearn tfidf
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)

# transform and get tf-idf scores
feature_matrix = tfidf.fit_transform(wordsData_pandas.words)

# create sklearn dtm matrix
sklearn_tfidf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())

# create PySpark dtm matrix
spark_tfidf = pd.DataFrame([np.array(i) for i in wordsData_pandas.tfidf_features_dense],
                           columns=vectorizer.vocabulary)
But unfortunately, this is what I get from PySpark:
|   | i | are | logistic | case | spark | hi | about | neat | could | regression | wish | use | heard | classes | java | models |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.287682 | 0.000000 | 0.000000 | 0.000000 | 0.693147 | 0.693147 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693147 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.287682 | 0.000000 | 0.000000 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693147 | 0.000000 | 0.693147 | 0.693147 | 0.000000 | 0.693147 | 0.693147 | 0.000000 |
| 2 | 0.000000 | 0.693147 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693147 | 0.000000 | 0.693147 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.693147 |
and this is what I get from sklearn:

|   | i | are | logistic | case | spark | hi | about | neat | could | regression | wish | use | heard | classes | java | models |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.355432 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.467351 | 0.467351 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.467351 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.296520 | 0.000000 | 0.000000 | 0.389888 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.389888 | 0.000000 | 0.389888 | 0.389888 | 0.000000 | 0.389888 | 0.389888 | 0.000000 |
| 2 | 0.000000 | 0.447214 | 0.447214 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.447214 | 0.000000 | 0.447214 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.447214 |

I did try the use_idf parameter, but the two still don't give the same result. What am I missing? Any help is appreciated. Thanks in advance.
Answer 0 (score: 7)
This is because the IDF is computed slightly differently in the two libraries.
From the sklearn documentation (with the default smooth_idf=True):

idf(t) = ln((1 + n) / (1 + df(t))) + 1

where n is the total number of documents and df(t) is the number of documents containing term t.
Compare with the pyspark documentation:

IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
Apart from the extra 1 added to the IDF (the trailing + 1 above), sklearn's TF-IDF also applies an l2 norm, which pyspark does not:
TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
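To make the difference concrete, here is a minimal sketch (plain Python standard library, variable names are mine for illustration) that recomputes the numbers for the question's first sentence, "Hi I heard about Spark", from the two formulas above:

import math

n_docs = 3.0    # total number of documents
df_rare = 1.0   # "hi", "heard", "about", "spark" each occur in 1 document
df_i = 2.0      # "i" occurs in 2 documents; tf is 1 for every token in sentence 0

# PySpark: tf * log((|D| + 1) / (df + 1)), with no normalisation afterwards
print(math.log((n_docs + 1) / (df_rare + 1)))   # ≈ 0.693147, as in the PySpark table
print(math.log((n_docs + 1) / (df_i + 1)))      # ≈ 0.287682

# sklearn (smooth_idf=True): tf * (ln((1 + n) / (1 + df)) + 1), then l2-normalised per row
row = [math.log((1 + n_docs) / (1 + df)) + 1 for df in (df_i, df_rare, df_rare, df_rare, df_rare)]
l2 = math.sqrt(sum(v * v for v in row))
print([v / l2 for v in row])   # ≈ [0.355432, 0.467351, 0.467351, 0.467351, 0.467351]

The PySpark values are just the raw idf (term frequency is 1 everywhere here), while the sklearn values only match once both the trailing + 1 and the row-wise l2 normalisation are applied.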
Answer 1 (score: 4)
The Python (sklearn) and Pyspark implementations of the tf-idf score are the same. Refer to the same Sklearn documentation, but to the next line there: the main difference between them is that Sklearn uses the l2 norm by default, which is not the case with Pyspark. If you set the norm to None, you will get the same result in sklearn as well.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

corpus = ["I heard about Spark",
          "I wish Java could use case classes",
          "Logistic regression models are neat"]
# pre-tokenize, as in the question
corpus = [sent.lower().split() for sent in corpus]

def dummy_fun(doc):
    return doc

# norm=None disables the l2 normalisation that sklearn applies by default
tfidfVectorizer = TfidfVectorizer(norm=None, analyzer='word',
                                  tokenizer=dummy_fun, preprocessor=dummy_fun,
                                  token_pattern=None)

tf = tfidfVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(), columns=tfidfVectorizer.get_feature_names())
tf_df
See my answer here to understand how the norm works with the tf-idf vectorizer.
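As a rough sketch of what that norm does (assuming the tf matrix from the snippet above is still in scope), l2 normalisation simply divides each row by its Euclidean length; this is the extra step TfidfVectorizer applies when norm='l2', its default:

import numpy as np
from sklearn.preprocessing import normalize

dense = tf.toarray()                                   # un-normalised tf-idf from norm=None above
lengths = np.linalg.norm(dense, axis=1, keepdims=True)
manual_l2 = dense / lengths                            # divide each row by its Euclidean length

# the same thing via sklearn's helper
assert np.allclose(manual_l2, normalize(dense, norm='l2'))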