A conceptual question about TF-IDF in pyspark

Time: 2018-12-18 23:38:05

Tags: pyspark tf-idf

The official pyspark documentation includes a tf-idf example.

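I have also come across similar code in other sources. My question is: why is the resulting DataFrame named tfidf? Does the result equal tf * idf, or does it store only idf? If it stores only idf, how can I compute tf * idf?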

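1 answer:

Answer 0: (score: 0)

As stated in the documentation, HashingTF is a Transformer: it takes sets of tokens and produces term-frequency vectors. TF is therefore already computed in this step.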

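from pyspark.ml.feature import HashingTF, IDF
# wordsData is assumed to be a DataFrame with a tokenized "words" column
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)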
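
To make the hashing step concrete, here is a minimal pure-Python sketch of the hashing trick. It is not Spark's implementation: Spark uses MurmurHash3, whereas this sketch uses `zlib.crc32` as a stand-in, and `hashing_tf` is a hypothetical helper name.

```python
import zlib

def hashing_tf(words, num_features=20):
    """Map a list of tokens to a fixed-length vector of raw term counts."""
    vec = [0.0] * num_features
    for w in words:
        # Stand-in hash (assumption): Spark itself uses MurmurHash3.
        idx = zlib.crc32(w.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec

# "spark" occurs twice, so its bucket holds a count of at least 2.
tf = hashing_tf(["spark", "is", "fast", "spark"])
```

Because different tokens can hash to the same bucket, a small `num_features` such as 20 can merge the counts of unrelated terms.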

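The next step is IDF, an Estimator that is fit on a dataset and produces an IDFModel. IDF is incorporated in this step: the IDFModel down-weights tokens that appear frequently across the corpus.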

idf = IDF(inputCol="rawFeatures", outputCol="features")

The IDF estimator must be fit in order to produce a transformer. The final steps are therefore:

idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
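
So the `features` column holds tf * idf, not idf alone: `IDFModel.transform` scales each term-frequency vector element-wise by the learned idf weights, which are also available on their own as `idfModel.idf`. The sketch below reproduces that arithmetic in plain Python, using the smoothed formula given in the Spark documentation, idf(t) = log((|D| + 1) / (DF(t) + 1)). The function names and the toy count vectors are illustrative assumptions, not part of Spark.

```python
import math

def fit_idf(tf_vectors):
    """Compute one idf weight per feature from a list of term-count vectors."""
    n_docs = len(tf_vectors)
    n_features = len(tf_vectors[0])
    idf = []
    for j in range(n_features):
        # Document frequency: number of documents where feature j is present.
        df = sum(1 for v in tf_vectors if v[j] > 0)
        idf.append(math.log((n_docs + 1) / (df + 1)))
    return idf

def transform(tf_vectors, idf):
    """Scale each term-count vector element-wise by the idf weights."""
    return [[tf * w for tf, w in zip(v, idf)] for v in tf_vectors]

# Toy data (assumption): 3 documents, 3 features.
tf_vectors = [
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
]
idf = fit_idf(tf_vectors)
tfidf = transform(tf_vectors, idf)
```

Feature 0 appears in every document, so its idf weight is log(4/4) = 0 and it drops out of every tf-idf vector, while the rarer features keep a weight of log(2).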