The official PySpark documentation includes a TF-IDF example.
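I have seen similar code in other sources as well. My question: why is the DataFrame named tfidf? Does the result equal tf * idf, or does it only store idf? If it only stores idf, how do I compute tf * idf?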
Answer 0 (score: 0)
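As stated in the documentation, HashingTF is a Transformer that takes sets of terms (tokens) and converts them into fixed-length term-frequency vectors. TF is already computed in this step.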
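from pyspark.ml.feature import HashingTF, IDF
# wordsData is assumed to be a DataFrame with a tokenized "words" column
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)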
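To make the term-frequency step concrete, here is a minimal pure-Python sketch of the hashing trick. It is only an illustration: the function name `hashed_term_frequencies` is hypothetical, and `zlib.crc32` is a stand-in hash (Spark's HashingTF uses MurmurHash3, so the bucket indices will differ from Spark's).

```python
import zlib

def hashed_term_frequencies(tokens, num_features=20):
    """Count tokens per hash bucket, giving a fixed-length TF vector."""
    vector = [0.0] * num_features
    for token in tokens:
        # Stand-in hash for illustration; Spark uses MurmurHash3, not CRC32.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector

tokens = "i heard about spark and i love spark".split()
tf_vector = hashed_term_frequencies(tokens)
print(tf_vector)
```

The vector always has `num_features` entries, and its entries sum to the number of tokens. With a small `num_features` such as 20, different terms can collide in the same bucket, which is the usual trade-off of the hashing approach.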
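The next step, IDF, is an Estimator that is fit on a dataset and produces an IDFModel. IDF is incorporated in this step: the IDFModel down-weights tokens that appear frequently across the corpus.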
idf = IDF(inputCol="rawFeatures", outputCol="features")
The idf estimator must be fit to produce a transformer. So the final steps are:
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
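In other words, the "features" column of rescaledData already holds tf * idf, not idf alone: the fitted model scales each term-frequency entry by that term's IDF weight. The pure-Python sketch below mirrors that arithmetic on toy vectors, assuming the smoothed formula from the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), with the default minDocFreq of 0. The helper names `idf_weights` and `rescale` are hypothetical and not part of PySpark.

```python
import math

def idf_weights(tf_vectors):
    """One IDF weight per feature: log((m + 1) / (df + 1))."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    weights = []
    for j in range(num_features):
        # Document frequency: how many documents contain feature j.
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((num_docs + 1) / (doc_freq + 1)))
    return weights

def rescale(tf_vector, weights):
    """Element-wise tf * idf, analogous to what the fitted model's transform does."""
    return [tf * w for tf, w in zip(tf_vector, weights)]

# Toy term-frequency vectors for three documents (made-up data).
tf_vectors = [
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 3.0],
]
weights = idf_weights(tf_vectors)
tfidf_vectors = [rescale(vec, weights) for vec in tf_vectors]
print(weights)
print(tfidf_vectors)
```

Feature 0 occurs in every document, so its weight is log(4 / 4) = 0 and it vanishes from every rescaled vector, while the rarer features keep a weight of log(2). If only the IDF weights are needed, the fitted IDFModel in PySpark exposes them through its `idf` attribute.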