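I am trying to build a web scraper. I have a Python script and some JavaScript code. The Python script calls the JavaScript, which retrieves the relevant content from a web page and returns it to the Python script. The JavaScript works fine when I run it manually in the browser. Here is my JS code: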
var doc = ""
var path1 = document.getElementsByClassName("entry-header")[0]
doc = doc + path1.innerText
doc = doc + "\n"
var path2 = document.getElementsByClassName("entry-content")[0]
var cont = path2.getElementsByTagName("p")
for (var i=0; i<cont.length; i++)
{
doc = doc+cont[i].innerText
doc = doc+ "\n"
}
res()
function res()
{
return doc
}
Here is my Python code:
from selenium import webdriver
js = open("generalized.js", "r").read()
driver = webdriver.Firefox()
browser = webdriver.Firefox()
browser.get("http://www.geeksforgeeks.org/branch-and-bound-set-1-introduction-with-01-knapsack/")
result = driver.execute_script(js)
print result
But when the script is called from Python, I get the following error:
Traceback (most recent call last):
  File "sample.py", line 7, in <module>
    result = driver.execute_script(js)
  File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 543, in execute_script
    'args': converted_args})['value']
  File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 308, in execute
    self.error_handler.check_response(response)
  File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: TypeError: p[0] is undefined
Please help me solve this. Or is there another way to scrape web pages?
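Answer 0 (score: 0)
For some reason you start two browsers, but you execute the script in the one that is still showing a blank page (driver) rather than the one that loaded the URL (browser). This works for me: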
from selenium import webdriver
import time
js = open("generalized.js", "r").read()
browser = webdriver.Firefox()
browser.get("http://www.geeksforgeeks.org/branch-and-bound-set-1-introduction-with-01-knapsack/")
time.sleep(1) # try to replace with an Explicit Wait
result = browser.execute_script(js)
print(result)
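The comment on the time.sleep(1) line suggests replacing the fixed pause with an explicit wait. Selenium ships its own WebDriverWait class in selenium.webdriver.support.ui for this purpose. The sketch below is not that class. It is a hand-rolled polling loop in plain Python 3 that illustrates the same idea without needing a browser to try it out. The names wait_for_script and READY_CHECK are hypothetical, and the readiness condition (both class names from the question being present) is an assumption about when the page is ready.

```python
import time


def wait_for_script(driver, script, timeout=10.0, poll=0.5):
    """Run `script` repeatedly until it returns a truthy value.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver. Raises TimeoutError if no truthy
    value is returned before `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(script)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("script returned nothing within %.1f s" % timeout)
        time.sleep(poll)


# Assumed readiness condition: both containers used by the scraping
# script exist in the DOM.
READY_CHECK = (
    'return document.getElementsByClassName("entry-header").length > 0'
    ' && document.getElementsByClassName("entry-content").length > 0;'
)
```

With a helper like this, the time.sleep(1) line could become wait_for_script(browser, READY_CHECK), followed by the same browser.execute_script(js) call.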
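The modified script uses a top-level return doc. Selenium runs the script as the body of a function, so the value has to be returned at that level. Calling res() without returning its result passes nothing back to Python: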
var doc = "";
var path1 = document.getElementsByClassName("entry-header")[0];
doc = doc + path1.innerText;
doc = doc + "\n";
var path2 = document.getElementsByClassName("entry-content")[0];
var cont = path2.getElementsByTagName("p");
for (var i = 0; i < cont.length; i++)
{
    doc = doc + cont[i].innerText;
    doc = doc + "\n";
}
return doc;
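The question also asks whether there are other ways to scrape the page. The answer does not propose one, so the following is only a sketch of one browser-free option: parsing the HTML with Python 3's standard html.parser module and collecting the same text as the JavaScript above. The names EntryTextParser and extract_entry_text are hypothetical. It assumes the entry-header and entry-content class names from the question, that the content is present in the raw HTML rather than rendered by JavaScript, and that elements have explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EntryTextParser(HTMLParser):
    """Collect header text and <p> text inside the entry-content element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []     # (tag, label) for every element currently open
        self.header_parts = []  # text found inside the entry-header element
        self.paragraphs = []    # finished paragraph strings
        self.current_p = []     # text pieces of the paragraph being read

    def _labels(self):
        return [label for _, label in self.open_tags]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        label = None
        if "entry-header" in classes:
            label = "header"
        elif "entry-content" in classes:
            label = "content"
        elif tag == "p" and "content" in self._labels():
            label = "p"
            self.current_p = []
        self.open_tags.append((tag, label))

    def handle_endtag(self, tag):
        if tag not in [t for t, _ in self.open_tags]:
            return  # stray or void closing tag: ignore it
        # Pop until the matching element, tolerating unclosed inner tags.
        while self.open_tags:
            open_tag, label = self.open_tags.pop()
            if label == "p":
                self.paragraphs.append("".join(self.current_p).strip())
            if open_tag == tag:
                break

    def handle_data(self, data):
        labels = self._labels()
        if "p" in labels:
            self.current_p.append(data)
        elif "header" in labels:
            self.header_parts.append(data)


def extract_entry_text(html):
    """Return the header text, then one line per paragraph."""
    parser = EntryTextParser()
    parser.feed(html)
    parser.close()
    header = "".join(parser.header_parts).strip()
    return header + "\n" + "".join(p + "\n" for p in parser.paragraphs)
```

The HTML string could come from urllib.request.urlopen(url).read().decode("utf-8"). If the site builds its content with JavaScript, this approach returns nothing useful and the Selenium route above remains the better fit.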