Scraping a JavaScript URL, but Selenium returns an empty string

Asked: 2016-06-09 16:00:12

Tags: javascript selenium selenium-webdriver web-scraping

I am trying to open, and then scrape data from, a URL contained in a script tag that looks like this:

<script src="http://includes.mpt-static.com/data/7CE5047496" type="text/javascript" charset="utf-8"></script>

I tried using Selenium to retrieve/open the URL, but it only returns a blank string. I suspect this is related to the fact that when I click the src URL directly (from the page source), a page opens containing the data table I want, yet when I copy the URL and paste it into the browser myself, it comes back empty. Also, a new src URL is generated every time the page is reloaded. Does anyone know why this happens?

URL: view-source:http://mypricetrack.com/amazon/B00N2BW2PK

My code:

import time
import csv
import json
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
from selenium import webdriver

# FAKE USER-AGENT
ua = UserAgent(cache=False)
headers = {'User-Agent': ua.random}


#SENDING REQUEST TO PRICETRACKER WEBSITE
product = 'B00N2BW2PK'
page = requests.get('http://www.mypricetrack.com/amazon/' + str(product), headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())

#GETTING URL FOR DATA
data_link = []
for tag in soup.findAll('script',{'charset':'utf-8'}):
    data_link = data_link + [tag['src']]
string2 = data_link[1]
print string2
#OPENING URL FOR DATA

driver = webdriver.Firefox()
driver.get(string2)
time.sleep(5)
htmlSource = driver.page_source
print htmlSource

1 Answer:

Answer 0 (score: 0)

The JavaScript will not be downloaded unless you request it with the proper "Referer" header.
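
As a quick sanity check, here is a minimal sketch (assuming the diagnosis above is correct; the src URL below is just the example from the question and is regenerated on every page load, so you would normally re-extract it first) that compares the response with and without the Referer header:

import requests

# Example src URL from the question; it changes on every page load,
# so re-extract it from the product page before running this.
script_src = 'http://includes.mpt-static.com/data/7CE5047496'
product_page = 'http://mypricetrack.com/amazon/B00N2BW2PK'

# Without a Referer header the server is expected to return an empty body
bare = requests.get(script_src)
print('without Referer: %d chars' % len(bare.text))

# With the product page as the Referer, the JS payload should come back
with_ref = requests.get(script_src, headers={'Referer': product_page})
print('with Referer: %d chars' % len(with_ref.text))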

Selenium is overkill here anyway; you can fetch it with Python requests:

import requests
import re
from bs4 import BeautifulSoup

# Emulate a browser with proper headers
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,es;q=0.6'
})

# Go to the product page and parse it
product_page = 'http://mypricetrack.com/amazon/B00N2BW2PK'
res = session.get(product_page)
soup = BeautifulSoup(res.text, 'html.parser')

# Find the data script link
link = soup.find('script', {'src': re.compile('http://includes.mpt-static.com/data')})
link_src = link['src']

# Get the JS content, sending the product page as the Referer
js_content = session.get(link_src, headers={'Referer': product_page}).text
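
If the fetched file turns out to be JSON (an assumption; the question imports json, but the payload itself is not shown here), decoding it is one more step:

import json

try:
    data = json.loads(js_content)  # js_content is the text fetched above
    print(data)
except ValueError:
    # Not plain JSON; it may be a JS snippet wrapping the data, so inspect it first
    print(js_content[:200])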