I am writing some parsing scripts and need to access many web pages, for example this one.
Every time I try to fetch such a page with urlopen
and then read(),
I get redirected to this page.
When I open the same link in Google Chrome the redirect also happens, but rarely; it mostly happens when I open the url directly, rather than by clicking it from the site menu.
Is there a way to dodge the redirect, or to simulate jumping to the url from the site menu, using python3?
Example code:
import re
from urllib.request import urlopen

def getItemsFromPage(url):
    with urlopen(url) as page:
        html_doc = str(page.read())
        return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
with urlopen(item_urls[0]) as item_page:
    print(item_page.read().decode('utf-8'))  # here I get search.advanced instead of the item page
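A minimal check that shows the redirect happening (a sketch only; the orgid value below is a placeholder, not a real id from the site):

from urllib.request import urlopen

# sketch: geturl() reports the final url after any redirect was followed;
# the orgid below is a placeholder
url = 'http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=12345'
with urlopen(url) as page:
    print(page.geturl())  # prints the search.advanced url whenever the redirect happens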
Answer 0 (score: 0)
Your problem is that "&amp;" is not replaced with "&" in the url string. I rewrote your code using urllib3 as below and got the expected web page.
import re
import urllib3

def getItemsFromPage(url):
    # create connection pool object (urllib3-specific)
    localpool = urllib3.PoolManager()
    with localpool.request('GET', url) as page:
        html_doc = page.data.decode('utf-8')
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

# the master webpage
url_master = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'

# name and store the downloaded contents for testing purposes
file_folder = "R:"
file_mainname = "test"

# parse the master webpage
items_urls = getItemsFromPage(url_master)

# create pool
mypool = urllib3.PoolManager()

i = 0
for url in items_urls:
    # file name to be saved
    file_name = file_folder + "\\" + file_mainname + str(i) + ".htm"
    # replace '&amp;' with '&'
    url_OK = re.sub(r'&amp;', r'&', url)
    # print revised url
    print(url_OK)
    ### the urllib3-pythonic way of web page retrieval ###
    with mypool.request('GET', url_OK) as page, open(file_name, 'w') as f:
        f.write(page.data.decode('utf-8'))
    i += 1
(verified on python 3.4, eclipse PyDev, win7 x64)
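The fix itself does not depend on urllib3; here is a minimal standard-library sketch of the same "&amp;"-to-"&" replacement (the orgid below is a placeholder):

from urllib.request import urlopen

# minimal sketch of the same fix with only the standard library;
# the orgid below is a placeholder, not a real value from the site
url = 'http://www.charitynavigator.org/index.cfm?bay=search.summary&amp;orgid=12345'
url_OK = url.replace('&amp;', '&')  # the same substitution re.sub performs above
with urlopen(url_OK) as page:
    print(page.geturl())  # should now be the item page, not the redirect target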
Answer 1 (score: 0)
In fact, the ampersand in the raw html data is the tricky part. When you visit the webpage and click a link, the web browser reads the escaped ampersand ("&amp;") as "&", so it works. Python, however, reads the data as-is, i.e. as raw data. So:
import urllib.request as net
import html  # html.unescape() replaces HTMLParser.unescape(), which was removed in Python 3.9
import re

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
}

def unescape(items):
    # turn '&amp;' (and any other html entities) back into plain characters
    unescaped = []
    for i in items:
        unescaped.append(html.unescape(i))
    return unescaped

def getItemsFromPage(url):
    request = net.Request(url, headers=headers)
    response = str(net.urlopen(request).read())
    # --------------------------
    # FIX AMPERSANDS - unescape
    # --------------------------
    links = re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', response)
    unescaped_links = unescape(links)
    return unescaped_links

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)

request = net.Request(item_urls[0], headers=headers)
print(item_urls)
response = net.urlopen(request)

# DEBUG RESPONSE
print(response.url)
print(80 * '-')
print("<title>Charity Navigator Rating - 10,000 Degrees</title>" in response.read().decode('utf-8'))