I am writing some parsing scripts and need to access many web pages, for example this one.
Every time I try to fetch such a page with urlopen
and then read(),
I get redirected to this page.
When I open the same link in Google Chrome the redirect also happens, but rarely; it mostly happens when I open the url directly, rather than by clicking it from the site menu.
Is there a way to dodge the redirect, or to simulate jumping to the url from the site menu, using python3?
Example code:
import re
from urllib.request import urlopen

def getItemsFromPage(url):
    with urlopen(url) as page:
        html_doc = str(page.read())
        return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
with urlopen(item_urls[0]) as item_page:
    print(item_page.read().decode('utf-8'))  # here I get search.advanced instead of the item page
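A minimal check that shows the redirect happening (a sketch only; the orgid value below is a placeholder, not a real id from the site):

from urllib.request import urlopen

# sketch: geturl() reports the final url after any redirect was followed;
# the orgid below is a placeholder
url = 'http://www.charitynavigator.org/index.cfm?bay=search.summary&orgid=12345'
with urlopen(url) as page:
    print(page.geturl())  # prints the search.advanced url whenever the redirect happens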
Answer 0 (score: 0)
Your problem is that "&amp;" is not replaced with "&" in the url string. I rewrote your code using urllib3 as below and got the expected web page.
import re
import urllib3

def getItemsFromPage(url):
    # create connection pool object (urllib3-specific)
    localpool = urllib3.PoolManager()
    with localpool.request('GET', url) as page:
        html_doc = page.data.decode('utf-8')
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

# the master webpage
url_master = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'

# name and store the downloaded contents for testing purposes
file_folder = "R:"
file_mainname = "test"

# parse the master webpage
items_urls = getItemsFromPage(url_master)

# create pool
mypool = urllib3.PoolManager()

i = 0
for url in items_urls:
    # file name to be saved
    file_name = file_folder + "\\" + file_mainname + str(i) + ".htm"
    # replace '&amp;' with '&'
    url_OK = re.sub(r'&amp;', r'&', url)
    # print revised url
    print(url_OK)
    ### the urllib3-pythonic way of web page retrieval ###
    with mypool.request('GET', url_OK) as page, open(file_name, 'w') as f:
        f.write(page.data.decode('utf-8'))
    i += 1
(verified on python 3.4, eclipse PyDev, win7 x64)
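The fix itself does not depend on urllib3; here is a minimal standard-library sketch of the same "&amp;"-to-"&" replacement (the orgid below is a placeholder):

from urllib.request import urlopen

# minimal sketch of the same fix with only the standard library;
# the orgid below is a placeholder, not a real value from the site
url = 'http://www.charitynavigator.org/index.cfm?bay=search.summary&amp;orgid=12345'
url_OK = url.replace('&amp;', '&')  # the same substitution re.sub performs above
with urlopen(url_OK) as page:
    print(page.geturl())  # should now be the item page, not the redirect target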
Answer 1 (score: 0)
In fact, the ampersand in the raw html data is the tricky part. When you visit the webpage and click a link, the web browser reads the escaped ampersand ("&amp;") as "&", so it works. Python, however, reads the data as-is, i.e. as raw data. So:
import urllib.request as net
import html  # html.unescape() replaces HTMLParser.unescape(), which was removed in Python 3.9
import re

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
}

def unescape(items):
    # turn '&amp;' (and any other html entities) back into plain characters
    unescaped = []
    for i in items:
        unescaped.append(html.unescape(i))
    return unescaped

def getItemsFromPage(url):
    request = net.Request(url, headers=headers)
    response = str(net.urlopen(request).read())
    # --------------------------
    # FIX AMPERSANDS - unescape
    # --------------------------
    links = re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', response)
    unescaped_links = unescape(links)
    return unescaped_links

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)

request = net.Request(item_urls[0], headers=headers)
print(item_urls)
response = net.urlopen(request)

# DEBUG RESPONSE
print(response.url)
print(80 * '-')
print("<title>Charity Navigator Rating - 10,000 Degrees</title>" in response.read().decode('utf-8'))