URL changes when using a proxy with Selenium

Time: 2019-03-04 20:29:47

Tags: python selenium proxy

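I'm new to web scraping, so please forgive my ignorance.

I built a program that scrapes Zillow, and for the most part everything works. My problem is that I'm using a proxy service called proxycrawl, which makes it easy to integrate proxies into a program. This is done by placing https://api.proxycrawl.com/?token=xxx&url= in front of my actual URL. What I've noticed is that when the program clicks on an "a" tag, the URL changes as in the following example:

Before: Before Click

After: After Click

Any click, whether made by the program or manually, causes the site to change to the proxycrawl site, where I get a 404 error. Any ideas?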

import csv
import time
import urllib.parse

from selenium import webdriver

# Open the browser
print(".....Opening Browser.....")
Browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
Browser.maximize_window()

# Load the search page through the proxy (the target URL is encoded once)
url = urllib.parse.quote_plus('https://www.zillow.com/homes/for_sale/Bakersfield-CA-93312/house,mobile,land,townhouse_type/97227_rid/35.4606,-119.037467,35.317856,-119.200888_rect/12_zm/0_mmm/')
Browser.get('https://api.proxycrawl.com/?token=xxx&url=' + url)
print("Opening Zillow")
time.sleep(10)

last_page = int(Browser.find_element_by_xpath("""//ol[@class="zsg-pagination"]//li[last()-1]""").text)
# print(last_page)
page = 0
count = 0

# newline='' stops the csv module from writing blank rows on some platforms
csv_file = open('listings.csv', 'w', newline='')

fieldnames = ['address', 'price', 'zestimate', 'beds', 'baths', 'feet', 'desc', 'Type', 'year_built', 'heating', 'cooling', 'parking', 'lot',
               'days_on_market', 'pricepsqr', 'saves', 'interior', 'spaces_amenities', 'construction', 'exterior', 'parking1', 'mls', 'other']

writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

writer.writeheader()
for _ in range(last_page):  # the inner loop reuses `i`, so the page loop does not bind it
    page = page + 1
    n = 0
    listings = Browser.find_elements_by_xpath("""//*[@id="search-results"]/ul/li""")

    for i in range(len(listings)):
        n = i + 1

        listing_dict = {}

        print("Scraping the listing number {0} on page {1}, the count is {2}".format(n, page, count))
        if (count) % 11 == 0:
            listings = Browser.find_elements_by_xpath('//*[@id="search-results"]/ul/li')
            time.sleep(2)

            try:
                # Find the listings
                listings = Browser.find_elements_by_xpath("""//*[@id="search-results"]/ul/li""")
                print("Looking Up listings")

                # Open the listing
                listings[i].find_elements_by_tag_name('a')[0].click()
                print("Opening Listing")
                time.sleep(2)

                # Open the "See More" tab
                Browser.find_element_by_partial_link_text('See More').click()

                # Prepare for the scrape
                time.sleep(2)

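I did talk to proxycrawl, and they said the URL has to be encoded, but I had no luck with that. After encoding, I wrote back and received the following statement:

"The requests you are sending are double encoded, and they are getting a response of pc_status: 602. Those requests are failing and should be fixed. Please encode the URLs only once; encoding the URLs more than once will result in a failing request."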
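
To make the "double encoded" message concrete, here is a minimal sketch of what encoding a URL once versus twice looks like with `urllib.parse.quote_plus`. The shortened Zillow URL is only an illustrative value, and the sketch does not claim to show where the second encoding happened in the program above; it only shows how to recognize it.

```python
from urllib.parse import quote_plus, unquote_plus

# Illustrative target URL (shortened from the one in the question).
target = "https://www.zillow.com/homes/for_sale/Bakersfield-CA-93312/"

# Encoded once: ':' becomes %3A and '/' becomes %2F.
once = quote_plus(target)

# Encoded twice: every '%' from the first pass becomes '%25',
# so '%3A' turns into '%253A'. This is the double-encoded form.
twice = quote_plus(once)

print(once)
print(twice)

# Decoding a single-encoded value returns the original URL, while
# decoding a double-encoded value only gets back to the single-encoded one.
print(unquote_plus(once) == target)
print(unquote_plus(twice) == once)
```

A quick way to spot the problem in a request URL is to look for `%25` sequences such as `%253A` or `%252F` in the `url=` parameter.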

1 answer:

Answer 0: (score: 0)

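It looks like the page is trying to redirect you with a relative URL. Because the browser loaded the page from api.proxycrawl.com, a relative link such as /homes/... is resolved against that host instead of www.zillow.com, which is why you end up on a proxycrawl 404 page.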
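
As a small illustration of that resolution rule, `urllib.parse.urljoin` follows the same logic a browser uses for relative links. The listing path below is a made-up example, not a real Zillow URL.

```python
from urllib.parse import urljoin

# The page as the browser sees it: served from the proxy host.
proxied_page = (
    "https://api.proxycrawl.com/?token=xxx"
    "&url=https%3A%2F%2Fwww.zillow.com%2Fhomes%2Ffor_sale%2F"
)

# Hypothetical relative link found in the page's HTML.
relative_href = "/homedetails/example-listing/"

# Resolved against the proxied page, the link points at the proxy host.
resolved_via_proxy = urljoin(proxied_page, relative_href)

# Resolved against the real site, it points where the page author intended.
resolved_direct = urljoin("https://www.zillow.com/homes/for_sale/", relative_href)

print(resolved_via_proxy)
print(resolved_direct)
```

The first result has lost both the token and the `url=` parameter, so the proxy has nothing to fetch.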

In this particular use case, you can work around the encoding issue by doing the following:

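from urllib.parse import quote_plus

proxy_host = 'https://api.proxycrawl.com'
base_url = proxy_host + '/?token=xxx&url='
site = 'https://www.zillow.com'  # the real site the path belongs to

# e.g. https://api.proxycrawl.com/homes/for_sale/Test/one,two
x = driver.current_url

# Strip the proxy host, leaving /homes/for_sale/Test/one,two
r = x[len(proxy_host):]

# Re-attach the path to the real site and encode it exactly once
u = base_url + quote_plus(site + r)

driver.get(u)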
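
The strip-and-prefix idea above can also be wrapped in a small helper, so that a link's target is routed through the proxy before navigating rather than repaired after a click. This is only a sketch: `through_proxy`, `PROXY_PREFIX`, `PROXY_HOST` and `SITE_ORIGIN` are names chosen for illustration, the token is a placeholder, and the code assumes every link that is relative or proxy-hosted really belongs to www.zillow.com.

```python
from urllib.parse import quote_plus, urlsplit

# Assumed values for illustration; replace the token with a real one.
PROXY_PREFIX = "https://api.proxycrawl.com/?token=xxx&url="
PROXY_HOST = "api.proxycrawl.com"
SITE_ORIGIN = "https://www.zillow.com"


def through_proxy(url):
    """Return `url` routed through the proxy, encoding the target once."""
    if url.startswith(PROXY_PREFIX):
        # Already a proxied URL: leave it alone to avoid double encoding.
        return url

    parts = urlsplit(url)
    if parts.netloc in ("", PROXY_HOST):
        # Relative link, or one the browser resolved against the proxy
        # host: re-attach the path (and query, if any) to the real site.
        # A path without a leading slash is treated as root-relative.
        path = parts.path if parts.path.startswith("/") else "/" + parts.path
        target = SITE_ORIGIN + path
        if parts.query:
            target += "?" + parts.query
    else:
        # Absolute URL on some other host: proxy it unchanged.
        target = url

    return PROXY_PREFIX + quote_plus(target)


print(through_proxy("https://api.proxycrawl.com/homes/for_sale/Test/one,two"))
```

With Selenium, one possible use is to read each link's `href` with `get_attribute('href')` and call `driver.get(through_proxy(href))` instead of clicking the element, so the browser never follows the relative link itself.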