How to stop a Selenium scraper from redirecting to another internal web link of the scraped website?

Time: 2020-09-01 16:12:24

Tags: python selenium-webdriver xpath web-scraping

Wondering if anyone knows a way to instruct a Selenium script to avoid visiting/redirecting to an internal page that is not part of the code. Essentially, my code opens this page:
https://cryptwerk.com/companies/?coins=1,6,11,2,3,8,17,7,13,4,25,29,24,32,9,38,15,30,43,42,41,12,40,44,20

It keeps clicking the "Show more" button until there is none left (end of page). By that point it should have collected the links of all the products listed on the page as it scrolled to the end, and then it should visit each link individually.

What happens is that it successfully clicks "Show more" until the end of the page, but then it visits this weird promotional page of the same website instead of following each collected link individually and scraping further data points from each of those newly opened pages.

In short, if someone could explain how to avoid this automatic redirect, it would be much appreciated! Here is the code, in case anyone can nudge me in the right direction :)

from selenium.webdriver import Chrome
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import selenium.common.exceptions as exception
import pandas as pd
import time
import json


# Name the path variable distinctly so it does not shadow the selenium
# `webdriver` module
chromedriver_path = '/Users/karimnabil/projects/selenium_js/chromedriver-1'
driver = Chrome(chromedriver_path)
driver.implicitly_wait(5)

url = 'https://cryptwerk.com/companies/?coins=1,6,11,2,3,8,17,7,13,4,25,29,24,32,9,38,15,30,43,42,41,12,40,44,20'
driver.get(url)
links_list = []
coins_list = []

all_names = []
all_cryptos = []
all_links = []
all_twitter = []
all_locations = []
all_categories = []
all_categories2 = []

wait = WebDriverWait(driver, 2)
sign_in = driver.find_element_by_xpath("//li[@class='nav-item nav-guest']/a")
sign_in.click()
time.sleep(2)

user_name = wait.until(EC.presence_of_element_located((By.XPATH, "//input[@name='login']")))
user_name.send_keys("karimnsaber95@gmail.com")

password = wait.until(EC.presence_of_element_located((By.XPATH, "//input[@name='password']")))
password.send_keys("PleomaxCW@2")

signIn_Leave = driver.find_element_by_xpath("//div[@class='form-group text-center']/button")
signIn_Leave.click()
time.sleep(3)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='btn btn-outline-primary']")
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(2)
    except (exception.NoSuchElementException, exception.StaleElementReferenceException):
        break
print('no more elements to show')

try:
    company_links = driver.find_elements_by_xpath("//div[@class='companies-list items-infinity']/div[position() > 3]/div[@class='media-body']/div[@class='title']/a")
    for link in company_links:
        links_list.append(link.get_attribute('href'))
except:
    pass

try:
    with open("links_list.json", "w") as f:
        json.dump(links_list, f)

    with open("links_list.json", "r") as f:
        links_list = json.load(f)
except:
    pass
    
try:
    for link in links_list:
        driver.get(link)
        name = driver.find_element_by_xpath("//div[@class='title']/h1").text
        try:
            show_more_coins = driver.find_element_by_xpath("//a[@data-original-title='Show more']")
            show_more_coins.click()
            time.sleep(1) 
        except:
            pass
        
        try:
            categories = driver.find_elements_by_xpath("//div[contains(@class, 'categories-list')]/a")
            categories_list = []
            for category in categories:
                categories_list.append(category.text)
        except:
            pass
        try:
            top_page_categories = driver.find_elements_by_xpath("//ol[@class='breadcrumb']/li/a")
            top_page_categories_list = []
            for category in top_page_categories:
                top_page_categories_list.append(category.text)
        except:
            pass


        coins_links = driver.find_elements_by_xpath("//div[contains(@class, 'company-coins')]/a")
        all_coins = []
        for coin in coins_links:
            all_coins.append(coin.get_attribute('href'))
        try:
            location = driver.find_element_by_xpath("//div[@class='addresses mt-3']/div/div/div/div/a").text
        except:
            pass

        try:
            twitter = driver.find_element_by_xpath("//div[@class='links mt-2']/a[2]").get_attribute('href')
        except:
            pass
            
        try:
            print('-----------')
            print('Company name is: {}'.format(name))
            print('Potential Categories are: {}'.format(categories_list))
            print('Potential top page categories are: {}'.format(top_page_categories_list))
            print('Supporting Crypto is:{}'.format(all_coins))
            print('Registered location is: {}'.format(location))
            print('Company twitter profile is: {}'.format(twitter))
            time.sleep(1)
        except:
            pass

        all_names.append(name)
        all_categories.append(categories_list)
        all_categories2.append(top_page_categories_list)
        all_cryptos.append(all_coins)
        all_twitter.append(twitter)
        all_locations.append(location)

except:
    pass


df = pd.DataFrame(list(zip(all_names, all_categories, all_categories2, all_cryptos, all_twitter, all_locations)), columns=['Company name', 'Categories1', 'Categories2', 'Supporting Crypto', 'Twitter Handle', 'Registered Location'])

# df.to_csv returns None, so there is no point assigning its result
df.to_csv('CryptoWerk4.csv', index=False)
1 Answer:

Answer 0: (score: 1)

A redirect call can happen for two reasons; in your case it is triggered either by some JavaScript code executing on the last click of the "load more" button, or by receiving an HTTP 3xx status code (the least likely in your case). So you need to identify when this JavaScript code executes, send an ESC key before the new page loads, and then carry on with the rest of your script.

You could also scrape the links and append them to your list before clicking the "load more" button, and every time you click it, verify with an if statement which page you are on: if it is the promo page, execute the rest of your code; otherwise, click "load more" again.

  page_is_same = True
  start_url = get_current_page_link()
  while page_is_same:
    scrape_elements_add_to_list()
    click_load_more()
    if get_current_page_link() != start_url:
      page_is_same = False
  # rest of the code here
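A runnable version of that idea might look like the sketch below. The `is_same_page` helper is pure so it can be tested without a browser; the loop in the comments is illustrative and assumes a live Selenium `driver` plus hypothetical `collect_links`/`click_load_more` helpers:

```python
def is_same_page(current_url, start_url):
    """Return True if the browser is still on the listing page,
    ignoring query-string and trailing-slash differences."""
    strip = lambda u: u.split('?', 1)[0].rstrip('/')
    return strip(current_url) == strip(start_url)

# Illustrative loop (requires a live Selenium `driver`):
#
# start_url = driver.current_url
# while True:
#     collect_links(driver)                 # hypothetical helper
#     if not click_load_more(driver):       # hypothetical helper
#         break                             # no button left: end of page
#     if not is_same_page(driver.current_url, start_url):
#         driver.back()                     # undo the promo-page redirect
#         break
# # rest of the code here
```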