Selenium retrieval in a loop giving NoSuchElementException

Time: 2018-09-24 20:42:22

Tags: python-3.x selenium selenium-webdriver web-scraping selenium-chromedriver

I have a simple web page containing a list of 100 articles that I want to scrape. After the page loads, JavaScript runs in the background and retrieves the list of articles. The link to the page is: https://tools.wmflabs.org/topviews/?project=en.wikinews.org&platform=all-access&date=2016-01&excludes=Main_page

To retrieve the top 30 articles, I wrote the following code.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import numpy as np
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import os
from time import sleep
from datetime import datetime

class WikiNewsExtraction():

    def __init__(self):

        self.all_articles_name = []
        self.all_articles_links = []
        self.initial_month = '2016-01'
        # implicit string concatenation keeps the long URL readable
        # without baking stray whitespace into it
        self.fixed_url = ('https://tools.wmflabs.org/topviews/?'
                          'project=en.wikinews.org&platform=all-access&date=')
        self.exclude_page = '&excludes=Main_page'
        self.id_of_first_article_name = '//*[@id="topview-entry-1"]/td[2]/div'
        self.number_of_links_extracted = 30
        self.beginning_year = 2016
        self.end_year = 2018

    def setup_selenium_driver(self):

        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--window-size=1024x1400")

        # download Chrome Webdriver
        # https://sites.google.com/a/chromium.org/chromedriver/download
        # put driver executable file in the script directory
        chrome_driver_path = os.path.join(os.getcwd(), "chromedriver")
        self.driver = webdriver.Chrome(options=chrome_options,
                                       executable_path=chrome_driver_path)


    def load_articles(self):

        self.all_articles_name.append('Article Name')
        self.all_articles_links.append('Article Link')

        for year in range(self.beginning_year, self.end_year + 1):

            # the current year only has data up to last month
            if year == self.end_year:
                end_month = datetime.today().month - 1
            else:
                end_month = 12

            for month in range(1, end_month + 1):

                # zero-pad the month to build the YYYY-MM date parameter
                if month < 10:
                    current_month = str(year) + '-0' + str(month)
                else:
                    current_month = str(year) + '-' + str(month)

                url = self.fixed_url + current_month + self.exclude_page

                print('url = ' + str(url))

                self.driver.get(url)
                self.driver.implicitly_wait(100)

                for index in range(self.number_of_links_extracted):

                    xpath = self.id_of_first_article_name.replace('1', str(index + 1))
                    dom_element = self.driver.find_element_by_xpath(xpath)
                    article_name = dom_element.text
                    article_link = dom_element.get_attribute('href')
                    self.all_articles_name.append(article_name)
                    self.all_articles_links.append(article_link)

                print('Done ' + str(month) + ',' + str(year) + '..')

        np.savetxt("Extracted_Data.csv",
                   np.column_stack((self.all_articles_name, self.all_articles_links)),
                   delimiter=",", fmt='%s')





extract = WikiNewsExtraction()
extract.setup_selenium_driver()
extract.load_articles()

At runtime it throws a NoSuchElementException for xpath = //*[@id="topview-entry-3"]/td[2]/div, i.e. for the third entry of the table, or for other numbered entries. However, if the extraction is not run in a loop and an article is fetched directly with the same XPath, it returns the correct data. I don't understand why this happens. I have already tried waiting longer in implicitly_wait(), and I have tried driver.refresh(), but the problem persists.
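
The script already imports By, WebDriverWait and expected_conditions at the top, so an explicit wait is easy to sketch; here is a minimal example for a single entry (the 10-second timeout and the entry number are arbitrary choices, not values from the script):

# wait up to 10 seconds for the entry row to be present before reading it
xpath = '//*[@id="topview-entry-3"]/td[2]/div'
dom_element = WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.XPATH, xpath))
)
print(dom_element.text)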

Please help.

1 Answer:

Answer 0 (score: 0):

If you observe carefully: while iterating the index in the for loop below, on the third iteration the loop index is 2, and that digit appears to get replaced too; the '2' in td[2] is substituted as well, which invalidates the XPath:

self.id_of_first_article_name = '//*[@id="topview-entry-1"]/td[2]/div'
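
A quick illustration of why replacing a bare digit is fragile. With a hypothetical template where the entry number happens to match the digit in the td index, both occurrences get replaced:

template = '//*[@id="topview-entry-2"]/td[2]/div'
# both '2's are replaced, corrupting the td index as well
print(template.replace('2', '5'))   # //*[@id="topview-entry-5"]/td[5]/div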

So take this approach instead: give the entry number a unique placeholder name, e.g. ART_ENTRY, and replace that placeholder on each iteration, like this:

self.id_of_first_article_name = '//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'

The for loop:

for index in range(self.number_of_links_extracted):
    xpath = self.id_of_first_article_name.replace('ART_ENTRY', str(index + 1))
    dom_element = self.driver.find_element_by_xpath(xpath)
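
An equivalent alternative, not part of the original answer, is to build the XPath with an f-string so there is no placeholder to replace at all; a minimal sketch:

for index in range(self.number_of_links_extracted):
    # the entry number is interpolated directly; nothing is replaced
    xpath = f'//*[@id="topview-entry-{index + 1}"]/td[2]/div'
    dom_element = self.driver.find_element_by_xpath(xpath)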

A sample snippet that works well:

from selenium import webdriver

# assumes the chromedriver executable is on the PATH
driver = webdriver.Chrome()
driver.implicitly_wait(5)
driver.get('https://tools.wmflabs.org/topviews/?project=en.wikinews.org&platform=all-access&date=2016-01&excludes=Main_page')

id_of_first_article_name = '//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'
number_of_links_extracted = 30

for index in range(number_of_links_extracted):
    xpath = id_of_first_article_name.replace('ART_ENTRY', str(index + 1))
    # encode so article names with non-ASCII characters print safely
    print(driver.find_element_by_xpath(xpath).text.encode('utf-8'))