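I have a simple web page that lists 100 articles I want to scrape. After the page loads, JavaScript runs in the background and retrieves the list of articles. The page is at: https://tools.wmflabs.org/topviews/?project=en.wikinews.org&platform=all-access&date=2016-01&excludes=Main_page
To retrieve the first 30 articles, I have written this code.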
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import numpy as np
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import os
from time import sleep
from datetime import datetime
class WikiNewsExtraction():
    def __init__(self):
        self.all_articles_name = []
        self.all_articles_links = []
        self.initial_month = '2016-01'
        self.fixed_url = ('https://tools.wmflabs.org/topviews/?'
                          'project=en.wikinews.org&platform=all-access&date=')
        self.exclude_page = '&excludes=Main_page'
        self.id_of_first_article_name = '//*[@id="topview-entry-1"]/td[2]/div'
        self.number_of_links_extracted = 30
        self.beginning_year = 2016
        self.end_year = 2018

    def setup_selenium_driver(self):
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--window-size=1024x1400")
        # download Chrome Webdriver
        # https://sites.google.com/a/chromium.org/chromedriver/download
        # put driver executable file in the script directory
        chrome_driver_path = os.path.join(os.getcwd(), "chromedriver")
        self.driver = webdriver.Chrome(options=chrome_options,
                                       executable_path=chrome_driver_path)

    def load_articles(self):
        self.all_articles_name.append('Article Name')
        self.all_articles_links.append('Article Link')
        for year in range(self.beginning_year, self.end_year+1):
            if year == self.end_year:
                end_month = datetime.today().month - 1
            else:
                end_month = 12
            for month in range(1, end_month+1):
                if month < 10:
                    current_month = str(year)+'-0'+str(month)
                else:
                    current_month = str(year)+'-'+str(month)
                url = self.fixed_url+current_month+self.exclude_page
                print('url = '+str(url))
                self.driver.get(url)
                self.driver.implicitly_wait(100)
                for index in range(self.number_of_links_extracted):
                    xpath = self.id_of_first_article_name.replace('1', str(index+1))
                    dom_element = self.driver.find_element_by_xpath(xpath)
                    article_name = dom_element.text
                    article_link = dom_element.get_attribute('href')
                    self.all_articles_name.append(article_name)
                    self.all_articles_links.append(article_link)
                print('Done '+str(month)+','+str(year)+'..')
        np.savetxt("Extracted_Data.csv",
                   np.column_stack((self.all_articles_name, self.all_articles_links)),
                   delimiter=",", fmt='%s')

extract = WikiNewsExtraction()
extract.setup_selenium_driver()
extract.load_articles()
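When I run it, I get an error for xpath = //*[@id="topview-entry-3"]/td[2]/div, that is, for the third entry in the table, or for some other numbered entry. If I do not extract in a loop and instead fetch an article directly with the XPath above, it returns the correct data. I don't understand why this happens. I have already tried a longer wait in implicitly_wait() and I have tried driver.refresh(), but the problem persists.
Please help.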
Answer 0 (score: 0)
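If you look closely, the for loop builds each XPath by replacing a bare digit in the template below. On the third iteration the loop index is 2, and it looks as though the substitution also disturbs the '2' in td[2], which would leave the XPath stale. Replacing a bare digit is fragile in any case, because str.replace rewrites every occurrence of that character in the template, not only the entry number.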
self.id_of_first_article_name = '//*[@id="topview-entry-1"]/td[2]/div'
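So use this approach instead: give the article number a unique placeholder name, for example ART_ENTRY, and replace that placeholder on each iteration, as shown below: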
self.id_of_first_article_name = '//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'
The for loop:
for index in range(self.number_of_links_extracted):
    xpath = self.id_of_first_article_name.replace('ART_ENTRY', str(index+1))
    dom_element = self.driver.find_element_by_xpath(xpath)
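The effect of the placeholder can be checked without a browser, using plain string handling. The sketch below is only an illustration: `fill_by_digit` and `fill_by_placeholder` are hypothetical helper names, and `collision_template` (with `td[1]`) is a made-up template, not one taken from the page, chosen to show how replacing a bare digit can rewrite a second index in the XPath.

```python
# Illustrative helpers (hypothetical names); no Selenium needed here.

PLACEHOLDER_TEMPLATE = '//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'


def fill_by_digit(template, entry_number):
    """Replace every '1' in the template with the entry number."""
    return template.replace('1', str(entry_number))


def fill_by_placeholder(template, entry_number):
    """Replace only the ART_ENTRY marker with the entry number."""
    return template.replace('ART_ENTRY', str(entry_number))


# Made-up template that contains a second '1': the column index td[1]
# is rewritten together with the entry number.
collision_template = '//*[@id="topview-entry-1"]/td[1]/div'
print(fill_by_digit(collision_template, 3))

# With the placeholder, only the entry number changes for entries 1..30;
# the td[2] part of the XPath is never touched.
all_xpaths = [fill_by_placeholder(PLACEHOLDER_TEMPLATE, n) for n in range(1, 31)]
print(all_xpaths[2])
```

The placeholder keeps the substitution confined to one spot, so the template can contain any other digits without side effects.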
A sample snippet that works well:
driver.implicitly_wait(5)
driver.get('https://tools.wmflabs.org/topviews/?project=en.wikinews.org&platform=all-access&date=2016-01&excludes=Main_page')
id_of_first_article_name ='//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'
number_of_links_extracted = 30
for index in range(number_of_links_extracted):
    print((driver.find_element_by_xpath(id_of_first_article_name.replace('ART_ENTRY',str(index+1))).text).encode('utf-8'))
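Since the page fills the table with JavaScript after it loads, an explicit wait per entry is one alternative to `implicitly_wait`. The following is a minimal sketch, not a verified fix for this page. It assumes a Selenium version that provides `WebDriverWait`, `expected_conditions` and `By` (the question already imports all three) and a `driver` that has already loaded the page. `entry_xpath` and `read_entry_names` are hypothetical helper names, and the 20-second timeout is an arbitrary choice. Selenium's documentation advises against mixing implicit and explicit waits, so the sketch assumes no implicit wait is set.

```python
XPATH_TEMPLATE = '//*[@id="topview-entry-ART_ENTRY"]/td[2]/div'


def entry_xpath(entry_number):
    """Build the XPath for one table entry (numbering starts at 1)."""
    return XPATH_TEMPLATE.replace('ART_ENTRY', str(entry_number))


def read_entry_names(driver, count=30, timeout=20):
    """Wait for each entry to appear in the DOM, then return its text.

    `driver` is assumed to be an existing Selenium WebDriver that has
    already been pointed at the topviews page with driver.get(...).
    """
    # Imported here so that entry_xpath() can be used without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    names = []
    for entry_number in range(1, count + 1):
        locator = (By.XPATH, entry_xpath(entry_number))
        # until() polls until the element is present; if the timeout
        # expires first it raises TimeoutException.
        element = wait.until(EC.presence_of_element_located(locator))
        names.append(element.text)
    return names
```

With the `driver` from the snippet above, `read_entry_names(driver)` would be called after `driver.get(...)` and would return the text of the first 30 entries.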