I want to scrape a website with Selenium. It has 10 pages in total. My code is below, but why do I only get the results of the first page?
# -*- coding: utf-8 -*-
from selenium import webdriver
from scrapy.selector import Selector
MAX_PAGE_NUM = 10
MAX_PAGE_DIG = 3
driver = webdriver.Chrome('C:\Users\zhang\Downloads\chromedriver_win32\chromedriver.exe')
with open('results.csv', 'w') as f:
    f.write("Buyer, Price \n")
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    url = "https://www.oilandgasnewsworldwide.com/Directory1/DREQ/Drilling_Equipment_Suppliers_?page=" + page_num
    driver.get(url)
    names = sel.xpath('//*[@class="fontsubsection nomarginpadding lmargin opensans"]/text()').extract()
    Countries = sel.xpath('//td[text()="Country:"]/following-sibling::td/text()').extract()
    websites = sel.xpath('//td[text()="Website:"]/following-sibling::td/a/@href').extract()
driver.close()
print(len(names), len(Countries), len(websites))
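Answer 0 (score: 1)
First, I fetch the names, countries, and websites on each page with find_elements_by_xpath, which returns them as lists of elements. Then I extract the text from each element in those lists and append the values to new lists.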
from selenium import webdriver

MAX_PAGE_NUM = 10
driver = webdriver.Chrome('C:\\Users...\\chromedriver.exe')
names_list = list()
Countries_list = list()
websites_list = list()

# Loop over each of the 10 pages; range() excludes its end value, hence the + 1
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = str(i)
    url = "https://www.oilandgasnewsworldwide.com/Directory1/DREQ/Drilling_Equipment_Suppliers_?page=" + page_num
    driver.get(url)
    # Use "driver.find_elements..." (plural) instead of "driver.find_element..." to get all matches.
    # It returns a list of WebElements for the current page
    names = driver.find_elements_by_xpath("//*[@class='fontsubsection nomarginpadding lmargin opensans']")
    # To get the value of each WebElement, iterate over the list
    for name in names:
        # Add each value to the list that accumulates across pages
        names_list.append(name.text)
    Countries = driver.find_elements_by_xpath("//td[text()='Country:']/following-sibling::td")
    for country in Countries:
        Countries_list.append(country.text)
    websites = driver.find_elements_by_xpath("//td[text()='Website:']/following-sibling::td")
    for website in websites:
        websites_list.append(website.text)

print(names_list)
print(Countries_list)
print(websites_list)
driver.close()
I hope this works for you.
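The question opens results.csv, but the lists collected above are only printed. Here is a minimal sketch of writing them out with the standard csv module. The write_rows helper, the column names, and the sample values are assumptions made for illustration; they do not come from the original post.

```python
import csv
import io
from itertools import zip_longest


def write_rows(out_file, names, countries, websites):
    """Write one CSV line per listing; pad shorter lists with empty cells."""
    writer = csv.writer(out_file)
    writer.writerow(["Name", "Country", "Website"])  # assumed column names
    for row in zip_longest(names, countries, websites, fillvalue=""):
        writer.writerow(row)


# Demo with made-up values, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_rows(
    buffer,
    ["Acme Drilling", "Example Rigs"],
    ["UAE"],
    ["www.acme.example", "www.rigs.example"],
)
print(buffer.getvalue())
```

With the real data this would be called as write_rows(f, names_list, Countries_list, websites_list) inside with open('results.csv', 'w', newline='') as f. Note that the three lists are paired by position only, so if a listing has no country or website the columns may drift out of alignment.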
Alternative: get all the data contained in each <div class="border fontcontentdet"> section.
from selenium import webdriver

MAX_PAGE_NUM = 10
driver = webdriver.Chrome('C:\\Users\\LVARGAS\\AppData\\Local\\Programs\\Python\\Python36-32\\Scripts\\chromedriver.exe')
data_list = list()

# Loop over each of the 10 pages; range() excludes its end value, hence the + 1
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = str(i)
    url = "https://www.oilandgasnewsworldwide.com/Directory1/DREQ/Drilling_Equipment_Suppliers_?page=" + page_num
    driver.get(url)
    # Each matching element holds the full text of one listing
    rows = driver.find_elements_by_xpath("//*[@class='border fontcontentdet']")
    for row in rows:
        print(row.text)
        data_list.append(row.text)
        print('---')

driver.close()
print(data_list)
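Recent Selenium 4 releases no longer provide find_elements_by_xpath; the equivalent call there is driver.find_elements(By.XPATH, ...). The sketch below shows the same page loop in that style. The helper names collect_texts and scrape_pages are hypothetical, and the code takes the driver as a parameter so it can be tried with a stand-in object before pointing it at a real browser.

```python
# Same string value as selenium.webdriver.common.by.By.XPATH; defined here
# so the sketch does not need Selenium installed to be read or tested.
XPATH = "xpath"


def collect_texts(driver, xpath):
    """Return the text of every element matching `xpath` on the current page."""
    return [element.text for element in driver.find_elements(XPATH, xpath)]


def scrape_pages(driver, base_url, max_page_num, xpath):
    """Visit pages 1..max_page_num and gather the matching texts from all of them."""
    texts = []
    for page in range(1, max_page_num + 1):  # + 1 so the last page is included
        driver.get(base_url + str(page))
        texts.extend(collect_texts(driver, xpath))
    return texts
```

With a real browser you would create the driver with webdriver.Chrome(), pass it in together with the URL prefix and XPath from the code above, and call driver.quit() when finished. In real code, importing By from selenium.webdriver.common.by and passing By.XPATH is the usual spelling.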
Answer 1 (score: 0)
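My guess is that it has something to do with the odd thing you are doing in the page_num assignment. To debug, try adding this line after the call to driver.get(url):
print(driver.current_url)
If it prints the URL you expect, then the problem is most likely in your XPath.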
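The page_num expression from the question can also be checked without opening a browser. The sketch below rebuilds it in a hypothetical padded_page_num function and prints the URLs it generates. Whether the site treats zero-padded values such as 001 the same as 1 is not settled by the original post, so that part still needs checking against the live pages.

```python
MAX_PAGE_NUM = 10
MAX_PAGE_DIG = 3
BASE_URL = "https://www.oilandgasnewsworldwide.com/Directory1/DREQ/Drilling_Equipment_Suppliers_?page="


def padded_page_num(i, digits=MAX_PAGE_DIG):
    """Left-pad the page number with zeros, exactly as the question's code does."""
    return (digits - len(str(i))) * "0" + str(i)


# Build every URL the question's loop would request
urls = [BASE_URL + padded_page_num(i) for i in range(1, MAX_PAGE_NUM + 1)]
for url in urls:
    print(url)
```

The expression is equivalent to str(i).zfill(MAX_PAGE_DIG) for these page numbers. The answer above simply uses str(i), which requests ?page=1 rather than ?page=001.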