I'm building a script that visits a page and scrapes some data. The page URL is loaded from a Google Spreadsheet, and I want to repeat the script for the text in each cell of column A.
Column A contains a different URL in each row:
A1: https://www.bol.com/nl/p/m-line-athletic-pillow/9200000042954350/?suggestionType=typedsearch&bltgh=oOLF6wrL80g-ozfXiYFIZg.1.2.ProductImage
A2: https://www.bol.com/nl/p/apollo-bonell-matras-90x200-cm-medium/9200000046271731/?suggestionType=typedsearch&bltgh=i745aole4Xm4c6Gl23BM3w.1.2.ProductTitle
A3: and so on ...
The script only works on A1. How do I adapt it so that it repeats over all rows? Please help!
I was thinking of writing a for loop, but I couldn't get it to work.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import re
scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/Users/Jeffrey/Downloads/bolscraper.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Scraper")
results = sheet.sheet1                       # worksheet where scraped rows are appended
itemList = sheet.worksheet('LoadThisList')   # worksheet holding the URLs in column A
date = datetime.date.today().strftime("%d-%m-%Y")   # strftime already returns a str
def inject_scraping():
    browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
    browser.get(itemList.acell('A1').value)  # itemList (capital L); hard-coded cell, which is why only A1 runs
    time.sleep(1)
    # Set the quantity dropdown to "5", then to "meer" (more) to reveal the free-text quantity input.
    browser.find_element_by_xpath('//*[@id="quantityDropdown"]').send_keys('5')
    time.sleep(1)
    browser.find_element_by_xpath('//*[@id="quantityDropdown"]').send_keys('meer')
    time.sleep(1)
    browser.find_element_by_css_selector('.text-input--two-digits').click()
    time.sleep(0.5)
    browser.find_element_by_css_selector('.text-input--two-digits').send_keys('00')
    time.sleep(0.5)
    browser.find_element_by_link_text('OK').click()
    time.sleep(0.5)
    browser.find_element_by_partial_link_text("In winkelwagen").click()  # "Add to cart"
    time.sleep(2)
    page_source = browser.page_source
    browser.find_element_by_css_selector('.modal__window--close-hitarea').click()
    page_soup = BeautifulSoup(page_source, "html.parser")  # parse once; the earlier duplicate lxml parse was never used
    # Seller name: keep only letters, digits, whitespace and colons.
    seller = page_soup.select_one('div.buy-block__seller > div > a')
    sellertext = seller.findAll(text=True)
    sellername = str(sellertext)
    actualseller = re.sub(r"[^a-zA-Z0-9\s:]", "", sellername)
    # Basket count: strip everything that is not a digit.
    bucket = page_soup.select_one('#basket')
    bucketnumber = bucket.findAll(text=True)
    bucketDef = str(bucketnumber)
    bucketactual = re.sub(r"\D", "", bucketDef)
    # Product title: drop the brackets and quotes that str() puts around the text list.
    producttitle = page_soup.select_one('body > div.main > div > div.constrain.constrain--main.h-bottom--m > div.pdp-header.slot.slot--pdp-header.js_slot-title > h1 > span')
    producttitleText = producttitle.findAll(text=True)
    producttitleDef = str(producttitleText)
    actualproducttitle = re.sub(r"[\[\]\']", "", producttitleDef)
    # Product price: keep only the digits.
    productprice = page_soup.select_one('body > div.main > div.product_page_two-column > div.constrain.constrain--main.h-bottom--m > div.\[.fluid-grid.fluid-grid--rwd--l.\].new_productpage > div:nth-child(2) > div.slot.slot--buy-block.slot--seperated > div > wsp-visibility-switch > section > section > div > div > span')
    productpriceText = productprice.findAll(text=True)
    productpriceDef = str(productpriceText)
    actualprice = re.sub(r"\D", "", productpriceDef)
    newRow = [date, actualseller, bucketactual, actualprice, actualproducttitle]
    results.append_row(newRow)

inject_scraping()
Answer 0 (score: 0)
Your idea of using a for loop should do the job. Put all the cell names into a list, then iterate over that list as shown below.
Note: make sure you replace the hard-coded cell name 'A1' with 'cell' (as defined in the loop); otherwise the script will keep running on that single hard-coded cell. You may have missed this step earlier.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import re
scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/Users/Jeffrey/Downloads/bolscraper.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Scraper")
results = sheet.sheet1
itemList = sheet.worksheet('LoadThisList')
date = str(datetime.date.today().strftime("%d-%m-%Y"))
cells = ['A1', 'A2', 'A3']
def inject_scraping():
    for cell in cells:
        browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
        browser.get(itemList.acell(cell).value)  # 'cell' comes from the loop; itemList matches the variable defined above
        ## ... Rest of your scraper code ...
        browser.close()
As for extending this, you could write a similar function that populates the 'cells' list, so you don't have to hard-code the names as above; see the sketch below.
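For example, here is a minimal sketch assuming gspread's col_values(), which returns a column's values from top to bottom. It returns the URL values directly instead of cell names (saving one acell() lookup per row), and load_cells is an illustrative name, not part of any API:

def load_cells():
    urls = itemList.col_values(1)   # 1 = column A
    return [u for u in urls if u]   # skip any empty cells

def inject_scraping():
    for url in load_cells():
        browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
        browser.get(url)            # the URL is already in hand, no acell() needed
        ## ... Rest of your scraper code ...
        browser.close()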
Make sure you call browser.close() to shut down the web driver. Better still, define a class with setup() and teardown() methods in which you handle these steps.
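A rough sketch of that class-based idea; the class and method names are illustrative, not an established API:

class BolScraper:
    def setup(self):
        # one fresh browser per scrape, so a failure on one URL cannot leak into the next
        self.browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')

    def teardown(self):
        self.browser.close()        # always release the web driver

    def scrape(self, url):
        self.setup()
        try:
            self.browser.get(url)
            ## ... Rest of your scraper code, using self.browser ...
        finally:
            self.teardown()         # runs even when a selector lookup fails mid-scrape

# usage, reusing the hypothetical load_cells() helper from the sketch above:
for url in load_cells():
    BolScraper().scrape(url)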