How do I repeat a script for every row in a Google Spreadsheet?

Asked: 2019-09-11 08:56:56

Tags: python python-3.x selenium web-scraping spreadsheet

I am writing a script that visits a page and scrapes some data. The page's URL is loaded from a Google Spreadsheet. I want to repeat the script for the text in every cell of column A.

Column A contains one URL per row, each row a different URL:

A1: https://www.bol.com/nl/p/m-line-athletic-pillow/9200000042954350/?suggestionType=typedsearch&bltgh=oOLF6wrL80g-ozfXiYFIZg.1.2.ProductImage
A2: https://www.bol.com/nl/p/apollo-bonell-matras-90x200-cm-medium/9200000046271731/?suggestionType=typedsearch&bltgh=i745aole4Xm4c6Gl23BM3w.1.2.ProductTitle
A3: etc ...

The script only works on A1. How can I adapt it so that it repeats over all the rows? Please help!

I was thinking of writing a for loop, but I couldn't get it to work.

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import re

scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/Users/Jeffrey/Downloads/bolscraper.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Scraper")
results = sheet.sheet1
itemList = sheet.worksheet('LoadThisList')
date = str(datetime.date.today().strftime("%d-%m-%Y"))


def inject_scraping():
    browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
    # Load the URL stored in cell A1 of the 'LoadThisList' worksheet.
    browser.get(itemList.acell('A1').value)
    time.sleep(1)
    # Set the quantity to 500: pick '5', choose 'meer' (more),
    # then type '00' into the two-digit input and confirm with OK.
    browser.find_element_by_xpath('//*[@id="quantityDropdown"]').send_keys('5')
    time.sleep(1)
    browser.find_element_by_xpath('//*[@id="quantityDropdown"]').send_keys('meer')
    time.sleep(1)
    browser.find_element_by_css_selector('.text-input--two-digits').click()
    time.sleep(0.5)
    browser.find_element_by_css_selector('.text-input--two-digits').send_keys('00')
    time.sleep(0.5)
    browser.find_element_by_link_text('OK').click()
    time.sleep(0.5)
    # Add the item to the cart ("In winkelwagen") and wait for the modal.
    browser.find_element_by_partial_link_text("In winkelwagen").click()
    time.sleep(2)
    page_source = browser.page_source
    browser.find_element_by_css_selector('.modal__window--close-hitarea').click()
    page_soup = BeautifulSoup(page_source, "html.parser")
    # Seller name: keep only letters, digits, whitespace and colons.
    seller = page_soup.select_one('div.buy-block__seller > div > a')
    sellertext = seller.findAll(text=True)
    sellername = str(sellertext)
    actualseller = re.sub(r"[^a-zA-Z0-9\s:]", "", sellername)
    # Basket counter: keep digits only.
    bucket = page_soup.select_one('#basket')
    bucketnumber = bucket.findAll(text=True)
    bucketDef = str(bucketnumber)
    bucketactual = re.sub(r"\D", "", bucketDef)

    producttitle = page_soup.select_one('body > div.main > div > div.constrain.constrain--main.h-bottom--m > div.pdp-header.slot.slot--pdp-header.js_slot-title > h1 > span')
    producttitleText = producttitle.findAll(text=True)
    producttitleDef = str(producttitleText)
    actualproducttitle = re.sub(r"[\[\]\']", "", producttitleDef)
    productprice = page_soup.select_one('body > div.main > div.product_page_two-column > div.constrain.constrain--main.h-bottom--m > div.\[.fluid-grid.fluid-grid--rwd--l.\].new_productpage > div:nth-child(2) > div.slot.slot--buy-block.slot--seperated > div > wsp-visibility-switch > section > section > div > div > span')
    productpriceText = productprice.findAll(text=True)
    productpriceDef = str(productpriceText)
    actualprice = re.sub(r"\D", "", productpriceDef)

    # Append one result row to the first worksheet.
    newRow = [date, actualseller, bucketactual, actualprice, actualproducttitle]
    results.append_row(newRow)
inject_scraping()

1 Answer:

Answer 0 (score: 0)

Your idea of using a for loop should do the job. Put all the cell names into a list, then iterate over that list as shown below.

Note: make sure you replace the hard-coded cell name 'A1' with cell (as defined in the loop), otherwise it will keep running against that single hard-coded cell. That is probably the step you were missing earlier.

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime
import re

scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("/Users/Jeffrey/Downloads/bolscraper.json", scope)
client = gspread.authorize(creds)
sheet = client.open("Scraper")
results = sheet.sheet1
itemList = sheet.worksheet('LoadThisList')
date = str(datetime.date.today().strftime("%d-%m-%Y"))

cells = ['A1', 'A2', 'A3']

def inject_scraping():
    for cell in cells:
        browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
        browser.get(itemList.acell(cell).value)
        ## ... Rest of your scraper code ...
        browser.close()

To scale this up, you could write a similar function to populate the cells list, so that you don't have to hard-code the cell names as above.
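
For example, here is a minimal sketch (assuming gspread's standard col_values method, which returns every value in a worksheet column from top to bottom) that reads the URLs straight out of column A, so you loop over values instead of cell names:

# Read every non-empty value from column A of the 'LoadThisList' worksheet.
urls = [u for u in itemList.col_values(1) if u.strip()]

def inject_scraping():
    for url in urls:
        browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')
        browser.get(url)
        ## ... Rest of your scraper code ...
        browser.close()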

Make sure you call browser.close() to shut down the webdriver. Better still, define a class with setup() and teardown() methods in which you handle this.
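
A rough sketch of that class-based structure (the class and method names here are illustrative, not a fixed API):

class Scraper:
    def setup(self):
        # Start a single browser instance that is reused for every URL.
        self.browser = webdriver.Chrome('/Users/Jeffrey/Downloads/chromedriver')

    def teardown(self):
        # quit() shuts down both the browser and the chromedriver process.
        self.browser.quit()

    def scrape(self, url):
        self.browser.get(url)
        ## ... Rest of your scraper code ...

    def run(self, urls):
        self.setup()
        try:
            for url in urls:
                self.scrape(url)
        finally:
            # Close the webdriver even if scraping one of the URLs fails.
            self.teardown()

This way the browser is guaranteed to be cleaned up no matter what happens inside the loop.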