Question

Hi there. I'm building a simple scraping tool. Here is the code I have.

from bs4 import BeautifulSoup
import requests
from lxml import html
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime

scope = ['https://spreadsheets.google.com/feeds']

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    'Programming 4 Marketers-File-goes-here.json', scope)

site = 'http://nathanbarry.com/authority/'
hdr = {'User-Agent':'Mozilla/5.0'}
req = requests.get(site, headers=hdr)

soup = BeautifulSoup(req.content)

def getFullPrice(soup):
    divs = soup.find_all('div', id='complete-package')
    price = ""
    for i in divs:
        price = i.a
    completePrice = (str(price).split('$',1)[1]).split('<', 1)[0]
    return completePrice


def getVideoPrice(soup):
    divs = soup.find_all('div', id='video-package')
    price = ""
    for i in divs:
        price = i.a
    videoPrice = (str(price).split('$',1)[1]).split('<', 1)[0]
    return videoPrice

fullPrice = getFullPrice(soup)
videoPrice = getVideoPrice(soup)
date = datetime.date.today()

gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1

row = len(wks.col_values(1))+1

wks.update_cell(row, 1, date)
wks.update_cell(row, 2, fullPrice)
wks.update_cell(row, 3, videoPrice)

This script runs on my local machine. However, when I deploy it to Heroku as part of an app and try to run it, I get the following error:

Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 219, in put_feed
    r = self.session.put(url, data, headers=headers)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 82, in put
    return self.request('PUT', url, params=params, data=data, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 69, in request
    response.status_code, response.content))
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "AuthorityScraper.py", line 44, in <module>
    wks.update_cell(row, 1, date)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/models.py", line 517, in update_cell
    self.client.put_feed(uri, ElementTree.tostring(feed))
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 221, in put_feed
    if ex[0] == 403:
TypeError: 'RequestError' object does not support indexing

What do you think might be causing this error? Do you have any suggestions for how I can fix it?

Answer 1

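There are a few things going on:

1) The Google Sheets API returned an error, "Invalid query parameter value for cell_id":

gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")

2) A bug in gspread caused a second exception once that error was received:

TypeError: 'RequestError' object does not support indexing

Python 3 removed __getitem__ from BaseException, and gspread's error handling here relies on it (the ex[0] == 403 check in put_feed). This does not matter much, because it would have raised an UpdateCellError exception anyway.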
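The indexing failure can be reproduced without gspread or a network connection. The following is a minimal sketch: RequestError here is a hypothetical stand-in class, not gspread's own, and it is assumed to be constructed with a status code and a message, as the traceback suggests.

```python
class RequestError(Exception):
    """Hypothetical stand-in for gspread.exceptions.RequestError."""


def status_by_index(ex):
    # Python 2 style: index the exception object directly.
    # On Python 3 this raises TypeError.
    return ex[0]


def status_by_args(ex):
    # Works on Python 2 and 3: constructor arguments are kept in ex.args.
    return ex.args[0]


err = RequestError(400, "400: Invalid query parameter value for cell_id.")

try:
    status_by_index(err)
    indexing_failed = False
except TypeError:
    indexing_failed = True

print("indexing failed:", indexing_failed)
print("status via args:", status_by_args(err))
```

Reading the status through ex.args is the portable form, which is why the TypeError hides the real problem (the 400 response) rather than causing it.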

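My guess is that you are passing an invalid row number to update_cell. It would help to add some debug logging to your script that shows, for example, which row it is trying to update.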
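One way to add that logging is sketched below. The helper names (next_row, logged_update) and the logger name are made up for illustration; the only gspread call assumed is update_cell(row, col, value), as used in the question.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("authority_scraper")  # hypothetical logger name


def next_row(first_column_values):
    """Return the 1-based row index just past the values in column 1."""
    return len(first_column_values) + 1


def logged_update(worksheet, row, col, value):
    # Log exactly what is about to be sent, including the value's type,
    # so the Heroku logs show which cell the failing request targeted.
    log.info("update_cell(row=%r, col=%r, value=%r [%s])",
             row, col, value, type(value).__name__)
    worksheet.update_cell(row, col, value)
```

In the question's script this would be used as row = next_row(wks.col_values(1)) followed by logged_update(wks, row, 1, date). The logged row can then be compared with the actual size of the sheet, and the local and Heroku runs can be compared with each other.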

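It may be better to start with a worksheet that has zero rows and then use append_row. However, there does appear to be an open issue in gspread involving append_row, and it may actually be the same problem you are running into.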
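A rough sketch of the append_row approach follows. The helpers build_row and append_prices are hypothetical names, the sample prices are made up, and the worksheet argument is assumed to be a gspread worksheet (or anything else with an append_row method that accepts a list of values).

```python
import datetime


def build_row(day, full_price, video_price):
    """Return one spreadsheet row as plain strings."""
    # Convert the date explicitly instead of passing a datetime.date object.
    return [day.isoformat(), str(full_price), str(video_price)]


def append_prices(worksheet, day, full_price, video_price):
    # No row index is computed here; append_row picks the target row itself.
    worksheet.append_row(build_row(day, full_price, video_price))


# Made-up sample values, only to show the shape of the row.
print(build_row(datetime.date(2018, 1, 15), "39", "99"))
```

With the question's variables the call would be append_prices(wks, datetime.date.today(), fullPrice, videoPrice), which replaces the row calculation and the three update_cell calls.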

Answer 2

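I ran into the same problem. BS4 worked fine on my local machine. However, for some reason it was too slow on the Heroku server, which caused the error.

I switched to lxml and it works now.

Install it with:

pip install lxml

Here is a sample snippet: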

from lxml import html
import requests

getpage = requests.get("https://url_here")
gethtmlcontent = html.fromstring(getpage.content)
data = gethtmlcontent.xpath('//div[@class = "class-name"]/text()')
# Sample query: fetches the text of the placeholder div above
n = 5  # hypothetical count; set this as per your requirement
data = data[:n]

# Now inject the data into the Django template.

Why does this Python script work on my local machine but not on Heroku?
