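Below is my code, which uses Beautiful Soup to scrape a website. The code runs fine on Windows but has problems on Ubuntu: there it sometimes runs and sometimes fails.
The error is as follows: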
Traceback (most recent call last):
  File "Craftsvilla.py", line 22, in <module>
    source = requests.get(new_url)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.craftsvilla.com', port=80): Max retries exceeded with url: /shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f6685fc3310>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Here is my code:
import requests
import lxml
from bs4 import BeautifulSoup
import xlrd
import xlwt

file_location = "/home/nitink/Python Linux/BeautifulSoup/Craftsvilla/Craftsvilla.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
products = []
for r in range(sheet.nrows):
    products.append(sheet.cell_value(r, 0))

book = xlwt.Workbook(encoding="utf-8", style_compression=0)
sheet = book.add_sheet("Sheet11", cell_overwrite_ok=True)
for index, url in enumerate(products):
    new_url = "http://www." + url
    source = requests.get(new_url)
    data = source.content
    soup = BeautifulSoup(data, "lxml")
    sheet.write(index, 0, url)
    try:
        Product_Name = soup.select(".product-title")[0].text.strip()
        sheet.write(index, 1, Product_Name)
    except Exception:
        sheet.write(index, 1, "")
book.save("Craftsvilla Output.xls")
Save the following links as Craftsvilla.xlsx:
craftsvilla.com/shop/01-princess-ayesha-cotton-salwar-suit-for-rudra-house/5601472
craftsvilla.com/shop/3031-pista-prachi/3715170
craftsvilla.com/shop/795-peach-colored-stright-salwar-suit/5608295
craftsvilla.com/catalog/product/view/id/5083511/s/dharm-fashion-villa-embroidery-navy-blue-slawar-suit-gown
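Note: the code may run for you at first, but if you keep trying for a while, the same code gives the error. I don't know why. The same code never gives any error on Windows.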
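Answer 0 (score: 2)
It looks like you are hitting the website too often and the server is refusing your requests. Be a good web-scraping citizen and add a time delay between subsequent requests: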
import time

for index, url in enumerate(products):
    new_url = "http://www." + url
    source = requests.get(new_url)
    data = source.content
    soup = BeautifulSoup(data, "lxml")
    # ...
    time.sleep(1)  # one second delay
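If the failures are intermittent, a fixed delay can be combined with a retry, so that one failed connection does not abort the whole run. The following is only a minimal sketch under that assumption: call_with_retry and its parameters are hypothetical names made up for illustration, not part of requests or of the question's code.

```python
import time


def call_with_retry(func, exceptions, retries=3, delay=1.0, sleep=time.sleep):
    """Call func(); if it raises one of `exceptions`, wait and try again.

    Hypothetical helper: makes at most `retries` + 1 attempts and doubles
    the wait after every failed attempt (delay, 2 * delay, 4 * delay, ...).
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            sleep(delay * (2 ** attempt))
```

Inside the loop above it could be used as source = call_with_retry(lambda: requests.get(new_url), (requests.exceptions.ConnectionError,)). The sleep parameter is there only so that the waiting can be replaced in a test.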