I can run this code, but if I add too many URLs to the aliveSlice file it errors out. It handles around 3,000 at a time, but my list has 30,000.
I don't know what is going wrong; this is the first program I have written in Python.
Example URLs:
https://slicelife.com/restaurants/fl/orlando/32819/triple-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32819/uno-pizzeria-grill-8250-international-dr-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32820/papa-gio-s-pizza/menu
https://slicelife.com/restaurants/fl/orlando/32821/famas-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32822/broadway-ristorante-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32822/digino-s-new-york-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32822/giovannis/menu
https://slicelife.com/restaurants/fl/orlando/32822/i-love-ny-pizza-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32822/mario-s-pizza-subs/menu
https://slicelife.com/restaurants/fl/orlando/32822/muzzarella-pizza-italian-kitchen/menu
https://slicelife.com/restaurants/fl/orlando/32822/napolli-italian-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32824/mama-romano-s-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32825/ferrara-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32825/giovanni-italian-restaurant-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32825/italian-village-pizza-orlando/menu
Program:
# Open a terminal and enter the following commands, making sure the
# command environment is in the directory this file is located in.
# Build the virtual environment:
#   python3 -m venv tutorial-env
# Activate it (Windows):
#   tutorial-env\Scripts\activate.bat
from bs4 import BeautifulSoup
import requests
import json
import csv
import sys

pizzaArray = []
with open('aliveSlice.csv') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        req = requests.get(url[-1])
        content = BeautifulSoup(req.content, "html.parser")
        for pizzeria in content.findAll('div', attrs={"class": "f19xeu2d"}):
            name = pizzeria.find('h1', attrs={"class": "f13p7rsj"})
            address = pizzeria.find('address', attrs={"class": "f1lfckhr"})
            phone = pizzeria.find('button', attrs={"class": "f12gt8lx"})
            if name and address and phone:
                pizzeriaObject = {
                    "pizzeriaName": name.text,
                    "address": address.text,
                    "phoneNumber": phone.text,
                }
                pizzaArray.append(pizzeriaObject)
            else:
                print(f"Missing data - {url}")
with open('pizzeriaData.json', 'w', encoding='utf-8') as outfile:
    json.dump(pizzaArray, outfile)
Output:
[{"pizzeriaName": "Pizzitalia's NY Pizzeria & Italian Restaurant", "address": "6742 Memorial Hwy, Tampa, 33 33515", "phoneNumber": "813-694-1455"}, {"pizzeriaName": "Cecy's Pizza", "address": "1172 Gambell St, Anchorage, AK 99501", "phoneNumber": "907-770-7877"}, {"pizzeriaName": "Fat Ptarmigan", "address": "441 W 5th Ave, Anchorage, AK 99501", "phoneNumber": "907-312-2426"}, {"pizzeriaName": "Glacier Brewhouse", "address": "737 W 5th Ave #110, Anchorage, AK 99501", "phoneNumber": "901-614-1437"}, {"pizzeriaName": "Ski & Benny Pizza", "address": "820 Bilbo St, Anchorage, AK 99501", "phoneNumber": "907-312-1161"}, {"pizzeriaName": "Marco T's Pizzeria", "address": "302 W Fireweed Ln, Anchorage, AK 99503", "phoneNumber": "901-567-8567"}, {"pizzeriaName": "Moose's Tooth Pub & Pizzeria", "address": "3300 Old Seward Hw, Anchorage, AK 99503", "phoneNumber": "901-808-8094"}, {"pizzeriaName": "Palermo Pizza & Philly's", "address": "6406 Debarr Rd, Anchorage, AK 99504", "phoneNumber": "907-334-3354"}, {"pizzeriaName": "Sicily's Pizza", "address": "171 Muldoon Rd #106, Anchorage, AK 99504", "phoneNumber": "906-224-2029"}, {"pizzeriaName": "Sicily's Pizza - Northern Lights ", "address": "2210 E Northern Lights Blvd, Anchorage, AK 99508", "phoneNumber": "901-350-5126"}, {"pizzeriaName": "49th State Brewing Co - Anchorage", "address": "717 W 3rd Ave, Anchorage, AK 99514", "phoneNumber": "901-641-1874"}, {"pizzeriaName": "Sicily's Pizza - East Diamond", "address": "1201 E Dimond Blvd, Anchorage",
The program stops responding and has to be force-closed.
Answer 0 (score: 0)
The sizes you describe in the question are relatively small, so I don't know whether they are really the cause, but if the problem goes away this way, the code below may be worth trying. You can try the two modifications independently.
The first point is to read the URL file line by line. If it contains only URLs, there is also no need to process it as a CSV file.
with open('aliveSlice.csv', 'rt') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        req = requests.get(url)
        # here you would do your BeautifulSoup processing
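Another possible cause at this scale (an assumption, not something stated in the question) is that a single slow or failed request will hang or abort the whole run: with 30,000 URLs, at least one bad response is almost guaranteed. A minimal sketch of a guarded fetch; the `timeout` value and the skip-on-error policy are my choices, not part of the original code:

```python
import requests

def fetch(url):
    """Fetch one URL, returning None instead of raising on request errors."""
    try:
        # A timeout keeps one unresponsive server from stalling the entire run.
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
        return resp.content
    except requests.RequestException as exc:
        # Log and skip rather than crashing after thousands of good URLs.
        print(f"Skipping {url}: {exc}")
        return None
```

In the loop above you would then write `content_bytes = fetch(url)` and `continue` when it returns None.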
The second point is that json.dump may have trouble with a large collection of data. You can try to make its life easier by slicing the collection into small pieces and processing it document by document:
with open('pizzeriaData.json', 'wt', encoding='utf-8') as f:
    f.write('[')
    sep = ''
    for doc in json_docs:
        f.write(sep)
        f.write(json.dumps(doc))
        sep = ', '
    f.write(']')
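A simpler variant of the same idea, if you don't strictly need one JSON array, is the JSON Lines layout: one document per line, no separator bookkeeping, and both writing and reading stay incremental. This is a sketch under that assumption (the helper names are mine), not the only way to do it:

```python
import json

def write_jsonl(path, docs):
    """Write one JSON document per line (JSON Lines); memory use stays
    constant because no single large array is ever built."""
    with open(path, 'w', encoding='utf-8') as f:
        for doc in docs:
            f.write(json.dumps(doc))
            f.write('\n')

def read_jsonl(path):
    """Read a JSON Lines file back into a list, one document per line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because `write_jsonl` only needs an iterable, you can pass it a generator that scrapes and yields one pizzeria at a time, so the scrape and the write stream together.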