My code fails if I add too many URLs to the CSV for processing. How can I improve it?

Asked: 2019-07-05 23:44:26

Tags: python

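I can run this code, but if I add too many URLs to the file aliveSlice, it errors out. I can process about 3,000 URLs at a time, but my list has 30,000.

I don't know what is wrong; this is the first program I have written in Python.

Example URLs: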

https://slicelife.com/restaurants/fl/orlando/32819/triple-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32819/uno-pizzeria-grill-8250-international-dr-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32820/papa-gio-s-pizza/menu
https://slicelife.com/restaurants/fl/orlando/32821/famas-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32822/broadway-ristorante-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32822/digino-s-new-york-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32822/giovannis/menu
https://slicelife.com/restaurants/fl/orlando/32822/i-love-ny-pizza-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32822/mario-s-pizza-subs/menu
https://slicelife.com/restaurants/fl/orlando/32822/muzzarella-pizza-italian-kitchen/menu
https://slicelife.com/restaurants/fl/orlando/32822/napolli-italian-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32824/mama-romano-s-orlando/menu
https://slicelife.com/restaurants/fl/orlando/32825/ferrara-pizza-pasta/menu
https://slicelife.com/restaurants/fl/orlando/32825/giovanni-italian-restaurant-pizzeria/menu
https://slicelife.com/restaurants/fl/orlando/32825/italian-village-pizza-orlando/menu

Program


# Open a terminal and input following commands, ensuring that you are in directory that this file is located in within the command environment.
# build the virtual environment:

# python3 -m venv tutorial-env

# activate it

# tutorial-env\Scripts\activate.bat



from bs4 import BeautifulSoup
import requests
import json
import csv
import sys

pizzaArray = []

with open('aliveSlice.csv') as csvf: # Open file in read mode
    urls = csv.reader(csvf)

    for url in urls:
        req = requests.get(url[-1])
        content = BeautifulSoup(req.content, "html.parser")

        for pizzeria in content.findAll('div', attrs={"class": "f19xeu2d"}):
            name = pizzeria.find('h1', attrs={"class": "f13p7rsj"})
            address = pizzeria.find('address', attrs={"class": "f1lfckhr"})
            phone = pizzeria.find('button', attrs={"class": "f12gt8lx"})

            if name and address and phone:
                pizzeriaObject = {
                    "pizzeriaName": name.text,
                    "address": address.text,
                    "phoneNumber": phone.text,
                }

                pizzaArray.append(pizzeriaObject)
            else:
                print(f"Missing data - {url}")


with open('pizzeriaData.json', 'w', encoding='utf-8') as outfile:
    json.dump(pizzaArray, outfile)


Output


[{"pizzeriaName": "Pizzitalia's NY Pizzeria & Italian Restaurant", "address": "6742 Memorial Hwy, Tampa, 33 33515", "phoneNumber": "813-694-1455"}, {"pizzeriaName": "Cecy's Pizza", "address": "1172 Gambell St, Anchorage, AK 99501", "phoneNumber": "907-770-7877"}, {"pizzeriaName": "Fat Ptarmigan", "address": "441 W 5th Ave, Anchorage, AK 99501", "phoneNumber": "907-312-2426"}, {"pizzeriaName": "Glacier Brewhouse", "address": "737 W 5th Ave #110, Anchorage, AK 99501", "phoneNumber": "901-614-1437"}, {"pizzeriaName": "Ski & Benny Pizza", "address": "820 Bilbo St, Anchorage, AK 99501", "phoneNumber": "907-312-1161"}, {"pizzeriaName": "Marco T's Pizzeria", "address": "302 W Fireweed Ln, Anchorage, AK 99503", "phoneNumber": "901-567-8567"}, {"pizzeriaName": "Moose's Tooth Pub & Pizzeria", "address": "3300 Old Seward Hw, Anchorage, AK 99503", "phoneNumber": "901-808-8094"}, {"pizzeriaName": "Palermo Pizza & Philly's", "address": "6406 Debarr Rd, Anchorage, AK 99504", "phoneNumber": "907-334-3354"}, {"pizzeriaName": "Sicily's Pizza", "address": "171 Muldoon Rd #106, Anchorage, AK 99504", "phoneNumber": "906-224-2029"}, {"pizzeriaName": "Sicily's Pizza - Northern Lights ", "address": "2210 E Northern Lights Blvd, Anchorage, AK 99508", "phoneNumber": "901-350-5126"}, {"pizzeriaName": "49th State Brewing Co - Anchorage", "address": "717 W 3rd Ave, Anchorage, AK 99514", "phoneNumber": "901-641-1874"}, {"pizzeriaName": "Sicily's Pizza - East Diamond", "address": "1201 E Dimond Blvd, Anchorage",

N/A; the program was force-closed.

1 answer:

Answer 0 (score: 0)

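The sizes you mention in your question are relatively small, so I am not sure they are really the cause, but the code below may be worth a try to see whether the problem goes away. You can try the two modifications independently.

The first point is to read the URL file line by line. If it contains only URLs, there is also no need to process it as a CSV file.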

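with open('aliveSlice.csv', 'rt') as f:
    for line in f:  # iterate lazily, one line at a time
        url = line.strip()  # drop the trailing newline
        if not url:
            continue  # skip blank lines
        req = requests.get(url)
        # here you would do your BeautifulSoup processing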
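
The line-by-line idea can be sketched a little more fully as follows. This is only a sketch under assumptions: the file holds one URL per line, and `iter_urls`, `scrape_all`, and the `fetch` parameter are hypothetical names that do not appear in the question. The loop also records failures instead of stopping, on the guess (not confirmed by the question) that a single failed download may be what aborts a long run.

```python
def iter_urls(path):
    """Yield URLs one at a time from a text file with one URL per line."""
    with open(path, "rt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()  # remove the newline and surrounding spaces
            if url:  # skip blank lines
                yield url


def scrape_all(urls, fetch):
    """Call fetch(url) for every URL, keeping results and failures apart.

    `fetch` is any callable supplied by the caller (hypothetical here); it
    should return the scraped data for one URL or raise on failure.
    """
    results, failed = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # e.g. a timeout or a connection error
            failed.append((url, str(exc)))
    return results, failed
```

With the code from the question, `fetch` could be a small function that calls `requests.get(url, timeout=10)` and then does the BeautifulSoup parsing. Passing `timeout` is worth considering because `requests` does not time out on its own unless one is given, so one unresponsive server can stall the whole run.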

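The second point is that json.dump may have trouble when it is given a large amount of data. You can try to make its life easier by splitting the collection into pieces and writing it out document by document: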

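# json_docs is the list of collected documents (pizzaArray in the question)
with open('pizzeriaData.json', 'wt', encoding='utf-8') as f:
    f.write('[')
    sep = ''  # no separator before the first document
    for doc in json_docs:
        f.write(sep)
        f.write(json.dumps(doc))
        sep = ', '
    f.write(']')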
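
The same document-by-document writing can be wrapped in a small function, shown here as a sketch. The name `write_json_array` is hypothetical. Because it accepts any iterable, it should also work with a generator, so the full list would not have to be held in memory before writing.

```python
import json


def write_json_array(docs, path):
    """Write an iterable of JSON-serialisable documents as one JSON array."""
    with open(path, "wt", encoding="utf-8") as f:
        f.write("[")
        sep = ""  # no separator before the first document
        for doc in docs:
            f.write(sep)
            f.write(json.dumps(doc))  # serialise one document at a time
            sep = ", "
        f.write("]")
```

For the code in the question, the call would be `write_json_array(pizzaArray, 'pizzeriaData.json')`. The resulting file is still a single JSON array, so it can be read back with `json.load` as before.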