I'm practicing my web-scraping skills on the following site: "http://web.californiacraftbeer.com/Brewery-Member"
My code so far is below. It looks like I'm getting the correct number of companies, but I'm getting duplicate rows in the CSV file, which I think happens whenever a company is missing some information. In several parts of my code I try to detect missing information and replace it with the text "N/A", but it isn't working. I suspect the problem may be related to the zip() function, but I'm not sure how to fix it.
Any help is greatly appreciated!
"""
Grabs brewery name, contact person, phone number, website address, and email address
for each brewery listed on the website.
"""
import requests, csv
from bs4 import BeautifulSoup
url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
each_company = soup.find_all("div", {"class": "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER"})
error_msg = "N/A"
def scraper():
    """Grabs information and writes to CSV"""
    print("Running...")
    results = []
    count = 0
    for info in each_company:
        try:
            company_name = info.find_all("span", itemprop="name")
        except Exception as e:
            company_name = "N/A"
        try:
            contact_name = info.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
        except Exception as e:
            contact_name = "N/A"
        try:
            phone_number = info.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
        except Exception as e:
            phone_number = "N/A"
        try:
            website = info.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
        except Exception as e:
            website = "N/A"
        for company, contact, phone, site in zip(company_name, contact_name, phone_number, website):
            count += 1
            print("Grabbing {0} ({1})...".format(company.text, count))
            newrow = []
            try:
                newrow.append(company.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(contact.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(phone.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(site.find('a')['href'])
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append("info@" + company.text.replace(" ", "").lower() + ".com")
            except Exception as e:
                newrow.append(error_msg)
            results.append(newrow)
    print("Done")
    outFile = open("brewery.csv", "w")
    out = csv.writer(outFile, delimiter=',', quoting=csv.QUOTE_ALL, lineterminator='\n')
    out.writerows(results)
    outFile.close()

def main():
    """Runs web scraper"""
    scraper()

if __name__ == '__main__':
    main()
Answer (score: 1)
From the bs4 docs:
"If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None."
So when, for example, company_name = info.find_all("span", itemprop="name")
runs and matches nothing, it does not raise an exception, and "N/A"
is never assigned.
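A minimal sketch illustrating that behavior (the HTML fragment here is made up for the demo):

```python
from bs4 import BeautifulSoup

# Parse a fragment that contains no <span> elements at all.
soup = BeautifulSoup("<div class='x'></div>", "html.parser")

# find_all() quietly returns an empty list; find() returns None.
# Neither raises an exception, so a try/except around them never fires.
print(soup.find_all("span"))  # -> []
print(soup.find("span"))      # -> None
```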
Instead, you need to check whether company_name
is an empty list:
if not company_name:
    company_name = "N/A"
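To show how that check can be applied per field, here is a hedged sketch: text_or_na is a hypothetical helper name, and the HTML fragment is invented for the demo (the real page uses the class names from the question's code). Extracting one row per company container this way also sidesteps the zip() misalignment, since zip truncates to the shortest of its input lists when a field is missing:

```python
from bs4 import BeautifulSoup

def text_or_na(matches):
    """Return the stripped text of the first match, or "N/A" if find_all matched nothing."""
    return matches[0].get_text(strip=True) if matches else "N/A"

# Invented fragment: the second listing is missing its name element.
html = """
<div class="listing"><span itemprop="name">Foo Brewing</span></div>
<div class="listing"></div>
"""
soup = BeautifulSoup(html, "html.parser")
for block in soup.find_all("div", {"class": "listing"}):
    print(text_or_na(block.find_all("span", itemprop="name")))
# Prints "Foo Brewing", then "N/A" for the listing with no name
```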