BeautifulSoup: filling in missing information with "N/A" is not working

Asked: 2017-02-16 07:39:32

Tags: python csv beautifulsoup

I am practicing my web-scraping skills on the following site: "http://web.californiacraftbeer.com/Brewery-Member"

The code I have so far is below. It looks like I am getting the right number of companies, but I am getting duplicate rows in the CSV file, which I think happens whenever a company is missing some information. In several parts of my code I try to detect missing information and replace it with the text "N/A", but it does not work. I suspect the problem may be related to the zip() function, but I am not sure how to fix it.

Any help is greatly appreciated!

"""
Grabs brewery name, contact person, phone number, website address, and email address 
for each brewery listed on the website.
"""

import requests, csv
from bs4 import BeautifulSoup

url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
each_company = soup.find_all("div", {"class": "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER"})
error_msg = "N/A" 

def scraper():
    """Grabs information and writes to CSV"""
    print("Running...")
    results = []
    count = 0

    for info in each_company:
        try:
            company_name = info.find_all("span", itemprop="name")
        except Exception as e:
            company_name = "N/A"
        try:
            contact_name = info.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
        except Exception as e:
            contact_name = "N/A"
        try:
            phone_number = info.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
        except Exception as e:
            phone_number = "N/A"
        try:
            website = info.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
        except Exception as e:
            website = "N/A"

        for company, contact, phone, site in zip(company_name, contact_name, phone_number, website):
            count += 1
            print("Grabbing {0} ({1})...".format(company.text, count))
            newrow = []
            try:
                newrow.append(company.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(contact.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(phone.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(site.find('a')['href'])
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append("info@" + company.text.replace(" ", "").lower() + ".com")
            except Exception as e:
                newrow.append(error_msg)
        results.append(newrow)

    print("Done")
    outFile = open("brewery.csv", "w")
    out = csv.writer(outFile, delimiter=',',quoting=csv.QUOTE_ALL, lineterminator='\n')
    out.writerows(results)
    outFile.close()

def main():
    """Runs web scraper"""
    scraper()

if __name__ == '__main__':
    main()

1 Answer:

Answer 0 (score: 1)

From the bs4 docs:

"If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None."

So, for example, when company_name = info.find_all("span", itemprop="name") runs and matches nothing, it does not throw an exception, and "N/A" is never assigned.
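For instance, a quick check (a minimal sketch; the attribute value below is made up just to force a non-match) shows that find_all() simply hands back an empty list instead of raising, so the except branch never runs:

missing = soup.find_all("span", itemprop="no-such-attribute-value")  # hypothetical, matches nothing
print(missing)        # [] -- an empty list, no exception raised
print(bool(missing))  # False, which is what you can test for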

In this case, you need to check whether company_name is an empty list:

if not company_name:
    company_name = "N/A"
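Applied to the other fields, the same empty-result check can replace every try/except around find_all(). The sketch below is one way to do it, not part of the answer above: it assumes each "ListingResults_Level3_CONTAINER" div describes a single brewery, uses find()/select_one() instead of find_all() plus zip(), and introduces a hypothetical helper text_or_na():

def text_or_na(parent, name, **attrs):
    """Return the stripped text of the first matching tag, or "N/A" when nothing matches."""
    tag = parent.find(name, **attrs)
    return tag.get_text(strip=True) if tag else "N/A"

results = []
for info in each_company:
    company = text_or_na(info, "span", itemprop="name")
    contact = text_or_na(info, "div", class_="ListingResults_Level3_MAINCONTACT")
    phone = text_or_na(info, "div", class_="ListingResults_Level3_PHONE1")
    # The website address lives in an <a> inside the VISITSITE span, so take its href if present.
    site_link = info.select_one("span.ListingResults_Level3_VISITSITE a")
    website = site_link["href"] if site_link and site_link.has_attr("href") else "N/A"
    results.append([company, contact, phone, website])

If a single container can hold several listings, you would instead keep find_all() and pad the shorter lists (for example with itertools.zip_longest(..., fillvalue=None)) so missing fields do not silently drop or misalign rows.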