I'm practicing my web-scraping skills on the following site: "http://web.californiacraftbeer.com/Brewery-Member"
My code so far is below. It looks like I'm getting the correct number of companies, but I'm getting duplicate rows in the CSV file, which I think happens whenever a company is missing some information. In several parts of my code I try to detect missing information and replace it with the text "N/A", but it isn't working. I suspect the problem may be related to the zip() function, but I'm not sure how to fix it.
Any help is greatly appreciated!
"""
Grabs brewery name, contact person, phone number, website address, and email address
for each brewery listed on the website.
"""
import requests, csv
from bs4 import BeautifulSoup
url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
each_company = soup.find_all("div", {"class": "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER"})
error_msg = "N/A"
def scraper():
    """Grabs information and writes to CSV"""
    print("Running...")
    results = []
    count = 0
    for info in each_company:
        try:
            company_name = info.find_all("span", itemprop="name")
        except Exception as e:
            company_name = "N/A"
        try:
            contact_name = info.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
        except Exception as e:
            contact_name = "N/A"
        try:
            phone_number = info.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
        except Exception as e:
            phone_number = "N/A"
        try:
            website = info.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})
        except Exception as e:
            website = "N/A"
        for company, contact, phone, site in zip(company_name, contact_name, phone_number, website):
            count += 1
            print("Grabbing {0} ({1})...".format(company.text, count))
            newrow = []
            try:
                newrow.append(company.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(contact.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(phone.text)
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append(site.find('a')['href'])
            except Exception as e:
                newrow.append(error_msg)
            try:
                newrow.append("info@" + company.text.replace(" ", "").lower() + ".com")
            except Exception as e:
                newrow.append(error_msg)
            results.append(newrow)
    print("Done")
    outFile = open("brewery.csv", "w")
    out = csv.writer(outFile, delimiter=',', quoting=csv.QUOTE_ALL, lineterminator='\n')
    out.writerows(results)
    outFile.close()

def main():
    """Runs web scraper"""
    scraper()

if __name__ == '__main__':
    main()
Answer (score: 1)
From the bs4 docs:
"If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns None."
So when, for example, company_name = info.find_all("span", itemprop="name")
runs and matches nothing, it does not raise an exception, and "N/A"
is never assigned.
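A minimal sketch illustrating that behavior (the HTML fragment here is made up for the demo):

```python
from bs4 import BeautifulSoup

# Parse a fragment that contains no <span> elements at all.
soup = BeautifulSoup("<div class='x'></div>", "html.parser")

# find_all() quietly returns an empty list; find() returns None.
# Neither raises an exception, so a try/except around them never fires.
print(soup.find_all("span"))  # -> []
print(soup.find("span"))      # -> None
```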
Instead, you need to check whether company_name
is an empty list:
if not company_name:
    company_name = "N/A"
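To show how that check can be applied per field, here is a hedged sketch: text_or_na is a hypothetical helper name, and the HTML fragment is invented for the demo (the real page uses the class names from the question's code). Extracting one row per company container this way also sidesteps the zip() misalignment, since zip truncates to the shortest of its input lists when a field is missing:

```python
from bs4 import BeautifulSoup

def text_or_na(matches):
    """Return the stripped text of the first match, or "N/A" if find_all matched nothing."""
    return matches[0].get_text(strip=True) if matches else "N/A"

# Invented fragment: the second listing is missing its name element.
html = """
<div class="listing"><span itemprop="name">Foo Brewing</span></div>
<div class="listing"></div>
"""
soup = BeautifulSoup(html, "html.parser")
for block in soup.find_all("div", {"class": "listing"}):
    print(text_or_na(block.find_all("span", itemprop="name")))
# Prints "Foo Brewing", then "N/A" for the listing with no name
```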