BeautifulSoup doesn't scrape all the data

Date: 2017-09-14 10:34:35

Tags: web-scraping beautifulsoup python-3.6

I'm trying to scrape a website, but when I run this code it only prints half of the data (including the review data). Here is my script:

from bs4 import BeautifulSoup
from urllib.request import urlopen

inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)

url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

page_details = soup.find("dl", {"class":"boccat"})
Readers = page_details.find_all("a")

for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link)+ "\n")
f.close()
  1. Is the style of my script wrong?
  2. How can I shorten the code?
  3. When should I use find_all vs. find? I keep getting attribute errors, and I've read the documentation but still don't understand the difference.

1 Answer:

Answer 0 (score: 0)

To shorten the code, you can switch to the "requests" library. It is easy to use and precise. If you want it even shorter, you can use CSS selectors.

find selects the container, and find_all selects the individual items of that container inside a for loop. Here is the complete code:

from bs4 import BeautifulSoup
import csv ; import requests

outfile = open("chicagoreader.csv","w",newline='')
writer = csv.writer(outfile)
writer.writerow(["Name","Link"])

base = "https://www.chicagoreader.com"

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))
outfile.close()  # close the file so buffered rows are flushed to disk
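The csv module is doing real work here, by the way: the original script joins fields with a bare comma, so any name that itself contains a comma would silently split into two columns. A small stdlib-only sketch (the venue name is invented for illustration):

```python
import csv
import io

# csv.writer quotes any field that contains a comma, so the name below
# stays in a single column; writing "name,link\n" by hand would split it.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Joe's Diner, South Loop", "https://www.chicagoreader.com/x"])
row = buf.getvalue()
print(row)
```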

Or, using find and find_all:

from bs4 import BeautifulSoup
import requests

base = "https://www.chicagoreader.com"

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for dd in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = dd.find("a")  # first <a> in this <dd>
    print(item.text, base + item.get("href"))
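On question 3: the AttributeError usually comes from chaining onto find when it matched nothing. find returns the first matching tag, or None if there is no match, while find_all always returns a (possibly empty) list. A guarded sketch against a made-up snippet mimicking the page's structure:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the page's structure (not the real content).
html = """
<dl class="boccat">
  <dd><a href="/link-1">Best Tacos</a></dd>
  <dd><a href="/link-2">Best Coffee</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match or None; calling .find_all on that None
# is what raises "AttributeError: 'NoneType' object has no attribute ...".
container = soup.find("dl", {"class": "boccat"})
links = [] if container is None else container.find_all("a")

for a in links:
    print(a.text, "https://www.chicagoreader.com" + a.get("href"))
```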