BeautifulSoup doesn't scrape all the data

Date: 2017-09-14 10:34:35

Tags: web-scraping beautifulsoup python-3.6

I'm trying to scrape a website, but when I run this code it only prints half of the data (including the review data). Here is my script:

from bs4 import BeautifulSoup
from urllib.request import urlopen

inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)

url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

page_details = soup.find("dl", {"class":"boccat"})
Readers = page_details.find_all("a")

for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link)+ "\n")
f.close()
  1. Is the style of my script wrong?
  2. How can I shorten the code?
  3. When should I use find_all vs. find? I keep getting attribute errors, and I've read the documentation but still don't understand the difference.

1 Answer:

Answer 0 (score: 0)

To shorten the code, you can switch to the "requests" library. It is easy to use and precise. If you want it even shorter, you can use CSS selectors.

find selects the container, and find_all selects the individual items of that container inside a for loop. Here is the complete code:

from bs4 import BeautifulSoup
import csv ; import requests

outfile = open("chicagoreader.csv","w",newline='')
writer = csv.writer(outfile)
writer.writerow(["Name","Link"])

base = "https://www.chicagoreader.com"

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))
outfile.close()  # close the file so buffered rows are flushed to disk
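The csv module is doing real work here, by the way: the original script joins fields with a bare comma, so any name that itself contains a comma would silently split into two columns. A small stdlib-only sketch (the venue name is invented for illustration):

```python
import csv
import io

# csv.writer quotes any field that contains a comma, so the name below
# stays in a single column; writing "name,link\n" by hand would split it.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Joe's Diner, South Loop", "https://www.chicagoreader.com/x"])
row = buf.getvalue()
print(row)
```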

Or, using find and find_all:

from bs4 import BeautifulSoup
import requests

base = "https://www.chicagoreader.com"

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for dd in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = dd.find("a")  # first <a> in this <dd>
    print(item.text, base + item.get("href"))
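On question 3: the AttributeError usually comes from chaining onto find when it matched nothing. find returns the first matching tag, or None if there is no match, while find_all always returns a (possibly empty) list. A guarded sketch against a made-up snippet mimicking the page's structure:

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the page's structure (not the real content).
html = """
<dl class="boccat">
  <dd><a href="/link-1">Best Tacos</a></dd>
  <dd><a href="/link-2">Best Coffee</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match or None; calling .find_all on that None
# is what raises "AttributeError: 'NoneType' object has no attribute ...".
container = soup.find("dl", {"class": "boccat"})
links = [] if container is None else container.find_all("a")

for a in links:
    print(a.text, "https://www.chicagoreader.com" + a.get("href"))
```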