I'm trying to scrape a website, but when I run this code it only prints half of the data (including the review data). Here is my script:
from bs4 import BeautifulSoup
from urllib.request import urlopen
inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)
url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
page_details = soup.find("dl", {"class":"boccat"})
Readers = page_details.find_all("a")
for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link) + "\n")
f.close()
I get an AttributeError from find_all and find. I have read the documentation but I don't understand it.

Answer 0 (score 0):
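That AttributeError usually means find returned None because nothing matched the selector, and the code then calls find_all on None. A minimal sketch (using made-up HTML, not the real page) showing the failure mode and a guard:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with no <dl class="boccat"> present
html = "<html><body><p>No categories here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

container = soup.find("dl", {"class": "boccat"})
print(container)  # None; calling container.find_all(...) would raise AttributeError

# Guard against a missing container before chaining calls
if container is not None:
    links = container.find_all("a")
else:
    links = []
print(len(links))
```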
To shorten the code, you can switch to the requests library. It is easy to use and precise. If you want to make it even shorter, you can use cssselect. find selects the container, and find_all selects the individual items of that container within the for loop. Here is the complete code:
from bs4 import BeautifulSoup
import csv
import requests

outfile = open("chicagoreader.csv", "w", newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Link"])
base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))
outfile.close()
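The `.boccat dd a` selector walks from the container class to each `dd` entry and then to its link. A self-contained sketch of how it behaves, using invented markup that mimics the page structure (the category names and hrefs here are made up):

```python
from bs4 import BeautifulSoup

# Invented markup mirroring the boccat list structure
html = """
<dl class="boccat">
  <dt>Best Bakery</dt>
  <dd><a href="/chicago/best-bakery/1">Sweet Spot</a></dd>
  <dt>Best Bar</dt>
  <dd><a href="/chicago/best-bar/2">The Tap</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
base = "https://www.chicagoreader.com"

# One (text, absolute-link) pair per matched anchor
rows = [(a.text, base + a.get("href")) for a in soup.select(".boccat dd a")]
print(rows)
```

Note that select matches every anchor in one pass, so no nested loops are needed.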
Or using find and find_all:
from bs4 import BeautifulSoup
import requests
base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for items in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = items.find_all("a")[0]
    print(item.text, base + item.get("href"))
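To see the container/item split concretely, here is a minimal offline sketch (same invented markup as above, not the real page): find returns the single matching dl, and find_all iterates its dd entries.

```python
from bs4 import BeautifulSoup

html = """
<dl class="boccat">
  <dd><a href="/a">First</a></dd>
  <dd><a href="/b">Second</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

container = soup.find("dl", {"class": "boccat"})  # first (and only) matching tag
entries = container.find_all("dd")                # every dd inside that container
names = [dd.find("a").text for dd in entries]
print(names)  # ['First', 'Second']
```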