Question

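I'm a high school student practicing Python. For my final project I want to use web scraping (which we didn't cover in class). Below is my code, which is supposed to ask the user for their birthday and then print a list of celebrities who share it (ignoring the birth year).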

import requests
from bs4 import BeautifulSoup

print("Please enter your birthday:")
BD_Day = input("Day: ")
BD_Month = input("Month (1-12): ")
Months = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
Month = dict(zip(range(12), Months))
BD_Month = int(BD_Month)
messy_url = ['http://www.famousbirthdays.com/', Month[BD_Month - 1], BD_Day, '.html']
url = ''.join(messy_url)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
spans = soup.find_all('span', attrs={'class':'title'})
for span in spans:
    print (span.string)

The code should fetch the web page defined as "url"; however, it always prints the list of people born on November 6:

Lauren Orlando
Emma Stone
Alastair Aiken
Sal Vulcano
Bailey Ballinger

The code also prints only 5 of the 48 names on the page: entries 1-6, oddly skipping number 5.

My two main problems are the wrong date and the incomplete list. Any input is appreciated.

Thanks.

Answer 1

I'd say your error comes either from the URL or from the span tags, because the site holds each person in div elements nested inside a elements.
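To make that nesting concrete, here is a minimal offline sketch. The HTML sample is hypothetical: it is modeled only on the class names used in this answer (people-list, person-item, info, name), and the real page may differ or change. The helper name extract_names is also just an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the class names used in this answer;
# the live page's markup may differ.
SAMPLE_HTML = """
<div class="people-list">
  <a class="person-item" href="#">
    <div class="info"><div class="name"> Emma Stone </div></div>
  </a>
  <a class="person-item" href="#">
    <div class="info"><div class="name"> Sal Vulcano </div></div>
  </a>
</div>
"""

def extract_names(html):
    """Return the text of every name div found inside a person-item link."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="people-list")
    if container is None:
        # find() returns None when nothing matches, so stop early
        # instead of calling find_all() on None.
        return []
    names = []
    for person in container.find_all("a", class_="person-item"):
        name_div = person.find("div", class_="name")
        if name_div is not None:
            names.append(name_div.get_text(strip=True))
    return names

print(extract_names(SAMPLE_HTML))
```

Searching for span tags with class "title" would find nothing in markup shaped like this, which is why the selectors need to follow the a and div nesting instead.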

So, here's how I would do it:

import requests
from bs4 import BeautifulSoup

#ask for birthday
print("Please enter your birthday:")
BD_Day = input("Day: ")
BD_Month = input("Month (1-12): ")
Months = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')

#make URL
url = "https://www.famousbirthdays.com/" + Months[int(BD_Month) - 1].lower() + BD_Day + ".html"

#make HTTP request
response = requests.get(url=url)

#parse HTML
page = BeautifulSoup(response.content, 'html.parser')

#find list of all people based on website's HTML
all_people = page.find("div",{"class":"people-list"}).find_all("a",{"class":"person-item"})

#show all people
for person in all_people:
    print(person.find("div",{"class":"info"}).find("div",{"class":"name"}).get_text().strip())
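The URL step can also be pulled into a small function that validates the input before any request is made. This is only a sketch: the function name build_birthday_url is made up for illustration, and the URL pattern (https, lowercase month name, day, ".html") is simply the one used in the code above, not something confirmed against the site.

```python
from datetime import date

MONTHS = ('january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december')

def build_birthday_url(day, month):
    """Build the page URL for a day and a month number (1-12).

    Assumes the URL pattern used in this answer; the site may change it.
    """
    day, month = int(day), int(month)
    # date() raises ValueError for impossible dates such as 31 February.
    # 2000 is a leap year, so 29 February is still accepted.
    date(2000, month, day)
    return f"https://www.famousbirthdays.com/{MONTHS[month - 1]}{day}.html"

print(build_birthday_url("6", "11"))
# prints https://www.famousbirthdays.com/november6.html
```

One possible explanation for always getting the same date (an assumption, not verified here) is that the site does not recognize the original capitalized URL and serves a default page instead. After requests.get(url), comparing response.url with the URL you asked for, or looking at response.history, would show whether a redirect happened.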

Hope this helps!
