I just tried to parse reddit's Showerthoughts for submission titles and ran into a problem:
path = 'https://www.reddit.com/r/Showerthoughts/'
with requests.Session() as s:
    r = s.get(path)
    soup = BeautifulSoup(r.content, "lxml")
    # print(soup.prettify())
    threads = soup.find_all('p')
    for thread in threads:
        soup = thread
        text = soup('a')
        try:
            print(text[0])
        except:
            pass
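In this code I am trying to get the title of each submission, which sits inside a <p> tag and then an <a> tag with the class "title may-blank". However, the code above returns every <p> element that contains an <a> tag, and many of those are not titles. Even though the titles are in there, I would still need two more passes of soup.findAll() to reach them. I believe there must be a less manual way to search the soup and print all the titles.
From what I know, I tried this: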
titles = soup.findAll("a", {"class": "title may-blank"})
for title in titles:
    print(title.string)
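But that did not work.
Any ideas? PS: I know this can be done with the reddit API, and more efficiently, but I want to improve my parsing skills because they are not up to par. Thanks for your help.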
Answer 0: (score: 2)
They are CSS classes, and you also need to add a User-Agent header:
import requests
from bs4 import BeautifulSoup
path = 'https://www.reddit.com/r/Showerthoughts/'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"}
with requests.Session() as s:
    r = s.get(path, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    threads = soup.select('a.title.may-blank')
    for a in threads:
        print(a)
You could also use soup.find_all("a", class_="title"), but that may match more than you want.
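To see how the three lookups differ without hitting the network, here is a small sketch on made-up markup. The HTML below is hypothetical and only loosely modelled on a listing page (it is not reddit's real markup), and it uses the built-in "html.parser" rather than "lxml" so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not real reddit HTML.
SAMPLE_HTML = """
<div>
  <p class="title"><a class="title may-blank" href="/r/a/1">First thought</a></p>
  <p class="title"><a class="title may-blank outbound" href="/r/a/2">Second thought</a></p>
  <p class="tagline"><a class="author may-blank" href="/u/x">someone</a></p>
  <a class="title" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: <a> elements that carry both classes; extra classes are allowed.
by_selector = [a.get_text() for a in soup.select("a.title.may-blank")]

# class_ with a single name: every <a> that has the class "title".
by_single_class = [a.get_text() for a in soup.find_all("a", class_="title")]

# A multi-word string is compared against the exact class attribute value.
by_exact_string = [a.get_text() for a in soup.find_all("a", {"class": "title may-blank"})]

print(by_selector)
print(by_single_class)
print(by_exact_string)
```

On this sample the selector picks up both thought links, class_="title" additionally picks up the "About" link, and the exact-string form only matches the link whose class attribute is literally "title may-blank". That last behaviour is one possible reason the dictionary form can miss elements when the page adds further classes.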