I'm using BeautifulSoup to pull data out of the sidebar of a few subreddits, but my results change every time I run my script. Specifically, the contents of sidebar_urls vary from run to run: sometimes it yields [XYZ.com/abc, XYZ.com/def], sometimes it returns only [XYZ.com/def], and sometimes it returns []. Any ideas why this might be happening with the code below?
import urllib.request
from bs4 import BeautifulSoup

# reddit_urls and headers are defined elsewhere in my script
sidebar_urls = []
for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')
    links = soup.find_all(href=True)
    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])
Answer 0 (score: 0)
It looks like you are sometimes getting a page that has no sidebar. This is probably because Reddit identifies you as a bot and returns a default page instead of the one you expect. Consider identifying yourself via the User-Agent field when requesting the page:
import requests
from bs4 import BeautifulSoup

reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]

# Update this to identify yourself
user_agent = "me@example.com"

sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue

    # Find all links in the sidebar tag
    link_tags = side_tag.find_all("a")
    for link in link_tags:
        link_text = str(link["href"])
        sidebar_urls.append(link_text)

print(sidebar_urls)
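To see why scoping the search to the sidebar div helps (compared with the question's soup.find_all(href=True) over the whole page), here is a small offline sketch. The HTML snippet, class names, and URLs are made up for illustration; real Reddit markup will differ:

```python
from bs4 import BeautifulSoup

# Toy page: one link in the body, two in the sidebar
html = """
<div class="content">
  <a href="https://XYZ.com/ignored">body link</a>
</div>
<div class="side">
  <a href="https://XYZ.com/abc">abc</a>
  <a href="https://XYZ.com/def">def</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching the whole document picks up the body link too
all_links = [a["href"] for a in soup.find_all(href=True)]

# Restricting to the sidebar div returns only the links we want
side_tag = soup.find("div", {"class": "side"})
side_links = [a["href"] for a in side_tag.find_all("a")]

print(side_links)  # ['https://XYZ.com/abc', 'https://XYZ.com/def']
```

If the sidebar div is missing entirely (as when Reddit serves a bot-default page), soup.find returns None, which is why the answer checks for that case before calling find_all.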