BeautifulSoup produces inconsistent results

Asked: 2018-06-09 02:39:00

Tags: python beautifulsoup

I am using BeautifulSoup to pull data out of the sidebar of a few subreddits on Reddit, but my results change every time I run my script.

Specifically, the contents of sidebar_urls vary from run to run: sometimes it ends up as [XYZ.com/abc, XYZ.com/def], sometimes it returns only [XYZ.com/def], and occasionally it returns [].

Any ideas why this might be happening with the code below?

import urllib.request
from bs4 import BeautifulSoup

# headers and reddit_urls are defined earlier in my script
sidebar_urls = []

for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')

    links = soup.find_all(href=True)

    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])

1 answer:

Answer 0 (score: 0)

It looks like you sometimes get a page that has no sidebar. This is probably because Reddit identifies you as a bot and returns a default page instead of the one you expect. Consider identifying yourself via the User-Agent field when requesting a page:

import requests
from bs4 import BeautifulSoup

reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]

# Update this to identify yourself
user_agent = "me@example.com"

sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue

    # Find all links in the sidebar tag
    link_tags = side_tag.find_all("a")
    for link in link_tags:
        link_text = str(link["href"])
        sidebar_urls.append(link_text)

print(sidebar_urls)
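The same fix also works with the urllib.request code from the question, without switching to requests. A minimal sketch of just the header-setting part (the User-Agent string and URL here are placeholders; substitute your own contact details):

```python
import urllib.request

# Placeholder: replace with something that identifies you
user_agent = "my-sidebar-scraper (me@example.com)"

def make_request(url):
    """Build a Request that identifies the scraper via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("https://www.reddit.com/r/pokemon/")
# urllib stores header names capitalized, so look it up as "User-agent"
assert req.get_header("User-agent") == user_agent
```

You would then pass each request to urllib.request.urlopen() exactly as in the original loop.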