Unable to scrape Reddit's NBA page

Date: 2017-10-18 04:56:12

Tags: python web-scraping beautifulsoup

I'm new to web scraping and want to learn how to use BeautifulSoup by integrating it into a mini project. I followed thenewboston's BeautifulSoup tutorials on his YouTube channel and then tried to scrape Reddit. I want to scrape the title and link of every NBA news post on reddit.com/r/nba, but I haven't had any success. The only thing returned in the terminal is "Process finished with exit code 0". I have a feeling it has something to do with my selectors? Any guidance and help would be greatly appreciated.

Here is the original code, which didn't work:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://reddit.com/r/nba' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all('a', {'class': 'title'}):
            href = link.get('href')
            print(href)
        page += 1

spider(1)
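
A quick way to see why the terminal shows nothing but "Process finished with exit code 0" is to check what Reddit actually sends back. This is a minimal diagnostic sketch, not part of the original attempt, and the User-Agent string is just a placeholder:

import requests
from bs4 import BeautifulSoup

# Note: 'https://reddit.com/r/nba' + str(1) builds 'https://reddit.com/r/nba1',
# which points at a different (likely nonexistent) subreddit, and Reddit often
# rejects clients without a browser-like User-Agent with HTTP 429.
url = 'https://reddit.com/r/nba' + str(1)
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)  # anything other than 200 means no listing came back
soup = BeautifulSoup(response.text, 'html.parser')
# Zero matches here is why the loop body, and hence the script, prints nothing.
print(len(soup.find_all('a', {'class': 'title'})))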

I tried this, but it didn't solve the problem:

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.reddit.com/r/nba/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'title'}):
            href = "https://www.reddit.com/" + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

spider(1)
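
The deeper issue is that appending a page number to the URL is not how Reddit paginates: each listing hands out an "after" token that keys the next page. Below is a sketch of the spider rewritten against Reddit's public JSON listing endpoint; the field names follow Reddit's JSON format, and the User-Agent string is a placeholder:

import requests

def spider(max_pages):
    # Reddit listings have no /1, /2, ... pages; each JSON listing carries
    # an 'after' token identifying the next page of posts.
    headers = {'User-Agent': 'nba-scraper-sketch/0.1'}  # placeholder UA string
    url = 'https://www.reddit.com/r/nba/.json'
    after = None
    for _ in range(max_pages):
        params = {'after': after} if after else {}
        response = requests.get(url, headers=headers, params=params)
        if response.status_code != 200:  # e.g. 429 if Reddit throttles us
            break
        listing = response.json()['data']
        for child in listing['children']:
            post = child['data']
            print(post['title'])
            print('https://www.reddit.com' + post['permalink'])
        after = listing['after']  # token for the next page, or None at the end
        if after is None:
            break

spider(1)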

1 Answer:

Answer 0 (score: 0)

To get the titles and links on the front page:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Reddit tends to answer urllib's default user agent with HTTP 429,
# so send a browser-like one.
request = Request("https://www.reddit.com/r/nba/",
                  headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(request)
soup = BeautifulSoup(html, 'lxml')
# Limit the search to the listing container, then grab each post-title anchor.
for link in soup.find('div', {'class': 'content'}).find_all('a', {'class': 'title may-blank outbound'}):
    print(link.attrs['href'], link.get_text())
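
Scoping the search to the div with class content keeps sidebar links out of the results, but note that these class names come from old Reddit's HTML and may change over time; the JSON endpoint sketched above avoids depending on them.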