Python - 'NoneType' object has no attribute 'find_next_sibling'

Asked: 2018-06-20 04:51:47

Tags: python web-scraping beautifulsoup wikipedia

I'm trying to make a Wikipedia crawler that grabs the "See also" link text and then follows the URL each of those tags links to. However, the "See also" section of an article (which is an unordered list) has no class or id, so I use the find_next_sibling method to get it. Next, it should visit the Wikipedia page behind each link found there and do the same thing. Here is my code:

import requests
from bs4 import BeautifulSoup


def wikipediaCrawler(page, maxPages):

    pageNumber = 1
    while pageNumber < maxPages:
        url = "https://en.wikipedia.org" + page
        sourceCode = requests.get(url)
        print(sourceCode)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")
        ul = soup.find("h2", text="See also").find_next_sibling("ul")
        for li in ul.findAll("li"):
            print(li.get_text())
        for link in ul.findAll('a'):
            page = str(link.get('href'))
            print(page)
        pageNumber += 1


wikipediaCrawler("/wiki/Online_chat", 3)

It prints the first page fine. The problem is that whenever it tries to switch pages, I get this error:

Traceback (most recent call last):
  File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 23, in <module>
    wikipediaCrawler("/wiki/Online_chat", 3)
  File "C:/Users/Shaman/PycharmProjects/WebCrawler/main.py", line 14, in wikipediaCrawler
    ul = soup.find("h2", text="See also").find_next_sibling("ul")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

I printed the requests response and it shows <Response [200]>, so it doesn't seem to be a permissions issue. Honestly, I have no idea why this happens. Any ideas? Thanks in advance.

Edit: I know that the Wikipedia articles it searches all contain an h2 tag with the text "See also". In this case it fetched the "Voice_chat" article, yet found nothing there.
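The crash can be reproduced without the network: chaining `.find_next_sibling` onto `soup.find(...)` fails the moment `find` returns `None`. A minimal guard (a sketch, not part of the original post; the helper name `find_see_also_list` is made up for illustration) checks the result before chaining:

```python
from bs4 import BeautifulSoup


def find_see_also_list(html):
    """Return the <ul> that follows the 'See also' <h2>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", text="See also")
    if heading is None:  # some articles simply have no "See also" heading
        return None
    return heading.find_next_sibling("ul")  # may also be None if no <ul> follows
```

Calling this instead of the chained expression turns the `AttributeError` into an explicit `None` the caller can test for.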

2 answers:

Answer 0 (score: 0)

I think you want the <ul> tag after the h2 that opens the "See also" section.

One way to find that h2 is to use a CSS selector to locate the right tag, then get its parent element (the h2), and from there take the next sibling:

def wikipediaCrawler(page, maxPages):

    #...

    soup = BeautifulSoup(plainText, "html.parser")

    see_also = soup.select("h2 > #See_also")[0]
    ul = see_also.parent.find_next_sibling("ul")

    for link in ul.findAll('a'):
        page = str(link.get('href'))
        print(page)

wikipediaCrawler("/wiki/Online_chat", 3)

Output:

/wiki/Chat_room
/wiki/Collaborative_software
/wiki/Instant_messaging
/wiki/Internet_forum
/wiki/List_of_virtual_communities_with_more_than_100_million_active_users
/wiki/Online_dating_service
/wiki/Real-time_text
/wiki/Videotelephony
/wiki/Voice_chat
/wiki/Comparison_of_VoIP_software
/wiki/Massively_multiplayer_online_game
/wiki/Online_game
/wiki/Video_game_culture

Answer 1 (score: -1)

This code, soup.find("h2", text="See also"), sometimes doesn't find the element and then returns None.

A quick solution is to catch the error and move on:

import requests
from bs4 import BeautifulSoup


def wikipediaCrawler(page, maxPages):
    pageNumber = 1
    while pageNumber < maxPages:
        try:
            url = "https://en.wikipedia.org" + page
            sourceCode = requests.get(url)
            print(sourceCode)
            plainText = sourceCode.text
            soup = BeautifulSoup(plainText, "html.parser")
            ul = soup.find("h2", text="See also").find_next_sibling("ul")
            for li in ul.findAll("li"):
                print('li: ', pageNumber, li.get_text())
            for link in ul.findAll('a'):
                page = str(link.get('href'))
                print('a:', pageNumber, page)
        except Exception as e:
            print(e)
            print(soup.find("h2", text="See also"))

        pageNumber += 1

wikipediaCrawler("/wiki/Online_chat", 3)

I added a small change to the prints to make debugging easier.