Scraping text with BeautifulSoup - NoneType error

Time: 2016-12-20 03:16:23

Tags: python web-scraping wikipedia

I am trying to scrape table data from Wikipedia, but I keep getting this error:

AttributeError: 'NoneType' object has no attribute 'findAll'

Here is my code.

from bs4 import BeautifulSoup
import urllib
import urllib.request



wiki = "https://en.wikipedia.org/wiki/List_of_current_United_States_Senators"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "lxml")

name = ""
party = ""
state = ""
picture = ""
link = ""
district = ""

table = soup.find("table", { "class" : "wikitable sortable" })

f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")


    state = cells[0].find(text=True)
    picture = cells[2].findAll(text=True)
    name = cells[3].find(text=True)
    party = cells[4].find(text=True)


    write_to_file = name + "," + state + "," + party + "," + link + "," + picture + "," + district + "\n"
    print (write_to_file)
    f.write(write_to_file)

f.close()

Any help, or even a different approach (I considered using the wiki API, but unfortunately I'm not sure what to use), would be greatly appreciated.

2 answers:

Answer 0: (score: 0)

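The main problem you are facing is that soup.find("table", { "class" : "wikitable sortable" }) returns None. There is, however, an element whose class is sortable wikitable sortable, and that is probably the one you want.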
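To see why the lookup comes back as None, here is a minimal sketch. The inline HTML is a hypothetical stand-in for the Wikipedia page (the real markup may differ), and it uses the built-in "html.parser" so it runs without lxml. It also shows select_one with a CSS selector as one possible alternative that does not depend on the order of the classes, plus a guard against None before calling find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; the real class attribute may differ.
html = """
<table class="sortable wikitable sortable">
  <tr><th>State</th><th>Name</th></tr>
  <tr><td>Alabama</td><td>Richard Shelby</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class string is compared against the whole attribute value,
# so a different word order or an extra class gives no match.
exact = soup.find("table", {"class": "wikitable sortable"})

# A CSS selector only requires that both classes are present.
by_selector = soup.select_one("table.wikitable.sortable")

# Guard against None before touching the table.
if by_selector is None:
    raise SystemExit("table not found; inspect the page markup")

rows = by_selector.find_all("tr")
print(exact is None, len(rows))
```

Checking for None right after the lookup turns the confusing AttributeError into a message that points at the real cause.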

I fixed that and added an if and some print calls. It still doesn't work, but I guess the remaining problem is easier to solve. Now it's your turn :)

from bs4 import BeautifulSoup
import urllib
import urllib.request

wiki =  "https://en.wikipedia.org/wiki/List_of_current_United_States_Senators"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "lxml")

name = ""
party = ""
state = ""
picture = ""
link = ""
district = ""

table = soup.find("table", { "class" : "sortable wikitable sortable" })

f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    if cells:
        state = cells[0].find(text=True)
        picture = cells[2].findAll(text=True)
        name = cells[3].find(text=True)
        party = cells[4].find(text=True)

        print(state, type(state))
        print(picture, type(picture))
        print(name, type(name))
        print(party, type(party))
        write_to_file = name + "," + state + "," + party + "," + link + "," + picture + "," + district + "\n"
        print (write_to_file)
        f.write(write_to_file)
        f.flush()

f.close()
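One reason the script above still fails is visible in its debug prints: findAll(text=True) returns a list-like result rather than a string, so "," + picture raises a TypeError. The sketch below shows one possible way around that: join the fragments into a single string and let the csv module build the line. The sample values and the in-memory buffer are illustrative assumptions, not data taken from the page.

```python
import csv
import io

# Illustrative values shaped like the ones the loop extracts:
# find(text=True) yields a single string, findAll(text=True) yields a list.
state = "Alabama"
name = "Shelby, Richard"
party = "Republican"
picture = ["", "\n"]  # list of text fragments, possibly blank

# Collapse the list into one string before writing.
picture_text = " ".join(s.strip() for s in picture if s.strip())

buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
writer = csv.writer(buffer)
# csv.writer inserts the separators and quotes fields that contain commas.
writer.writerow([name, state, party, picture_text])

line = buffer.getvalue()
print(line)
```

Using csv.writer instead of manual concatenation also keeps a name such as "Shelby, Richard" in a single column, because the embedded comma gets quoted.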

Answer 1: (score: 0)

This prints each row and writes it to out.txt:

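import csv

import bs4
import requests

base_url = 'https://en.wikipedia.org/wiki/List_of_current_United_States_Senators'
response = requests.get(base_url)
soup = bs4.BeautifulSoup(response.text, 'lxml')

# Assumption: the senators table carries both classes; a CSS selector
# matches them regardless of their order in the class attribute.
table = soup.select_one('table.wikitable.sortable')

with open('out.txt', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in table('tr'):
        # Header rows contain only <th> cells, so they yield an empty list.
        row_text = [td.get_text(strip=True) for td in row('td') if td.text]
        writer.writerow(row_text)
        print(row_text)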

out.txt:

[]
['Alabama', '3', 'Shelby, RichardRichard Shelby', 'Republican', 'None', 'U.S. House,Alabama Senate', 'University of Alabama, Tuscaloosa(BA;LLB)Birmingham School of Law(JD)', 'January 3, 1987', '(1934-05-06)May 6, 1934(age\xa082)', '2022']
['Alabama', '2', 'Sessions, JeffJeff Sessions', 'Republican', 'Lawyer in private practice', 'Alabama Attorney General,U.S. Attorneyfor theSouthern District of Alabama', 'Huntingdon College(BA)University of Alabama, Tuscaloosa(JD)', 'January 3, 1997', '(1946-12-24)December 24, 1946(age\xa069)', '2020']
['Alaska', '3', 'Murkowski, LisaLisa Murkowski', 'Republican', 'Lawyer in private practice', 'Alaska House', 'Georgetown University(BA)Willamette University(JD)', 'December 20, 2002', '(1957-05-22)May 22, 1957(age\xa059)', '2022']
['Alaska', '2', 'Sullivan, DanDan Sullivan', 'Republican', 'Lawyer in private practice', 'Alaska Natural Resources Commissioner,Alaska Attorney General,U.S. Assistant Secretary of State for Economic and Business Affairs', 'Harvard University(BA)Georgetown University(MS;JD)', 'January 3, 2015', '(1964-11-13)November 13, 1964(age\xa052)', '2020']
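The doubled names in the output (for example 'Shelby, RichardRichard Shelby') most likely come from a hidden sort-key element that get_text picks up along with the visible link text. Below is a minimal sketch of one way to drop it before extracting the text. The inline markup and the class name "sortkey" are assumptions made for illustration; check the actual page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup: a hidden sort key followed by the visible name.
# The class name "sortkey" is an assumption; check the real page source.
html = (
    '<table><tr><td><span class="sortkey">Shelby, Richard</span>'
    '<a href="/wiki/Richard_Shelby">Richard Shelby</a></td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

before = cell.get_text(strip=True)

# Drop the hidden element so only the visible text is left.
for hidden in cell.select("span.sortkey"):
    hidden.decompose()

after = cell.get_text(strip=True)
print(before)
print(after)
```

If the hidden element uses a different class or an inline style, the selector would need to change accordingly.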