Question

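I am using Python and BeautifulSoup to scrape a table, and I have a pretty good handle on getting most of the information I need. Here is a shortened version of what I am trying to scrape.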

<tr> <td><a href="/wiki/Joseph_Carter_Abbott" title="Joseph Carter Abbott">Joseph Carter  Abbott</a></td> <td>1868–1872</td> <td>North Carolina</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr> 
<tr> <td><a href="/wiki/James_Abdnor" title="James Abdnor">James Abdnor</a></td> <td>1981–1987</td> <td>South Dakota</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td> </tr> <tr> <td><a href="/wiki/Hazel_Abel" title="Hazel Abel">Hazel Abel</a></td> <td>1954</td> <td>Nebraska</td> <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td> 
</tr>

http://en.wikipedia.org/wiki/List_of_former_United_States_senators

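I want the name, description, years, state, and party.

The description is the first paragraph of text on each person's page. I know how to get this on its own, but I don't know how to combine it with the name, years, state, and party, because I have to navigate to a different page.

Oh, and I need to write it to a CSV.

Thanks!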

Answer 1

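Just to expand on @anrosent's answer: sending a request in the middle of parsing is one of the best and most consistent ways to do this. However, the function that gets the description has to behave properly as well, because if it fails with a NoneType error, the whole process falls apart.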

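The way I did this on my end is as follows (note that I'm using the Requests library rather than urllib or urllib2, as I'm more comfortable with it; feel free to change it to your liking, the logic is the same either way):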

from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

ofile = open("presidents.csv", "wb")
f = csv.writer(ofile)
f.writerow(["Name","Description","Years","State","Party"])
base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"
r = rq.get(base_url)
soup = bsoup(r.content)
all_tables = soup.find_all("table", class_="wikitable")

def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8")
    return desc

complete_list = []
for table in all_tables:
    trs = table.find_all("tr")[1:] # Ignore the header row.
    for tr in trs:
        tds = tr.find_all("td")
        first = tds[0].find("a") 
        name = first.get_text().encode("utf-8")
        desc = get_description("http://en.wikipedia.org%s" % first["href"])
        years = tds[1].get_text().encode("utf-8")
        state = tds[2].get_text().encode("utf-8")
        party = tds[3].get_text().encode("utf-8")
        f.writerow([name, desc, years, state, party])

ofile.close()

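However, this attempt stops at the row after David Barton. If you check the page, it may have something to do with him occupying two rows by himself. That is up to you to fix. The traceback is as follows: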

Traceback (most recent call last):
  File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module>
    name = first.get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'get_text'
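The traceback says that `tds[0].find("a")` returned `None` for that row. Below is a minimal sketch of one way to guard against such rows. `parse_row` is a hypothetical helper, not part of the script above, and the second row of the sample HTML is made up to mimic a row whose first cell has no link. It only skips rows it cannot parse; it does not try to handle the two-row layout properly.

```python
from bs4 import BeautifulSoup

def parse_row(tr):
    """Return (name, href, years, state, party), or None if the row is irregular."""
    tds = tr.find_all("td")
    if len(tds) < 4:
        return None  # e.g. a continuation row that has fewer cells
    link = tds[0].find("a")
    if link is None or not link.has_attr("href"):
        return None  # first cell has no usable link, so skip instead of crashing
    name = link.get_text(strip=True)
    years, state, party = (td.get_text(strip=True) for td in tds[1:4])
    return name, link["href"], years, state, party

# Made-up sample: one regular row and one irregular row without a link.
sample = """
<table>
<tr><td><a href="/wiki/Hazel_Abel">Hazel Abel</a></td><td>1954</td><td>Nebraska</td><td>Republican</td></tr>
<tr><td>1900-1901</td><td>Some State</td><td>Some Party</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")
rows = [parse_row(tr) for tr in soup.find_all("tr")]
parsed = [r for r in rows if r is not None]
```

In the main loop you would then `continue` whenever `parse_row(tr)` returns `None`, and decide separately what to do with the skipped rows.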

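Also, notice that my get_description function is placed before the main process. That is simply because you have to define a function before you call it. Finally, my get_description function is not perfect, as it can fail if the first p tag on a page is not the one you want.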
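As a rough illustration of that last point, here is a sketch that returns the first paragraph that actually contains text instead of blindly taking the first `p` tag. `first_paragraph` is a hypothetical name, the sample HTML is invented, and whether "first non-empty paragraph" is the right rule for every senator page is an assumption you would need to check against the real pages.

```python
from bs4 import BeautifulSoup

def first_paragraph(html):
    """Return the text of the first <p> that has any text, or "" if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:
            return text
    return ""  # never None, so the caller can write the result straight to the CSV

# Invented sample: an empty paragraph followed by the one we want.
sample = "<div><p>\n</p><p><b>Example Person</b> was a placeholder.</p></div>"
description = first_paragraph(sample)
```

Returning an empty string rather than raising keeps one odd page from stopping the whole run.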

Sample of the result:

[image: sample of the result]

Note the erroneous rows, such as Maryon Allen's description. That is for you to fix as well.

Hope this points you in the right direction.
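A note for anyone running this on Python 3: the script above is written for Python 2.7 (file opened with "wb", strings encoded to UTF-8 by hand). On Python 3 the CSV part would look more like the sketch below, as far as I can tell. `write_rows` and the file name `senators.csv` are made-up names, and the description in the sample row is invented for illustration.

```python
import csv
import io

HEADER = ["Name", "Description", "Years", "State", "Party"]

def write_rows(fileobj, rows):
    """Write the header and all rows to an already opened text-mode file."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)

# In a real script (file name is hypothetical):
#   with open("senators.csv", "w", newline="", encoding="utf-8") as f:
#       write_rows(f, rows)
# Here an in-memory buffer stands in for the file.
buffer = io.StringIO(newline="")
write_rows(buffer, [
    ("Hazel Abel", 'A "sample", made-up description', "1954", "Nebraska", "Republican"),
])
output = buffer.getvalue()
```

On Python 3 the csv module works with text, so there is no need to call `.encode("utf-8")` on each field; the encoding is given once when opening the file, and `newline=""` lets the csv module control the line endings.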

Answer 2

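If you are using BeautifulSoup, you won't be navigating to another page in a stateful, browser-like sense; you just make another request for the other page, with a URL like wiki/name. So your code would look something like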

import urllib, csv

#quick way to get the HTML text as string for given url
#(defined before the loop below so that it exists when it is called)
def get_html(url):
    return urllib.urlopen(url).read()

with open('out.csv','w') as f:
    csv_file = csv.writer(f)

    #loop through the rows of the table
    for row in senator_rows:
        name = get_name(row)

        ... #extract the other data from the <tr> elt

        senator_page_url = get_url(row)

        #get description from HTML text of senator's page
        description = get_description(get_html(senator_page_url))

        #write this row to the CSV file
        csv_file.writerow([name, ..., description])

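Note that in Python 3.x you will be importing and using urllib.request instead of urllib, and you will have to decode the bytes that the read() call returns. It sounds like you know how to fill in the other get_* functions I left in there, so I hope this helps!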
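For reference, a Python 3 version of get_html along those lines might look like the sketch below. The `default_encoding` parameter and the fallback to UTF-8 are my own assumptions, not something the page guarantees.

```python
from urllib.request import urlopen

def get_html(url, default_encoding="utf-8"):
    """Fetch url and return the response body as str (Python 3)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header if there is one.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors="replace")

# Usage (makes a network request, so it is left commented out):
# html = get_html("http://en.wikipedia.org/wiki/Hazel_Abel")
```

Decoding with `errors="replace"` means a stray bad byte shows up as a replacement character instead of raising halfway through the scrape.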

Scrape a table and get more information from links
