Python -> BeautifulSoup -> Web scraping -> Loop through URLs (1 to 53) and save the results

Date: 2016-08-11 12:50:08

Tags: python web-scraping beautifulsoup

Here is the website I am trying to scrape: http://livingwage.mit.edu/

The specific URLs run from

http://livingwage.mit.edu/states/01

http://livingwage.mit.edu/states/02

http://livingwage.mit.edu/states/04 (For some reason they skipped 03)

...all the way to...

http://livingwage.mit.edu/states/56

On each URL, I need the last row of the second table.

Example from http://livingwage.mit.edu/states/01:

Required annual income before taxes: $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

...

...

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After two hours of messing around, this is what I have so far (I am a beginner):

import requests, bs4

res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
states = bs4.BeautifulSoup(res.text, 'html.parser')

state_name = states.select('h1')

table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]

result = []
result.append(state_name)
result.append(rows)

When I look at state_name and rows in the Python console, it gives me the HTML elements:

[<h1>Living Wag...Alabama</h1>]

[<tr class = "odd...   </td> </tr>]

Question 1: These contain what I need for my desired output, but how do I get Python to give them to me as strings rather than as HTML like the above?

Question 2: How do I loop through requests.get(url01 to url56)?

Thanks for your help.

And if you can offer a more efficient way to get the rows variable in my code, I would really appreciate it, because the way I got there is not very Pythonic.

2 answers:

Answer 0 (score: 5)

Get all the states from the initial page; then you can select the second table on each state page and use the css classes "odd results" to get the tr you need:

import requests
from bs4 import BeautifulSoup
from urllib.parse import  urljoin # python2 -> from urlparse import urljoin 


base = "http://livingwage.mit.edu"
res = requests.get(base)

res.raise_for_status()
states = []
# Get all the state urls and state names from the anchor tags on the base page:
# select every anchor inside each li that is a child of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we rsplit once from the right -> /states/51
    # and join that to the base url. The anchor text holds the state name,
    # so we store the full url and the state, e.g. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table; css indexing starts at 1, so "table:nth-of-type(2)" gets the second one.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text we just find all the tds and call .text on each.
    # The row we want has the css classes "odd results"; "td + td" starts from the
    # second td, since we don't want the first (the row label).
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list. 
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code, you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']
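The css selectors used in parse() can be tried out on a toy snippet; the markup below is a simplified stand-in for the real page, assumed purely for illustration:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the state page markup (not the real structure)
html = """
<table><tr><td>first table</td></tr></table>
<table>
  <tr class="odd results"><td>Label</td><td>$1</td><td>$2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table:nth-of-type(2)")   # css counts from 1: the second table
row = table.select_one("tr.odd.results")          # the row carrying both classes
print([td.text for td in row.select("td + td")])  # "td + td" skips the first td
```

Here "td + td" matches every td that directly follows another td, so the label cell is dropped and only ['$1', '$2'] remains.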

You could loop over the range 1-53, but pulling the anchors from the base page also gives us the state names in a single step. Using the h1 from each state page would instead give you output like "Living Wage Calculation for Alabama", and you would then have to parse it to extract just the name, which would not be trivial since some states have multi-word names.
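The url handling from the comments in the answer can also be seen in isolation; the href below mirrors the ones found on the base page:

```python
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin

base = "http://livingwage.mit.edu"
href = "/states/51/locations"  # example href from a state anchor

# rsplit("/", 1) splits once from the right: ["/states/51", "locations"]
state_path = href.rsplit("/", 1)[0]
print(urljoin(base, state_path))  # http://livingwage.mit.edu/states/51
```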

Answer 1 (score: 2)


Question 1: These contain what I need for my desired output, but how do I get Python to give them to me as strings rather than as HTML like the above?

You can get the text like this:

state_name=states.find('h1').text

The same can be applied to each row.
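For example, with a minimal snippet standing in for the real page (the h1 text here is assumed to match the live heading):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one state page
states = BeautifulSoup("<h1>Living Wage Calculation for Alabama</h1>", "html.parser")
state_name = states.find('h1').text  # .text returns the element's string content
print(state_name)  # Living Wage Calculation for Alabama
```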


Question 2: How do I loop through requests.get(url01 to url56)?

The same code block can be put inside a loop from 1 to 56, like this:

for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...

zfill adds the leading zeros. It would also be better to wrap the requests.get call in a try-except block, so that the loop continues gracefully even when a URL is bad.
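A quick sketch of the urls that zfill produces; no requests are made here, this only shows the padding:

```python
# zfill(2) left-pads with zeros, so state 3 becomes "03"
urls = ['http://livingwage.mit.edu/states/' + str(i).zfill(2) for i in range(1, 57)]
print(urls[0])   # http://livingwage.mit.edu/states/01
print(urls[2])   # http://livingwage.mit.edu/states/03 (404s on the site, hence the try-except)
print(urls[-1])  # http://livingwage.mit.edu/states/56
```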