Find tags except those with attributes: BeautifulSoup

Time: 2013-06-06 08:03:00

Tags: python beautifulsoup

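In the page I'm trying to scrape, I want to exclude the <td> that has attributes:

<td colspan="6" id="more"><a href="http://www.cnc.gov.ar/infotecnica/numeracion/indicativosinter.asp" target="_blank">Click here</a> for a comprehensive area code list for Argentina</td>

Is there a function for excluding this tag based on the fact that it has attributes?

My code gets all the cities and area codes: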

from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.howtocallabroad.com/argentina"
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)

areatable = soup.find('table',{'id':'codes'})
if areatable is None:
    print "areatable is None"
else:
    d = {}

    def chunks(l, n):
        return [l[i : i + n] for i in range(0, len(l), n)]

    all_td = areatable.findAll('td')
    print all_td

    li = dict(chunks([i.text for i in all_td], 2))
    print li

But when I try to build and print li, it throws an exception:

Traceback (most recent call last):
  File "extract_table.py", line 21, in <module>
    li = dict(chunks([i.text for i in all_td], 2))
ValueError: dictionary update sequence element #30 has length 1; 2 is required

This is the result I get when calling areatable.findAll('td'):
[
<td>Buenos Aires</td>,
<td>11</td>,
<td>La Rioja</td>,
<td>380</td>,
<td>Salta</td>,
<td>387</td>,
<td>Bahia Blanca</td>,
<td>291</td>,
<td>Mar del Plata</td>,
<td>223</td>,
<td>San Juan</td>,
<td>264</td>,
<td>Catamarca<br/></td>,
<td>383</td>,
<td>Mendoza</td>,
<td>261</td>,
<td>San Luis</td>,
<td>266</td>,
<td>Comodoro Rivadavia</td>,
<td>297</td>,
<td>Mercedes/Prov. B.A.</td>,
<td>2324</td>,
<td>San Nicolas</td>,
<td>336</td>,
<td>Concordia</td>,
<td>345</td>,
<td>Neuquen</td>,
<td>299</td>,
<td>San Rafael</td>,
<td>260</td>,
<td>Cordoba</td>,
<td>351</td>,
<td>Parana</td>,
<td>343</td>,
<td>Santa Fe</td>,
<td>342</td>,
<td>Corrientes</td>,
<td>379</td>,
<td>Posadas</td>,
<td>376</td>,
<td>Santiago del Estero</td>,
<td>385</td>,
<td>Formosa</td>,
<td>370</td>,
<td>Resistencia</td>,
<td>362</td>,
<td>Santo Tome</td>,
<td>3756</td>,
<td>Jesus Maria</td>,
<td>3525</td>,
<td>Rio Cuarto</td>,
<td>358</td>,
<td>Tandil</td>,
<td>249</td>,
<td>La Plata</td>,
<td>221</td>,
<td>Rosario</td>,
<td>341</td>,
<td>Trelew</td>,
<td>280</td>,
<td colspan="6" id="more"><a href="http://www.cnc.gov.ar/infotecnica/numeracion/indicativosinter.asp" target="_blank">Click here</a> for a comprehensive area code list for Argentina</td>
]

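1 answer:

Answer 0 (score: 4)

The problem is that all_td has an odd length, so the last chunk produced by chunks contains only one element and dict() cannot treat it as a key/value pair. Here is a simple lambda function that checks whether a tag has no attributes; you can use it to keep only the <td>stuff</td> tags: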

>>> all_td = filter(lambda x: x.attrs == {}, all_td)
# all_td now contains [<td>Buenos Aires</td>, <td>11</td>, <td>La Rioja</td>, <td>380</td>, <td>Salta</td>, <td>387</td>, <td>Bahia Blanca</td>, <td>291</td>, <td>Mar del Plata</td>, <td>223</td>, <td>San Juan</td>, <td>264</td>, <td>Catamarca<br/></td>, <td>383</td>, <td>Mendoza</td>, <td>261</td>, <td>San Luis</td>, <td>266</td>, <td>Comodoro Rivadavia</td>, <td>297</td>, <td>Mercedes/Prov. B.A.</td>, <td>2324</td>, <td>San Nicolas</td>, <td>336</td>, <td>Concordia</td>, <td>345</td>, <td>Neuquen</td>, <td>299</td>, <td>San Rafael</td>, <td>260</td>, <td>Cordoba</td>, <td>351</td>, <td>Parana</td>, <td>343</td>, <td>Santa Fe</td>, <td>342</td>, <td>Corrientes</td>, <td>379</td>, <td>Posadas</td>, <td>376</td>, <td>Santiago del Estero</td>, <td>385</td>, <td>Formosa</td>, <td>370</td>, <td>Resistencia</td>, <td>362</td>, <td>Santo Tome</td>, <td>3756</td>, <td>Jesus Maria</td>, <td>3525</td>, <td>Rio Cuarto</td>, <td>358</td>, <td>Tandil</td>, <td>249</td>, <td>La Plata</td>, <td>221</td>, <td>Rosario</td>, <td>341</td>, <td>Trelew</td>, <td>280</td>]

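Simply put, the lambda function returns True if the tag has no attributes. What filter() does is go through each element in all_td and run the lambda function on it. If the lambda function returns False for a given tag, that tag is left out of the result. A new list is returned (in Python 2, which this code uses; in Python 3, filter() returns an iterator, so wrap it in list()).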
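To make the filtering step concrete, here is a minimal sketch in Python 3. The HTML string is a shortened, hypothetical stand-in for the real table (not the actual page), and the names plain_td, plain_td_alt, and cell_texts are illustrative only. It shows the same attrs check written as a list comprehension, plus what should be an equivalent form that passes a function to find_all():

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the table on the real page.
html = """
<table id="codes">
  <tr><td>Buenos Aires</td><td>11</td></tr>
  <tr><td>La Rioja</td><td>380</td></tr>
  <tr><td colspan="6" id="more">Click here for a comprehensive area code list for Argentina</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
areatable = soup.find("table", {"id": "codes"})
all_td = areatable.find_all("td")

# Keep only the cells whose attribute dict is empty (same test as the lambda).
plain_td = [td for td in all_td if not td.attrs]

# Alternative: let find_all() apply the test itself via a function filter.
plain_td_alt = areatable.find_all(lambda tag: tag.name == "td" and not tag.attrs)

cell_texts = [td.get_text(strip=True) for td in plain_td]
print(cell_texts)
```

The list comprehension behaves the same way under Python 2 and Python 3, which avoids the difference in what filter() returns.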

Now when you call chunks, the list has an even number of elements, so the error no longer occurs.
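The effect of odd versus even length can be reproduced without any HTML at all. The sketch below is in Python 3 and uses made-up sample lists (odd_cells and even_cells are hypothetical names, not from the question). It reuses the chunks helper from the question with renamed parameters:

```python
def chunks(seq, n):
    """Split seq into consecutive slices of length n (the last may be shorter)."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

# Five items: the trailing cell has no partner, as with the extra <td> in the question.
odd_cells = ["Buenos Aires", "11", "La Rioja", "380",
             "Click here for a comprehensive area code list for Argentina"]
even_cells = odd_cells[:-1]

try:
    dict(chunks(odd_cells, 2))
    odd_error = None
except ValueError as exc:
    # The final chunk has length 1, so dict() cannot unpack it into key and value.
    odd_error = str(exc)

codes = dict(chunks(even_cells, 2))
print(odd_error)
print(codes)
```

With the unpaired cell removed, every chunk has exactly two elements and dict() can build the city-to-code mapping.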