Capturing values with XPath in Python from a URL's HTML source

Time: 2014-11-06 08:05:14

Tags: python regex xpath beautifulsoup

Suppose I have the following content:

cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""

What I want to end up with:

id1='test1'
id2='test2'
idn='testn'
Could you correct my approach?

if '<a id=' in cont:
  ....?

Do I have to use regular expressions in Python, or does XPath offer a way to grab them?

Note: I want all the IDs, but only those on <a> tags.

2 answers:

Answer 0: (score: 1)

Download bs4 here: http://www.crummy.com/software/BeautifulSoup/

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

This should work:

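from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(cont, 'html.parser')
for a in soup.select('a'):  # or soup.find_all('a') if you prefer
    if a.get('id') is not None:  # skip anchors that have no id
        print(a.get('id'))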

Or use a list comprehension to get a list:

ids = [a.get('id') for a in BeautifulSoup(cont, 'html.parser').select('a') if a.get('id') is not None]
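If installing bs4 is not an option, the same filtering can be sketched with the standard library alone. This assumes Python 3; the class name `AnchorIdCollector` is hypothetical and not part of any library, and `cont` is the string from the question.

```python
from html.parser import HTMLParser


class AnchorIdCollector(HTMLParser):
    """Hypothetical helper: collect the id of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            attr_map = dict(attrs)
            if attr_map.get("id") is not None:  # skip anchors without an id
                self.ids.append(attr_map["id"])


cont = """<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""

collector = AnchorIdCollector()
collector.feed(cont)
collector.close()
ids = collector.ids
print(ids)
```

Unlike a regular expression, this does not depend on `id` being the first attribute of the tag.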

Answer 1: (score: 1)

Using a list comprehension and BeautifulSoup.

>>> from bs4 import BeautifulSoup
>>> cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""
>>> soup = BeautifulSoup(cont, 'html.parser')
>>> [i.get('id') for i in soup.find_all('a') if i.get('id') is not None]
['test1', 'test2']
>>> [i['id'] for i in soup.find_all('a') if i.has_attr('id')]
['test1', 'test2']
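
As for the XPath part of the question, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. It assumes the markup is well-formed enough to parse as XML once wrapped in a single root element; the `<root>` wrapper is an addition made for this example, and real-world HTML will often fail this assumption.

```python
import xml.etree.ElementTree as ET

cont = """<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""

# ElementTree needs exactly one root element, so wrap the fragment.
# Assumption: the fragment is well-formed XML (quoted attributes, closed tags).
root = ET.fromstring("<root>" + cont + "</root>")

# './/a[@id]' selects every <a> descendant that carries an id attribute.
ids = [a.get("id") for a in root.findall(".//a[@id]")]
print(ids)
```

For messy HTML fetched from a real URL, a lenient parser with full XPath support, such as lxml, is likely the better fit; the expression there would be along the lines of `//a/@id`.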