Question

Suppose I have the following content:

cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""

What I want:

id1='test1'
id2='test2'
idn='testn'

Can you correct me?

if '<a id=' in cont:
  ....?

Do I have to use regular expressions in Python, or is there a way to grab them with XPath?

Note: I only want the ids that appear on a tags.

Answer 1

Download bs4 here: http://www.crummy.com/software/BeautifulSoup/

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

This should work:

from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(cont, 'html.parser')
for a in soup.select('a'):  # Or soup.find_all('a') if you prefer
    if a.get('id') is not None:
        print(a.get('id'))

Or use a list comprehension to get a list:

ids = [a.get('id') for a in BeautifulSoup(cont, 'html.parser').select('a') if a.get('id') is not None]
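
On the XPath part of the question: a minimal sketch, assuming the snippet is well-formed enough to parse as XML once it is wrapped in a root element. It uses the limited XPath subset in the standard library's xml.etree.ElementTree. The function name a_ids_xpath and the wrapper element name root are made up for illustration. Real-world HTML is often not well-formed, in which case this raises a ParseError and an HTML-aware parser is the safer choice.

```python
import xml.etree.ElementTree as ET

cont = """<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""


def a_ids_xpath(fragment):
    """Return the id of every <a> element that has one (hypothetical helper)."""
    # Assumption: the fragment is well-formed XML; wrap it so there is one root.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # ".//a[@id]" selects all descendant <a> elements carrying an id attribute.
    return [a.get("id") for a in root.findall(".//a[@id]")]


ids = a_ids_xpath(cont)
print(ids)
```

The [@id] predicate does the filtering, so no separate None check is needed.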

Answer 2

Using a list comprehension and BeautifulSoup.

>>> from bs4 import BeautifulSoup
>>> cont="""<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""
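>>> soup = BeautifulSoup(cont, 'html.parser')
>>> [i.get('id') for i in soup.find_all('a') if i.get('id') is not None]
['test1', 'test2']
>>> # i['id'] raises KeyError on a tag without an id, so test with has_attr first
>>> [i['id'] for i in soup.find_all('a') if i.has_attr('id')]
['test1', 'test2']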
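
If installing bs4 is not an option, roughly the same result can be had with the standard library alone. The following is a sketch, not a drop-in replacement for the answers above. The names AnchorIdCollector and collect_anchor_ids are made up for illustration. It relies on html.parser.HTMLParser, which tolerates messy HTML and reports tag names in lowercase.

```python
from html.parser import HTMLParser


class AnchorIdCollector(HTMLParser):
    """Collects the id attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "id" and value is not None:
                    self.ids.append(value)


def collect_anchor_ids(html):
    parser = AnchorIdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


cont = """<a id="test1" class="SSSS" title="DDDD" href="AAAA">EXAMPLE1</a>.....<a id="test2" class="GGGG" title="ZZZZ" href="VVVV">EXAMPLE2</a>....
"""
print(collect_anchor_ids(cont))
```

Ids on other tags are ignored, which matches the requirement that only a tags count.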
