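Capturing data between specific tags in Python

Time: 2015-11-01 13:19:32

Tags: python find

I am fetching the contents of a URL in Python, and I want to capture everything between the <h1></h1> tags.

What I tried: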

myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
if '<h1>' in myString:
    startString='<h1>'
    endString='</h1>'
    print myString[myString.find(startString)+len(startString):myString.find(endString)]

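I have multiple h1 tags, but this only captures the data between the first pair of h1 tags.

How can I capture the data between all of the h1 tags?

5 answers:

Answer 0: (score: 1)

You can do this with a simple regular expression:
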
import re
print re.findall(r'<h1>(.*?)</h1>', myString)
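
Note that `.` does not match a newline by default, so the pattern above would miss a heading whose text spans more than one line. Below is a minimal sketch in Python 3 syntax, assuming the headings may wrap across lines or use uppercase tags. The sample string `html` is made up for illustration.

```python
import re

# Hypothetical sample; the second heading wraps across two lines
# and uses uppercase tag names.
html = "<h1>first</h1>\ntext\n<H1>second\nline</H1>\n"

# re.DOTALL lets "." match newlines; re.IGNORECASE also accepts <H1>.
pattern = re.compile(r"<h1>(.*?)</h1>", re.DOTALL | re.IGNORECASE)
headings = pattern.findall(html)
print(headings)
```

The non-greedy `.*?` is what keeps each match from running on to the last `</h1>` in the string.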

Another approach is to use Beautiful Soup as an HTML parser (this is the preferred way if you want to parse real-world HTML pages):

from bs4 import BeautifulSoup
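# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(myString, 'html.parser')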
print [h1.string for h1 in soup.find_all('h1')]

BeautifulSoup is not part of the standard library, so you need to install it yourself. You can install it easily with pip:

pip install beautifulsoup4
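
If installing a third-party package is not an option, the standard library's `html.parser` module can do a similar job. This is a different technique from Beautiful Soup, shown only as a rough sketch in Python 3 syntax; the class name `H1Collector` and the `sample` string are made up for illustration.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text found inside every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is appended to the current heading.
        if self.in_h1:
            self.headings[-1] += data


# Hypothetical sample input.
sample = "<h1>first</h1>\nnoise\n<h1>second <b>bold</b></h1>\n"
collector = H1Collector()
collector.feed(sample)
collector.close()
print(collector.headings)
```

Unlike the regular expression, this keeps the text of nested tags and drops the markup itself.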

Answer 1: (score: 1)

Use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
>>> soup = BeautifulSoup(myString, 'html.parser')
>>> h1 = soup.select('h1')
>>> for i in h1:
...     print i.text
...
kgkgjgjgkjgkjgkj
kdfgggggggggggggggggggkgjgjgkjgkjgkj
kgkgjgjgkdfgdfgdgdfjgkjgkj
kgkgjgjsdssssssssssssssssssssgkjgkjgkj
kgkgjgjgkjgkjgkgggggggggggggggggggj
>>> 

Answer 2: (score: 1)

I would go with BeautifulSoup. Here is my attempt:

from bs4 import BeautifulSoup
import requests

url = 'http://accessibility.psu.edu/headingshtml/'

response = requests.get(url).content

soup = BeautifulSoup(response, 'lxml')

h1tags = soup.find_all('h1')

for singleTag in h1tags:
    print singleTag.text

This prints (in this case there is only one h1 tag):

Heading Tags (H1, H2, H3, P) in HTML

Answer 3: (score: 0)

A working example with Beautiful Soup

bool(true)

Answer 4: (score: 0)

A simple list comprehension solution:

print [s.split('</h1>')[0] for s in myString.split('<h1>')[1:]]
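
For comparison, the `find`-based attempt from the question can also be extended to cover every tag by passing a start offset to `str.find`, so that each search resumes after the previous match. This is a minimal sketch in Python 3 syntax; the helper name `between_all` and the `sample` string are made up for illustration.

```python
def between_all(text, start, end):
    """Return every substring of text found between start and end markers."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further opening marker
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break  # opening marker without a closing one
        results.append(text[begin:finish])
        pos = finish + len(end)  # resume searching after this match
    return results


# Hypothetical sample input.
sample = "<h1>one</h1>\nnoise\n<h1>two</h1>\n"
print(between_all(sample, "<h1>", "</h1>"))
```

The version in the question always searched from index 0, which is why it kept returning the first heading.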