I am fetching the content of a URL in Python, and I want to capture everything between <h1> and </h1>.
Here is what I tried:
myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
if '<h1>' in myString:
    startString = '<h1>'
    endString = '</h1>'
    # find() returns only the first match, so this prints just the first heading
    print(myString[myString.find(startString) + len(startString):myString.find(endString)])
I have multiple h1 tags, but this only captures the data between the first pair.
How can I capture the data between all of the h1 tags?
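Answer 0 (score: 1)
You can do this with a simple regular expression:
import re
# The non-greedy (.*?) stops at the nearest closing tag, so each heading is matched separately
print(re.findall(r'<h1>(.*?)</h1>', myString))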
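The pattern above assumes that every heading sits on a single line and that the tag is written exactly as <h1>. Below is a slightly more tolerant sketch, assuming headings may carry attributes, use uppercase tags, or span several lines. The name extract_h1 and the sample markup are made up for illustration.

```python
import re

# Tolerates attributes (<h1 class="x">), uppercase tags, and newlines inside the heading.
H1_PATTERN = re.compile(r'<h1\b[^>]*>(.*?)</h1\s*>', re.IGNORECASE | re.DOTALL)

def extract_h1(html):
    """Return the stripped inner text of every <h1> element found by the regex."""
    return [match.strip() for match in H1_PATTERN.findall(html)]

# Illustrative input only, not taken from a real page.
sample = '<h1>first</h1>\ntext\n<H1 class="title">second\nline</H1>'
print(extract_h1(sample))
```

Even with these flags, a regular expression can still trip over nested or malformed markup, which is why an HTML parser is usually the safer choice for real pages.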
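Another approach is to use Beautiful Soup as an HTML parser (this is the preferred way if you want to parse real-world HTML pages):
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(myString, 'html.parser')
print([h1.string for h1 in soup.find_all('h1')])
BeautifulSoup is not part of the standard library, so you need to install it yourself. You can install it easily with pip: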
pip install beautifulsoup4
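If installing a third-party package is not an option, the standard library's html.parser module on Python 3 can do a similar job. This is only a minimal sketch, not a replacement for Beautiful Soup: the class name H1Collector and the sample markup are hypothetical, and it assumes h1 elements are not nested inside each other.

```python
from html.parser import HTMLParser

class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading texts
        self._parts = None   # text pieces of the <h1> currently open, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> is reported as 'h1' too.
        if tag == 'h1':
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears while an <h1> is open.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h1' and self._parts is not None:
            self.headings.append(''.join(self._parts).strip())
            self._parts = None

collector = H1Collector()
collector.feed('<h1>one</h1>\nnoise\n<h1>two <b>bold</b></h1>')
collector.close()
print(collector.headings)
```

Text inside child tags such as <b> is included because handle_data fires for it while the h1 is still open.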
Answer 1 (score: 1)
Use the BeautifulSoup parser.
>>> from bs4 import BeautifulSoup
>>> myString='''<h1>kgkgjgjgkjgkjgkj</h1>
<h1>kdfgggggggggggggggggggkgjgjgkjgkjgkj</h1>
dsfgdfgg
<h1>kgkgjgjgkdfgdfgdgdfjgkjgkj</h1>
dfgdffdgf
<h1>kgkgjgjsdssssssssssssssssssssgkjgkjgkj</h1>
dfgdfgdg
<h1>kgkgjgjgkjgkjgkgggggggggggggggggggj</h1>
'''
>>> soup = BeautifulSoup(myString, 'html.parser')
>>> h1 = soup.select('h1')
>>> for i in h1:
...     print(i.text)
...
kgkgjgjgkjgkjgkj
kdfgggggggggggggggggggkgjgjgkjgkjgkj
kgkgjgjgkdfgdfgdgdfjgkjgkj
kgkgjgjsdssssssssssssssssssssgkjgkjgkj
kgkgjgjgkjgkjgkgggggggggggggggggggj
>>>
Answer 2 (score: 1)
I would go with BeautifulSoup. Here is my attempt:
from bs4 import BeautifulSoup
import requests
url = 'http://accessibility.psu.edu/headingshtml/'
respons = requests.get(url).content
soup = BeautifulSoup(respons,'lxml')
h1tags = soup.find_all('h1')
for singleTag in h1tags:
    print(singleTag.text)
This prints (in this case the page has only one h1 tag):
Heading Tags (H1, H2, H3, P) in HTML
Answer 3 (score: 0)
A working example with Beautiful Soup
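Answer 4 (score: 0)
A simple list comprehension solution:
# Split on the opening tag, drop the text before the first heading, then cut each piece at the closing tag
print([s.split('</h1>')[0] for s in myString.split('<h1>')[1:]])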
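The same idea can also be written with str.find in a loop, which stays closest to the code in the question: the second argument of find tells it where to resume searching, so each pass picks up the next heading. This is a minimal sketch; the function name find_all_between and the sample string are made up for illustration.

```python
def find_all_between(text, start, end):
    """Return every substring of text enclosed by the start and end markers."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more opening markers
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break  # unmatched opening marker: stop rather than guess
        results.append(text[begin:finish])
        pos = finish + len(end)  # resume the search after this closing marker
    return results

# Illustrative input only.
sample = '<h1>a</h1>\nx\n<h1>b</h1>'
print(find_all_between(sample, '<h1>', '</h1>'))
```

Like the split version, this treats the HTML as plain text, so it only works when the tags appear exactly as written.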