我正在尝试从文本中解析锚标签中的href,我尝试了以下代码
from flask import Flask,render_template
import requests
import re
app = Flask(__name__)
@app.route('/')
def products():
getprd = requests.get('API')
jsonobj = getprd.text
produ= getprd.json()
prd = produ['items'][0]['id']
htmlcode = produ['items'][0]['description']
htmlcodetxt =str(htmlcode)
return render_template('productdisp.html',
prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ =='__main__':
app.run(debug=True)
和htmlcodetxt包含文本
<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>
答案 0 :(得分:0)
一种方法是使用像这样的HTMLParser模块来解析htmlcodetxt字符串中的href链接。
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
# Parse the 'anchor' tag.
if tag == "a":
# Check the list of defined attributes
for name, value in attrs:
# If href is defined, print it.
if name == "href":
print name, "=", value
# Declare it and feed it your HTML content that you want parsed for the href tag.
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
我不确定您的应用处理程序如何工作,但是也许您可以尝试这样的事情?
from flask import Flask,render_template
import requests
import re
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if tag == "a":
for name, value in attrs:
if name == "href":
print name, "=", value
app = Flask(__name__)
@app.route('/')
def products():
getprd = requests.get('API')
jsonobj = getprd.text
produ= getprd.json()
prd = produ['items'][0]['id']
htmlcode = produ['items'][0]['description']
htmlcodetxt =str(htmlcode)
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
return render_template('productdisp.html',
prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ =='__main__':
app.run(debug=True)
例如,在不使用flask的情况下,并且在使用您发布的html代码示例的情况下,以下内容将起作用并返回预期的输出。
#!/usr/bin/python
content = '<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>'
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
if tag == "a":
for name, value in attrs:
if name == "href":
print name, "=", value
parser = MyHTMLParser()
parser.feed(content)
示例输出:
$ ./html_parse.py
href = https://dl.dropbox.com/s/saa.pdf?dl=1
href = https://dl.dropbox.com/s/ds.png?dl=1