How do I parse links from text in Python?

Time: 2018-10-15 17:20:27

Tags: html python-2.7 parsing flask text

I am trying to extract the href values of the anchor tags in a block of text. This is the code I have so far:

from flask import Flask, render_template
import requests
import re

app = Flask(__name__)


@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)
    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)


if __name__ == '__main__':
    app.run(debug=True)

and htmlcodetxt contains the following text:

<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>

1 Answer:

Answer 0 (score: 0)

One way is to use the HTMLParser module to pull the href links out of the htmlcodetxt string, like this:

# Python 2.7 module name; in Python 3 the class lives in html.parser.
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest.
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value


# Create the parser and feed it the HTML content to scan for href attributes.
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
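If you need the links as data rather than printed output, the same idea can be sketched with a parser that stores each href in a list. The class name LinkCollector and the shortened sample string are made up for illustration, and the import fallback is only there so the sketch runs on both Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LinkCollector(HTMLParser):
    """Hypothetical variant of MyHTMLParser that stores hrefs instead of printing them."""

    def __init__(self):
        # Explicit base-class call; HTMLParser is an old-style class on Python 2.7.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors whose href is missing or empty.
            if name == "href" and value:
                self.links.append(value)


# Shortened stand-in for htmlcodetxt (assumption: one anchor from the question's sample).
sample = ('<p><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" '
          'class="fakeButton">Specification Sheet</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

After feed() returns, collector.links holds the href values in document order, so they can be passed on to other code instead of only appearing on stdout.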

I'm not sure how your app handler works, but maybe you could try something like this?

from flask import Flask, render_template
import requests
import re
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value


app = Flask(__name__)


@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)

    parser = MyHTMLParser()
    parser.feed(htmlcodetxt)

    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)


if __name__ == '__main__':
    app.run(debug=True)
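The handler above only prints the links to the server console. If the template should receive them, one possible approach is a small helper that returns (href, link text) pairs. This is a rough sketch: AnchorParser and extract_anchors are hypothetical names, the sample string is a shortened copy of the anchors in the question, and Flask is deliberately left out so the snippet runs on its own.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class AnchorParser(HTMLParser):
    """Hypothetical parser that records (href, text) for every <a href=...>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Collect text only while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    return parser.anchors


# Shortened stand-in for htmlcodetxt (assumption: the two anchors from the question).
sample = ('<p><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" '
          'class="fakeButton">Specification Sheet</a><br></strong></p>'
          '<p><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" '
          'class="fakeButton2">Photometric Data</a><br></strong></p>')

anchors = extract_anchors(sample)
for href, text in anchors:
    print("%s -> %s" % (text, href))
```

Inside products(), the result of extract_anchors(htmlcodetxt) could then be handed to render_template as one more keyword argument, assuming productdisp.html is changed to use it.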

For example, without Flask and using the HTML sample you posted, the following works and prints the expected output.

#!/usr/bin/python

content = '<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>'


from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value


parser = MyHTMLParser()
parser.feed(content)

Sample output:

$ ./html_parse.py 
href = https://dl.dropbox.com/s/saa.pdf?dl=1
href = https://dl.dropbox.com/s/ds.png?dl=1