Extracting data from HTML tags in Python

Time: 2017-11-23 05:59:23

Tags: python html python-3.x

I want to extract the text (a paragraph) from inside HTML tags in Python:
 <p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">

 Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

 </span></p>

My code is:

    from HTMLParser import HTMLParser
    from bs4 import BeautifulSoup

    x = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""

    p1 = HTMLParser()
    p1.unescape(x)
    bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
    print(bdy_soup)

This code does not return anything. Please help me get this working; any help would be greatly appreciated.

4 answers:

Answer 0: (score: 1)

  1. Use html.unescape to convert the HTML character entities into plain characters.
  2. Use bs4.BeautifulSoup(html_content).text to extract the content.

    >>> x = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""

    >>> import html
    >>> xx = html.unescape(x)
    >>> xx
    '<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'

    >>> import bs4
    >>> bs4.BeautifulSoup(xx, "html").text
    ' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '
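
The two steps above can also be sketched with the standard library alone, for environments where Beautiful Soup is not installed. This swaps bs4 for a small html.parser.HTMLParser subclass. The names TextCollector and extract_text and the shortened sample string are illustrative assumptions, not part of the answer.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def extract_text(escaped_html):
    """Unescape entity-encoded markup, then return its text content."""
    markup = html.unescape(escaped_html)  # step 1: &lt;p&gt; becomes <p>
    collector = TextCollector()
    collector.feed(markup)                # step 2: parse, keeping text only
    collector.close()
    return "".join(collector.chunks).strip()


# Shortened, hypothetical version of the escaped sample from the question.
escaped = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;"
    "&lt;span style=&quot;font-size: small;&quot;&gt; "
    "Irrespective of the kind of small business you own. "
    "&lt;/span&gt;&lt;/p&gt;"
)
print(extract_text(escaped))
```

Unlike get_text, this sketch has no separator option; it simply concatenates the text nodes and strips the surrounding whitespace.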
    

Answer 1: (score: 1)

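You can do it like this (Python 2). Install beautifulsoup4 first; HTMLParser is part of the Python 2 standard library.

    # Python 2 code: HTMLParser is the Python 2 module name and print is a statement
    from HTMLParser import HTMLParser
    from bs4 import BeautifulSoup

    # Entity-escaped markup, split into adjacent literals so the string contains no line break
    p = ("&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span "
         "style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;")

    p1 = HTMLParser()
    # Decode the entities, parse the result, and join the text nodes with newlines
    bdy_soup = BeautifulSoup(p1.unescape(p)).get_text(separator="\n")
    print bdy_soup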
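
HTMLParser().unescape was deprecated in Python 3.4 and removed in Python 3.9, where html.unescape is the replacement. A minimal sketch of an import that works on both Python 2 and Python 3 follows; the sample string is a made-up example, not data from the question.

```python
try:
    from html import unescape          # Python 3.4 and later
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape

# Hypothetical escaped markup used only for illustration.
escaped = "&lt;p&gt;Hello &amp; welcome&lt;/p&gt;"
print(unescape(escaped))  # <p>Hello & welcome</p>
```

The decoded string can then be passed to BeautifulSoup exactly as in the answer.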

Answer 2: (score: 0)

You can use a regular expression to extract the data between two HTML tags:

r'<title[^>]*>([^<]+)</title>'
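
As a rough sketch, the pattern above can be generalized to any tag name and applied with re.findall. The helper text_between and both sample strings are assumptions made for illustration. Because the capture group is [^<]+, the pattern only matches elements whose content has no nested tags, which is why the outer p in the second sample yields nothing.

```python
import re


def text_between(tag, markup):
    """Return the text of every <tag>...</tag> pair that has no nested tags."""
    pattern = r'<{0}[^>]*>([^<]+)</{0}>'.format(re.escape(tag))
    return re.findall(pattern, markup)


# Hypothetical inputs used only for illustration.
page = "<html><head><title>Small business marketing</title></head></html>"
snippet = '<p style="text-align: justify;"><span style="font-size: small;"> Hello </span></p>'

print(text_between("title", page))    # ['Small business marketing']
print(text_between("span", snippet))  # [' Hello ']
print(text_between("p", snippet))     # [] because the <p> contains a nested <span>
```

For markup with nesting, an HTML parser is usually the more robust choice; the regular expression is best kept for simple, flat elements such as a title.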

Answer 3: (score: 0)

The code worked after I installed the lxml parser. Thank you, everyone, for your help.

    import html
    import bs4

    x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

    # Decode the entities back into markup, then parse it with the lxml parser
    p1 = html.unescape(x)
    bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")
    print(bdy_soup)
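
Since the fix here depended on which parser was installed, one option is to name the parser explicitly and fall back to Python's built-in "html.parser" when lxml is missing. The sketch below assumes that approach; pick_parser and extract_text are hypothetical helper names, and bs4 is imported lazily so that the parser check itself needs only the standard library.

```python
import importlib.util


def pick_parser():
    """Return "lxml" when it is importable, otherwise the built-in parser."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


def extract_text(markup, separator="\n"):
    """Parse markup with an explicitly named parser and return its text."""
    from bs4 import BeautifulSoup  # requires the beautifulsoup4 package
    return BeautifulSoup(markup, pick_parser()).get_text(separator=separator)


print(pick_parser())
```

With beautifulsoup4 installed, a call such as extract_text("<p><span> Hello </span></p>") would then use whichever of the two parsers is available instead of leaving the choice to Beautiful Soup.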