I want to extract the text of a paragraph from HTML tags in Python:
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
My code is:
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup
x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
p1 = HTMLParser()
p1.unescape(x)
bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
print(bdy_soup)
This code does not return anything. Please help me with this; any help would be greatly appreciated.
Answer 0 (score: 1)
html.unescape converts HTML character references (such as &quot;) back into the characters they stand for, and bs4.BeautifulSoup(html_content).text extracts the text content:
>>> x = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
>>> import html
>>> xx = html.unescape(x)
>>> xx
'<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'
>>> import bs4
>>> bs4.BeautifulSoup(xx, "html").text
' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '
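To see why the unescape step matters, here is a minimal sketch that assumes the markup arrives entity-escaped. The `escaped` string is a made-up example, not the asker's actual data, and only the standard library is used:

```python
import html

# Hypothetical entity-escaped input: the tags are stored as &lt; ... &gt;
escaped = "&lt;p style=&quot;text-align: justify;&quot;&gt;Hello &amp; welcome&lt;/p&gt;"

# html.unescape turns character references back into the characters they name.
unescaped = html.unescape(escaped)
print(unescaped)
```

A parser treats the escaped form as plain text with no tags to strip, so it is the unescaped string that should be handed to BeautifulSoup.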
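Answer 1 (score: 1)
You can do it like this. Install beautifulsoup4 first; HTMLParser is part of the Python 2 standard library (in Python 3 its unescape function lives in the html module).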
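from bs4 import BeautifulSoup
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 standard library
    unescape = HTMLParser().unescape

# Triple quotes let the string contain the double quotes used by the attributes.
p = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
# Decode HTML entities first, then let BeautifulSoup strip the tags.
bdy_soup = BeautifulSoup(unescape(p), "html.parser").get_text(separator="\n")
print(bdy_soup)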
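If installing third-party packages is not an option, the standard library parser can collect the text on its own. The following is only a sketch for Python 3; `TextExtractor`, `extract_text`, and the shortened `sample` string are hypothetical names and data introduced for illustration:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup, separator="\n"):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return separator.join(parser.chunks)


sample = '<p style="text-align: justify;"><span style="font-size: small;"> Irrespective of the kind of small business you own. </span></p>'
print(extract_text(sample))
```

This handles well-formed snippets like the one in the question; BeautifulSoup remains the more forgiving choice for messy real-world pages.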
Answer 2 (score: 0)
You can use a regular expression to extract the data between two HTML tags:
r'<title[^>]*>([^<]+)</title>'
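As a rough illustration of how such a pattern is applied, the sketch below runs the `<title>` pattern with `re.search` and then adapts it to `<p>` elements. The `PARAGRAPH` and `TAG` patterns, the `paragraphs` helper, and the `sample` string are assumptions made for this example, not part of the answer:

```python
import re

sample = '<title>Demo</title><p style="text-align: justify;"><span style="font-size: small;"> Some text. </span></p>'

# The pattern from the answer: capture the text inside <title>...</title>.
title_match = re.search(r'<title[^>]*>([^<]+)</title>', sample)
title = title_match.group(1) if title_match else None

# Hypothetical adaptation for paragraphs. \b keeps <p> from matching <pre>,
# and the non-greedy (.*?) with DOTALL lets a match span nested tags and newlines.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def paragraphs(markup):
    # Remove any remaining inner tags (such as <span>) from each captured paragraph.
    return [TAG.sub("", inner).strip() for inner in PARAGRAPH.findall(markup)]


print(title)
print(paragraphs(sample))
```

Regular expressions work for small, predictable snippets but tend to break on nested or malformed markup, where an HTML parser is the safer tool.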
Answer 3 (score: 0)
The code worked after installing the lxml parser. Thank you, everyone, for your help.
import html
import bs4  # the "lxml" parser used below requires the lxml package to be installed

x = """<p style="text-align: justify;"><span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""
p1 = html.unescape(x)  # decode entities such as &quot; before parsing
bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")  # newline is "\n", not "/n"
print(bdy_soup)