I am trying to extract article content from a website using BeautifulSoup with the code below:
site = 'http://www.example.com'
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
content = soup.find_all('p')
content=str(content)
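The content object contains all of the main text from the page that sits inside 'p' tags, but other tags are still present in the output, as shown in the image below. I would like to remove every character enclosed in a matching pair of < > brackets, along with the brackets themselves, so that only the text remains.
I have tried the following, but it does not seem to work.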
' '.join(item for item in content.split() if not (item.startswith('<') and item.endswith('>')))
What is the best way to remove substrings in a string that start and end with a certain pattern, such as < >?
Answer 0 (score: 11)
Using a regex:
re.sub('<[^<]+?>', '', text)
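As a minimal, self-contained sketch of the regex approach (the helper name `strip_tags` and the sample string are illustrative and not part of the original answer), assuming reasonably well-formed markup with no bare `<` in the text:

```python
import re

# "<", then one or more characters that are not "<" (matched lazily), then ">".
TAG_RE = re.compile(r'<[^<]+?>')

def strip_tags(text):
    """Remove anything that looks like an HTML tag; entities such as &amp; are left untouched."""
    return TAG_RE.sub('', text)

sample = '<p>Hello <a href="http://example.com/">world</a>!</p>'
print(strip_tags(sample))
```

A regex is not an HTML parser: the contents of script and style elements, for example, would remain in the result.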
Using BeautifulSoup (solution from here):
# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
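Using NLTK:
# Python 2 with NLTK 2.x only: NLTK 3 removed clean_html() (it now raises
# NotImplementedError and recommends BeautifulSoup's get_text() instead)
import nltk
from urllib import urlopen
url = "https://stackoverflow.com/questions/tagged/python"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)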
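Answer 1 (score: 6)
You can use get_text():
# content must still be the list returned by soup.find_all('p'), not str(content)
for i in content:
    print(i.get_text())
The following example is from the docs: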
>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup)
>>> soup.get_text()
u'\nI linked to example.com\n'
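Answer 2 (score: 1)
You need to use the strings generator:
# find_all() returns a list of tags, so read .strings from each tag
for tag in content:
    for text in tag.strings:
        print(text)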
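Answer 3 (score: 0)
Pyparsing makes it easy to write an HTML stripper: define a pattern that matches all opening and closing HTML tags, then transform the input using that pattern as a suppressor. This still leaves the &xxx; HTML entities to be converted; you can use xml.sax.saxutils.unescape to do that: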
source = """
<p><strong>Editors' Pick: Originally published March 22.<br /> <br /> Apple</strong> <span class=" TICKERFLAT">(<a href="/quote/AAPL.html">AAPL</a> - <a href="http://secure2.thestreet.com/cap/prm.do?OID=028198&ticker=AAPL">Get Report</a><a class=" arrow" href="/quote/AAPL.html"><span class=" tickerChange" id="story_AAPL"></span></a>)</span> is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well.</p>
<p>"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.</p>
<p>The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.</p>
<div class=" butonTextPromoAd">
<div class=" ym" id="ym_44444440"></div>"""
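from pyparsing import anyOpenTag, anyCloseTag
from xml.sax.saxutils import unescape
# unescape() handles &amp;, &lt; and &gt; itself; map the remaining entities explicitly
unescape_xml_entities = lambda s: unescape(s, {"&apos;": "'", "&quot;": '"', "&nbsp;": " "})
# suppress() drops every matched opening or closing tag from the output
stripper = (anyOpenTag | anyCloseTag).suppress()
print(unescape_xml_entities(stripper.transformString(source)))
Gives: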
Editors' Pick: Originally published March 22. Apple (AAPL - Get Report) is waking up the echoes with the reintroduction of a 4-inch iPhone, a model its creators hope will lead the company to victory not just in emerging markets, but at home as well.
"There's significant pent-up demand within Apple's base of iPhone owners who want a smaller iPhone with up-to-date specs and newer features," Jackdaw Research Chief Analyst Jan Dawson said in e-mailed comments.
The new model, dubbed the iPhone SE, "should unleash a decent upgrade cycle over the coming months," Dawson said. Prior to the iPhone 6 and 6 Plus, introduced in 2014, Apple's iPhones were small, at 3.5 inches and 4 inches tall, respectively, compared with models by Samsung and others that approached 6 inches.
(In the future, please do not provide sample text or code as non-copy-pasteable images.)
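On Python 3, the standard library's `html.unescape` decodes named and numeric character references, so a hand-written entity table may not be needed. A small sketch, where the sample string is made up for illustration:

```python
from html import unescape

# Text as it might look after the tags are stripped but the entities remain.
stripped = "Editors&apos; Pick &amp; more&nbsp;&quot;news&quot; &#169; 2016"

decoded = unescape(stripped)
# &nbsp; decodes to U+00A0 (no-break space); swap it for a plain space if preferred.
decoded = decoded.replace("\u00a0", " ")
print(decoded)
```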
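Answer 4 (score: 0)
If you are restricted from using any libraries, you can remove the HTML tags with the code below.
I have only corrected your attempt. Thanks for the idea.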
content="<h4 style='font-size: 11pt; color: rgb(67, 67, 67); font-family: arial, sans-serif;'>Sample text for display.</h4> <p> </p>"
' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
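If the restriction only rules out third-party packages, one possible alternative is the standard library's `html.parser.HTMLParser` (Python 3). The sketch below is not this answer's method; the class name `TextExtractor` and the helper `strip_html` are hypothetical names chosen for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (including no-break spaces) into single spaces.
    return ' '.join(''.join(parser.parts).split())

content = "<h4 style='font-size: 11pt;'>Sample text for display.</h4> <p>&nbsp;</p>"
print(strip_html(content))
```

Note that this keeps the text inside script and style elements as well, so those would need separate handling on full pages.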
Answer 5 (score: 0)
A simple algorithm that works in any language without any modules or other libraries. The code is self-documenting:
def removetags_fc(data_str):
    appendingmode_bool = True
    output_str = ''
    for char_str in data_str:
        if char_str == '<':
            appendingmode_bool = False
        elif char_str == '>':
            appendingmode_bool = True
            continue
        if appendingmode_bool:
            output_str += char_str
    return output_str
For a better implementation, the literals '<' and '>' should be instantiated in memory before the loop starts.
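A sketch of the character-by-character approach, written here to collect characters in a list and join them once at the end (`remove_tags` and the sample inputs are hypothetical, chosen for illustration). The second call shows a limitation of this kind of scanner: a bare `<` in ordinary text is treated as the start of a tag.

```python
def remove_tags(data):
    """Drop everything from each '<' up to and including the next '>'."""
    inside_tag = False
    kept = []
    for ch in data:
        if ch == '<':
            inside_tag = True       # start skipping
        elif ch == '>':
            inside_tag = False      # stop skipping; the '>' itself is dropped
        elif not inside_tag:
            kept.append(ch)
    return ''.join(kept)

print(remove_tags('<p>Hello <b>world</b></p>'))
# A lone '<' swallows the rest of the string because no '>' follows it.
print(repr(remove_tags('if a < b then stop')))
```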