I am trying to extract the article text from a web page that has the following HTML:
<html>
<body>
<div id='article_body'>
<h2 class='article_subtitle'>subtitle_1</h2>
<p class='article_paragraph'>text_1</p>
<p class='article_paragraph'>text_2</p>
<p class='article_paragraph'>text_3</p>
<h2 class='article_subtitle'>subtitle_2</h2>
<p class='article_paragraph'>text_4</p>
<p class='article_paragraph'>text_5</p>
<h4 class='videoTitle'>I DONT WANT THIS TEXT</h4>
</div>
</body>
</html>
Here is what I tried:
import urllib.request
from bs4 import BeautifulSoup
url = "http://......."
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
article_text = ''
article = soup.find('div', {'id': 'article_body'}).find_all(text=True)
for element in article:
    article_text += '\n' + ''.join(element)
print(article_text)
But then I also get the text from the <h4>. Any suggestions on how to avoid this?
Answer 0 (score: 0)
Try this:
soup = BeautifulSoup(source, 'lxml')
article_text = ''
article = soup.find('div', {'id': 'article_body'}).find_all()
for element in article:
    # Skip <h4> elements by tag name rather than by searching their string form
    if element.name == 'h4':
        continue
    # To keep the HTML tags, use this line instead:
    # article_text += '\n' + str(element)
    # Inner text only:
    article_text += '\n' + element.text
print(article_text)
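To check the idea without fetching a page, here is a minimal sketch that runs it against the sample HTML from the question. It assumes the standard-library `html.parser` backend instead of `lxml`, and `SAMPLE_HTML` and `extract_article_text` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the closing </p> tags corrected.
SAMPLE_HTML = """
<html>
<body>
<div id='article_body'>
<h2 class='article_subtitle'>subtitle_1</h2>
<p class='article_paragraph'>text_1</p>
<p class='article_paragraph'>text_2</p>
<p class='article_paragraph'>text_3</p>
<h2 class='article_subtitle'>subtitle_2</h2>
<p class='article_paragraph'>text_4</p>
<p class='article_paragraph'>text_5</p>
<h4 class='videoTitle'>I DONT WANT THIS TEXT</h4>
</div>
</body>
</html>
"""


def extract_article_text(html, skip_tags=("h4",)):
    """Hypothetical helper: join the text of the article's child tags,
    leaving out any tag whose name is listed in skip_tags."""
    # Assumption: html.parser is used here so the sketch needs no lxml install.
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="article_body")
    if body is None:
        return ""
    # recursive=False limits the search to direct children of the div,
    # so nested tags would not contribute their text twice.
    parts = [
        child.get_text(strip=True)
        for child in body.find_all(True, recursive=False)
        if child.name not in skip_tags
    ]
    return "\n".join(parts)


print(extract_article_text(SAMPLE_HTML))
```

Passing a different `skip_tags` tuple would let you drop other unwanted elements the same way.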
Answer 1 (score: 0)
Assuming everything you want has a class that starts with article:
articles = soup.find_all(class_ = lambda c: c and c.startswith('article'))
# find_all returns Tag objects, so take each tag's text before joining
print('\n'.join(tag.get_text() for tag in articles))
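As a rough, self-contained sketch of the class-prefix approach, the snippet below filters a shortened copy of the question's markup. The helper name `text_by_class_prefix` and the `SNIPPET` constant are hypothetical, and `html.parser` is assumed in place of `lxml`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup, used only for illustration.
SNIPPET = """
<div id='article_body'>
<h2 class='article_subtitle'>subtitle_1</h2>
<p class='article_paragraph'>text_1</p>
<h4 class='videoTitle'>I DONT WANT THIS TEXT</h4>
</div>
"""


def text_by_class_prefix(html, prefix="article"):
    """Hypothetical helper: return the text of every tag that has a
    CSS class starting with the given prefix."""
    # Assumption: html.parser keeps the sketch free of the lxml dependency.
    soup = BeautifulSoup(html, "html.parser")
    # The filter may be called with None for tags that have no class,
    # so guard before calling startswith.
    matches = soup.find_all(
        class_=lambda c: c is not None and c.startswith(prefix)
    )
    return [tag.get_text(strip=True) for tag in matches]


print("\n".join(text_by_class_prefix(SNIPPET)))
```

The outer div is not matched here because it has an id of article_body but no class, so its text (including the h4) is not pulled in.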