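I'm new to programming, Python, and BS4, and I'm hoping to get better through a web scraping project. I have a bunch of similar pages that contain information I want to separate out. This is the template I need to work with: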
<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
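The "Directed By" and "Written By" information is easy to collect, but I'd also like to get the synopsis and the cast paragraph. The problem is that the synopsis is not always three paragraphs long on the site (sometimes fewer, sometimes more), so I can't hard-code it. My idea was to use the word "Synopsis" in the text as a starting point, pick a closing point, and collect everything in between; I'm just not sure how to implement that. I tried using regex, but I don't know it well, and I don't know how to handle HTML tags in a regex.
Any help would be greatly appreciated.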
Answer 0 (score: 1)
from bs4 import BeautifulSoup
text = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""
soup = BeautifulSoup(text, "html.parser")
synopsis = ''
for para in soup.find_all("p"):
    # Stop at the writer/director paragraph; everything before it is synopsis.
    if para.get('class') == ['writerDirector']:
        break
    synopsis += para.text + '\n'
print(synopsis)
Output:
First part of synopsis
Second part of paragraph
Third part of paragraph
Getting the cast requires a bit of hard-coding:
cast_text = text[text.index('<h4>Cast</h4>'):]
soup = BeautifulSoup(cast_text, "html.parser")
cast_members = ''
for para in soup.find_all('p'):
    cast_members += para.text + '\n'
print(cast_members)
Output:
List of the cast in one line
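As an alternative to slicing the raw string, here is a minimal sketch that finds the cast paragraph through the parse tree instead. It assumes the cast list is always the first p element after the h4 "Cast" heading, as in the template; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

text = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""

soup = BeautifulSoup(text, "html.parser")

# Find the <h4> whose text is exactly "Cast".
cast_heading = soup.find("h4", string="Cast")

# Assumption: the cast list is the first <p> sibling after that heading.
cast_paragraph = cast_heading.find_next_sibling("p") if cast_heading else None
cast_members = cast_paragraph.get_text(strip=True) if cast_paragraph else ""
print(cast_members)
```

This avoids text.index(), which raises ValueError when the heading is missing or formatted slightly differently in the raw HTML.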
Answer 1 (score: 0)
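This may capture the essentials of a technique that meets your need.
You know that the content you want begins with the h3 element. From there you walk through its next_siblings. Siblings such as the blank lines ('\n') have a sibling.name of None, so we can safely ignore them. The code below displays sibling.name and the complete sibling for each sibling of the h3 element that is a tag. You have indicated that you already know how to dig into these.
Now all you have to do is write code that notices when it sees the h4 element for 'Cast', so that it can arrange to read one more p element for the players in the cast.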
>>> HTML = '''\
... <h3>Synopsis</h3>
... <p>First part of synopsis</p>
... <p>Second part of paragraph</p>
... <p>Third part of paragraph</p>
... <p class="writerDirector"><strong>Written By:</strong> Writer<br>
... <strong>Directed By:</strong> Director</p>
... <h4>Cast</h4>
... <p>List of the cast in one line</p>
... '''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> h3 = soup.find('h3')
>>> for sibling in h3.next_siblings:
...     if sibling.name:
...         sibling.name
...         sibling
...
'p'
<p>First part of synopsis</p>
'p'
<p>Second part of paragraph</p>
'p'
<p>Third part of paragraph</p>
'p'
<p class="writerDirector"><strong>Written By:</strong> Writer<br/>
<strong>Directed By:</strong> Director</p>
'h4'
<h4>Cast</h4>
'p'
<p>List of the cast in one line</p>
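The remaining step described above could be sketched roughly as follows. This is only one way to do it, assuming the page always follows the template (synopsis paragraphs, then the writerDirector paragraph, then an h4 "Cast" heading followed by one p). The names synopsis_parts, cast, and in_cast are made up for illustration, and isinstance(sibling, Tag) is used here in place of the sibling.name check to skip the text nodes.

```python
from bs4 import BeautifulSoup, Tag

HTML = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
"""

soup = BeautifulSoup(HTML, "html.parser")
h3 = soup.find("h3", string="Synopsis")

synopsis_parts = []
cast = None
in_cast = False  # becomes True once the <h4>Cast</h4> heading is seen

for sibling in h3.next_siblings:
    if not isinstance(sibling, Tag):
        continue  # skip bare text nodes such as "\n"
    if sibling.name == "h4" and sibling.get_text(strip=True) == "Cast":
        in_cast = True
    elif sibling.name == "p" and in_cast:
        cast = sibling.get_text(strip=True)
        break  # assumption: the cast is a single paragraph
    elif sibling.name == "p" and "writerDirector" not in sibling.get("class", []):
        synopsis_parts.append(sibling.get_text(strip=True))

synopsis = "\n".join(synopsis_parts)
print(synopsis)
print(cast)
```

Because the loop only keys on tag names and the writerDirector class, it should not matter how many synopsis paragraphs a given page has.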
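Answer 2 (score: 0)
Assuming you have multiple synopses on one page (or even if you don't), you can iterate over the soup and collect everything between the h3 Synopsis tags: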
from bs4 import BeautifulSoup
html ="""<html><h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<p>Second part of paragraph 2</p>
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br>
<strong>Directed By:</strong> Director 2</p>
<h4>Cast</h4>
<p>List of the cast in one line 2</p></html>"""
soup = BeautifulSoup(html, 'lxml')
value = ""
start = False
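for i in soup.find_all():
    if i.name == 'h3' and i.string == 'Synopsis':
        # A new section starts: flush what was collected for the previous one.
        if start:
            print(value)
            value = ""
        print("Synopsis")
        start = True
    elif i.text is not None and start:
        value = value + " " + i.text
# Print whatever was collected for the last section.
if value:
    print(value)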
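One thing to be aware of: find_all() with no arguments also visits nested tags such as strong, so text inside the writerDirector paragraph ends up in value more than once, and the cast text is mixed in as well. If only the synopsis paragraphs are wanted, a possible variation is to walk the direct siblings of each Synopsis heading. The sketch below assumes each synopsis ends at the first sibling that is not a plain p element; the helper name synopsis_paragraphs and the shortened sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup, Tag

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer</p>
<h4>Cast</h4>
<p>Cast one</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<p class="writerDirector"><strong>Written By:</strong> Writer 2</p>
<h4>Cast</h4>
<p>Cast two</p>"""

soup = BeautifulSoup(html, "html.parser")


def synopsis_paragraphs(heading):
    """Collect plain <p> siblings after `heading` until the synopsis ends."""
    parts = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace text nodes between tags
        # Assumption: the first non-<p> tag or the writerDirector paragraph
        # marks the end of the synopsis.
        if sibling.name != "p" or "writerDirector" in sibling.get("class", []):
            break
        parts.append(sibling.get_text(strip=True))
    return parts


synopses = [synopsis_paragraphs(h3)
            for h3 in soup.find_all("h3", string="Synopsis")]
print(synopses)
```

Each entry in synopses is the list of paragraphs for one Synopsis heading, so pages with one section and pages with several are handled the same way.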