Question

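I can't work out how to use bs4 to extract information that sits several levels deep in a tag hierarchy.

Here is an example of what I'm trying to parse (from www.j-archive.com/showgame.php?game_id=50):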

...
<table>
<tr>
  <td>
    <div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;&lt;i&gt;The Red Badge of Courage&lt;/i&gt;&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kelley&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')" onmouseout="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', 'This classic by Stephen Crane is subtitled &quot;An Episode of the American Civil War&quot;')" onclick="togglestick('clue_DJ_1_1_stuck')">

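Specifically, I want to get the text "The Red Badge of Courage". It is nested under <table>, <tr>, <td> and <div>, and then appears to be part of the onmouseover attribute.

I can extract all of the onmouseover values with the following: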

for tag in soup.findAll(onmouseover=True):
    print(tag['onmouseover'])

But I don't know how to parse the content inside that output.

Thanks in advance.

Answer 1

Since the text you're interested in sits inside an <em> tag, it is easy to pull out with substring indexing:

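import requests
from bs4 import BeautifulSoup

req = requests.get('http://www.j-archive.com/showgame.php?game_id=50')
soup = BeautifulSoup(req.text, 'lxml')

# Markers that surround the answer text inside the onmouseover attribute
tag1 = '<em class="correct_response">'
tag2 = '</em>'

for tag in soup.find_all('div', onmouseover=True):
    # The parser has already decoded &lt;, &quot; etc. in the attribute value
    parseText = str(tag['onmouseover'])
    # str.index raises ValueError if a marker is missing
    i1 = parseText.index(tag1)
    i2 = parseText.index(tag2)
    # Slice out everything between the opening and closing markers
    print(parseText[i1 + len(tag1):i2])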

This code retrieves every entry up to 'Final Jeopardy'.
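As an alternative to slicing by index, the decoded attribute value can itself be handed to BeautifulSoup a second time, since it contains ordinary HTML once the entities are unescaped. The following is only a sketch under some assumptions: `SAMPLE` is the snippet from the question with closing tags added so it stands alone, `correct_responses` is a hypothetical helper name, the built-in `html.parser` is used instead of `lxml` to avoid an extra dependency, and skipping attributes that have no matching `<em>` is my own choice rather than something the page requires.

```python
from bs4 import BeautifulSoup

# Snippet from the question, with closing tags added so it is self-contained.
SAMPLE = """<table>
<tr>
  <td>
    <div onmouseover="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;&lt;i&gt;The Red Badge of Courage&lt;/i&gt;&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kelley&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')" onmouseout="toggle('clue_DJ_1_1', 'clue_DJ_1_1_stuck', 'This classic by Stephen Crane is subtitled &quot;An Episode of the American Civil War&quot;')" onclick="togglestick('clue_DJ_1_1_stuck')"></div>
  </td>
</tr>
</table>
"""


def correct_responses(html):
    """Return the text of every <em class="correct_response"> found
    inside an onmouseover attribute of a <div> (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for div in soup.find_all('div', onmouseover=True):
        # The attribute value is already entity-decoded, so it can be
        # parsed as an HTML fragment in its own right.
        inner = BeautifulSoup(div['onmouseover'], 'html.parser')
        em = inner.find('em', class_='correct_response')
        if em is not None:  # assumption: silently skip non-matching entries
            results.append(em.get_text())
    return results


print(correct_responses(SAMPLE))
```

Because `get_text()` is used, the inner `<i>` wrapper is dropped and only the title text is returned, whereas the slicing version prints the markup between the two markers.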

BeautifulSoup4: parsing a tag hierarchy
