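I'm using Python with BeautifulSoup to go through a page that contains sections whose id values increase by 1, and I'm trying to collect the videos for each section. However, the number of videos varies from one span id to the next, as shown below, and the videos are not nested under the original tr.
Right now I loop over the page to get the span id values, but I'd like a way to get the vid values as an array for each span id.
Here is a sample of the HTML I'm working with: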
<tr>
<td>
<div>
<span class="apple-font" id="001">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099882"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="002">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="003">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="004">
</div>
</td>
</tr>
<tr>
</tr>
Here is the code I'm trying to use, but I haven't made much progress on getting all of the videos:
soup = soup.findAll(class_="apple-font", id=True)
for s in soup:
    n = str(s.get_text().lstrip().replace(".",""))
    print n
print
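Answer 0 (score: 1)
I would use an iterative approach: loop over all the tr elements in the same table, starting from the first <span class="apple-font"> tag, and start a new group each time you find a row that contains a new id: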
table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
... <td>
... <div>
... <span class="apple-font" id="001">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099882"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="002">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="003">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="004">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample))
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
...
>>> print groups
{'003': ['0099883', '0099883'], '002': ['0099883'], '001': ['0099882', '0099883', '0099883'], '004': []}
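For reuse on Python 3, the same loop can be wrapped in a function. The following is a minimal sketch rather than part of the original answer: the function name `group_vids_by_span_id`, the trimmed sample markup, and the vid value `0099884` are made up for illustration. It names `html.parser` explicitly, and it skips vid rows that appear before the first id row, where the loop above would fail because `group` is still `None`.

```python
from bs4 import BeautifulSoup


def group_vids_by_span_id(html):
    """Map each span id to the vid values found in the rows that follow it."""
    # Assumption: html.parser is good enough for this markup. Naming it
    # explicitly keeps the result independent of optional parsers.
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None  # vid list of the latest id row; None before the first one
    for tr in soup.find_all("tr"):
        id_span = tr.find("span", class_="apple-font", id=True)
        if id_span is not None:
            # A row that carries an id span opens a new group.
            current = []
            groups[id_span["id"]] = current
            continue
        vid_link = tr.find("a", vid=True)
        # Ignore vid rows seen before any id row instead of failing on None.
        if vid_link is not None and current is not None:
            current.append(vid_link["vid"])
    return groups


# Trimmed, hypothetical sample in the same shape as the question's HTML.
html = """
<table>
<tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
<tr></tr>
<tr><td><a vid="0099882"></a></td></tr>
<tr><td><a vid="0099883"></a></td></tr>
<tr><td><div><span class="apple-font" id="002"></span></div></td></tr>
<tr><td><a vid="0099884"></a></td></tr>
<tr><td><div><span class="apple-font" id="003"></span></div></td></tr>
</table>
"""

groups = group_vids_by_span_id(html)
print(groups)
# {'001': ['0099882', '0099883'], '002': ['0099884'], '003': []}
```

On Python 3.7 and later a dict keeps insertion order, so the groups come out in document order, whereas the Python 2 output above lists them in arbitrary order.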