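Beautiful Soup: extracting parent/sibling tr table classes

Time: 2014-09-05 14:08:28

Tags: python html beautifulsoup

I'm trying to use bs4, but I'm having some trouble extracting information from the following HTML: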

<table border="1" cellspacing="0" class="browser">
<thead>..</thead>
<tbody class="body">
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
</tbody>
</table>

So what I want is the content between two date rows (the rows with class date), grouped like this:

<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>

<tr class="date">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>
<tr class="right">..</tr>

I tried:

xx = soup.find_all('tbody',{'class':'body'})

and to get the corresponding right rows I did this:

yy = []
for i in xx:
    yy.append( i.find_all('tr',{'class':'right'}) )

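...but this gives me all the right rows at once, whereas I want to know which date row each element in yy belongs to. In short, I want each right row to be associated with its parent date row.

Apologies in advance if the question seems confusing!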

2 answers:

Answer 0: (score: 1)

You have to iterate over the children of the tbody tag. This will work:

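# Get just the direct <tr> children; this skips the whitespace strings between rows
tags = soup.tbody.find_all('tr', recursive=False)
collected_tags = []
latest_date = None
for tag in tags:
    classes = tag.get('class', [])  # rows without a class attribute yield []
    if 'date' in classes:
        # Start a new group keyed by this date row
        collected_tags.append({tag: []})
        latest_date = tag
    elif 'right' in classes and latest_date is not None:
        # Attach the row to the most recent date row
        collected_tags[-1][latest_date].append(tag)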


collected_tags is now a list of dictionaries, each mapping a date tag to the list of right tags that follow it.
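
The same association can also be looked up from the other direction, starting at each right row and asking for the nearest preceding date row. The following is only a minimal sketch, not part of the original answer: the sample HTML and its cell contents ("day 1", "a", and so on) are made up for illustration, and it assumes every right row is a direct sibling of its date row, as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; the cell contents are invented for this sketch.
html = """
<table border="1" cellspacing="0" class="browser">
<tbody class="body">
<tr class="date"><td>day 1</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>day 2</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for right_row in soup.find_all("tr", class_="right"):
    # Nearest earlier sibling <tr> that carries the "date" class, or None
    date_row = right_row.find_previous_sibling("tr", class_="date")
    if date_row is not None:
        pairs.append((date_row.get_text(strip=True),
                      right_row.get_text(strip=True)))

for date_text, right_text in pairs:
    print(date_text, "->", right_text)
```

This does one backward search per right row, so for very long tables the single forward pass shown in the answer is likely the cheaper option.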

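Answer 1: (score: 0)

You can iterate over next_siblings until you find an element that has date as its class (that is, one that is no longer a right row):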

for date_row in soup.select('table tbody.body tr.date'):
    for elem in date_row.next_siblings:
        if not elem.name:
            # NavigableString (text) element between rows
            continue
        if 'right' not in elem.get('class', []):
            # all done, found a row that doesn't have class="right"
            break
        # elem is a right row grouped with date_row; collect or process it here

You can collect these into a list, or just process them right there in the loop.

Demo:

>>> for date_row in soup.select('table tbody.body tr.date'):
...     print('Found a date row', date_row)
...     for elem in date_row.next_siblings:
...         if not elem.name:
...             # NavigableString (text) element between rows
...             continue
...         if 'right' not in elem.get('class', []):
...             # all done, found a row that doesn't have class="right"
...             break
...         print('Right row grouped with the date', elem)
...     print()
... 
Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>

Found a date row <tr class="date">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
Right row grouped with the date <tr class="right">..</tr>
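
The "collect these into a list" option mentioned above could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's code: the helper name group_rows and the sample cell contents are hypothetical, find_all is used in place of the CSS selector, and isinstance(elem, Tag) stands in for the elem.name check to skip the text nodes between rows.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample data; the cell contents are invented for this sketch.
html = """
<table border="1" cellspacing="0" class="browser">
<tbody class="body">
<tr class="date"><td>day 1</td></tr>
<tr class="right"><td>a</td></tr>
<tr class="right"><td>b</td></tr>
<tr class="date"><td>day 2</td></tr>
<tr class="right"><td>c</td></tr>
</tbody>
</table>
"""


def group_rows(soup):
    """Return a list of (date_row, [right_rows]) tuples in document order."""
    groups = []
    for date_row in soup.find_all("tr", class_="date"):
        right_rows = []
        for elem in date_row.next_siblings:
            if not isinstance(elem, Tag):
                # Whitespace or other text between rows
                continue
            if "right" not in elem.get("class", []):
                # Reached the next date row (or any non-right row): group ends
                break
            right_rows.append(elem)
        groups.append((date_row, right_rows))
    return groups


soup = BeautifulSoup(html, "html.parser")
groups = group_rows(soup)

# Reduce the tags to their text for a compact summary
summary = [(d.get_text(strip=True), [r.get_text(strip=True) for r in rows])
           for d, rows in groups]
print(summary)
```

A list of tuples is used here instead of a dictionary so that two date rows with identical markup stay as separate groups.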