Using Beautiful Soup to extract text from multiple tags

Time: 2014-09-29 05:54:42

Tags: python html beautifulsoup

The goal is to output a dictionary of course names and their grades, going from this:

<tr>
<td class="course"><a href="/courses/1292/grades/5610">Modern Europe &amp; the World - Dewey</a></td>
<td class="percent">
    92%
</td>
<td style="display: none;"><a href="#" title="Send a Message to the Teacher" class="no-hover"><img alt="Email" src="/images/email.png?1395938788" /></a></td>
</tr>

to this:

{Modern Europe &amp; the World - Dewey: 92%, the next course name: grade...etc}

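I know how to find the percent tag or just an href tag, but I'm not sure how to get the text and compile it into a dictionary so that it's more useful. Thanks!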

2 answers:

Answer 0: (score: 1)

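Try this:
For each tr element, look for the children you need (the td cells with the course and percent classes). If both exist, add an entry to the grades dictionary.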

>>> from bs4 import BeautifulSoup
>>> html = """
... <tr>
... <td class="course"><a href="/courses/1292/grades/5610">Modern Europe &amp; the World - Dewey</a></td>
... <td class="percent">
...     92%
... </td>
... <td style="display: none;"><a href="#" title="Send a Message to the Teacher" class="no-hover"><img alt="Email" src="/images/email.png?1395938788" /></a></td>
... </tr>
... """
>>> 
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly so results don't depend on what is installed
>>> grades  = {}
>>> for tr in soup.find_all('tr'):
...     td_course  = tr.find("td", {"class" : "course"})
...     td_percent = tr.find("td", {"class" : "percent"})
...     if td_course and td_percent:
...         grades[td_course.text.strip()] = td_percent.text.strip()
... 
>>> 
>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}
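
If you need this for a whole page of rows, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name extract_grades is made up for illustration, the second row in the sample is invented data to show more than one entry, and "html.parser" is chosen simply because it ships with Python.

```python
from bs4 import BeautifulSoup


def extract_grades(html):
    """Return {course name: grade} for every <tr> that has both cells."""
    # Assumption: the stdlib parser is good enough for this markup.
    soup = BeautifulSoup(html, "html.parser")
    grades = {}
    for tr in soup.find_all("tr"):
        course = tr.find("td", class_="course")
        percent = tr.find("td", class_="percent")
        # Skip rows (headers, spacers) that lack either cell.
        if course is not None and percent is not None:
            grades[course.get_text(strip=True)] = percent.get_text(strip=True)
    return grades


# The first row comes from the question; the second is a made-up example row.
sample = """
<table>
<tr>
<td class="course"><a href="/courses/1292/grades/5610">Modern Europe &amp; the World - Dewey</a></td>
<td class="percent">
    92%
</td>
</tr>
<tr>
<td class="course"><a href="#">Example Course B</a></td>
<td class="percent">88%</td>
</tr>
</table>
"""

result = extract_grades(sample)
print(result)
```

Looking cells up by class rather than by position means the hidden "Send a Message" cell, or any extra column added later, doesn't affect the result.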

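Answer 1: (score: 1)

Since each tr contains a series of td elements holding the information you need, you can simply use find_all() to collect them into a list and then pull out what you want: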

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<tr>
<td class="course"><a href="/courses/1292/grades/5610">Modern Europe &amp; the World - Dewey</a></td>
<td class="percent">
    92%
</td>
<td style="display: none;"><a href="#" title="Send a Message to the Teacher" class="no-hover"><img alt="Email" src="/images/email.png?1395938788" /></a></td>
</tr>
""", "html.parser")  # name the parser explicitly so results don't depend on what is installed

grades = {}

for tr in soup.find_all("tr"):
    td_text = [td.text.strip() for td in tr.find_all("td")]
    grades[td_text[0]] = td_text[1]

Result:

>>> grades
{u'Modern Europe & the World - Dewey': u'92%'}
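
One caveat with indexing by position: a row with fewer than two td cells (for example a header row made of th cells) would raise an IndexError. A possible guard is sketched below. The header row in the sample is an assumption added for illustration and is not part of the markup in the question.

```python
from bs4 import BeautifulSoup

# Assumed sample: the row from the question plus a hypothetical header row.
html = """
<table>
<tr><th>Course</th><th>Grade</th></tr>
<tr>
<td class="course"><a href="/courses/1292/grades/5610">Modern Europe &amp; the World - Dewey</a></td>
<td class="percent">
    92%
</td>
<td style="display: none;"><a href="#" class="no-hover"></a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
grades = {}

for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Rows without at least a course cell and a grade cell are skipped.
    if len(cells) < 2:
        continue
    grades[cells[0]] = cells[1]

print(grades)
```

This still relies on the course being the first cell and the grade the second, so it only fits tables whose column order is fixed.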