Beautiful Soup code produces unexpected results (edited)

Time: 2017-01-11 19:54:48

Tags: python html beautifulsoup

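(The question has been edited based on the feedback received. I will keep editing it based on further input until the problem is solved.)

I am learning Python, and Beautiful Soup in particular, and I am working through the Google exercise on regex, which uses a set of HTML files containing popular baby names for different years (e.g. baby1990.html etc.). If you are interested, you can find the dataset here: https://developers.google.com/edu/python/exercises/baby-names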

Each HTML file contains a table holding the baby name data.

Another table comes before the table with the baby names. The opening tags of the two tables are, respectively:

<table width="100%" border="0" cellspacing="0" cellpadding="4"> # Unwanted table
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">  # targeted table

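As you can see, what distinguishes the targeted table from the unwanted one is the attribute summary="formatting".

The first table, the one we have to skip, has the following HTML code: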

<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr><td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td><td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
  <tr><td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td><td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
      <h1>Popular Names by Birth Year</h1>September 12, 2007</td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>

The targeted table has the following code:

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
<a href="../OACT/babynames/background.html">Background information</a>
<p><br />
&nbsp; Select another <label for="yob">year of birth</label>?<br />      
<form method="post" action="/cgi-bin/popularnames.cgi">
&nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
&nbsp; <input type="submit" value="   Go  "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
<table width="48%" border="1" bordercolor="#aaabbb"
 cellpadding="2" cellspacing="0" summary="Popularity for top 1000">
<tr align="center" valign="bottom">
<th scope="col" width="12%" bgcolor="#efefef">Rank</th>
<th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td> # Targeted row
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td> # Targeted row
etc...

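You can see that the distinguishing attribute of the targeted rows is align="right".

The code I use to extract the contents of the targeted cells is as follows: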

with open("C:/Users/ALEX/MyFiles/JUPYTER NOTEBOOKS/google-python-exercises/babynames/baby1990.html","r") \
as f: soup = bs(f.read(), 'html.parser') 

print soup.tr
print "number of elemenents in the soup:" , len(soup)

right_table = soup.find("table", summary = "formatting")

print(right_table.prettify())

print "right_table" , len(right_table)

print(right_table[0].prettify())

for row in right_table[1].find_all("tr", allign = "right"):

     cells = row.find_all("td")

     try:
                            print "cells[0]: " , cells[0]
     except:
                            print "cells[0] : NaN"
     try:
                            print "cells[1]: " , cells[1]
     except:
                            print "cells[1] : NaN"    
     try:
                            print "cells[2]: " , cells[2]
     except:
                            print "cells[2] : NaN"

The output is an error message:

    <tr><td align="left" class="sstop" valign="bottom" width="25%">
      Social Security Online
    </td><td class="titletext" valign="bottom">
<!-- sitetitle -->Popular Baby Names
    </td>
</tr>
number of elemenents in the soup: 4
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-116-3ec77a65b5ad> in <module>()
      6 right_table = soup.find("table", summary = "formatting")
      7 
----> 8 print(right_table.prettify())
      9 
     10 print "right_table" , len(right_table)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in prettify(self, encoding, formatter)
   1198     def prettify(self, encoding=None, formatter="minimal"):
   1199         if encoding is None:
-> 1200             return self.decode(True, formatter=formatter)
   1201         else:
   1202             return self.encode(encoding, True, formatter=formatter)

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode_contents(self, indent_level, eventual_encoding, formatter)
   1233             elif isinstance(c, Tag):
   1234                 s.append(c.decode(indent_level, eventual_encoding,
-> 1235                                   formatter))
   1236             if text and indent_level and not self.name == 'pre':
   1237                 text = text.strip()

... last 2 frames repeated, from the frame below ...

C:\users\alex\Anaconda2\lib\site-packages\bs4\element.pyc in decode(self, indent_level, eventual_encoding, formatter)
   1164             indent_contents = None
   1165         contents = self.decode_contents(
-> 1166             indent_contents, eventual_encoding, formatter)
   1167 
   1168         if self.hidden:

RuntimeError: maximum recursion depth exceeded while calling a Python object

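My questions are the following:

  1. Why does the code return the first table, the unwanted one, even though we passed the parameter summary="formatting"?

  2. What does the error message mean, and why is it raised?

  3. What other errors can you see in the code, if any?

  4. Any suggestions would be appreciated.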

2 answers:

Answer 0: (score: 1)

I think you misread the attribute search.

If you are looking for the table whose summary equals "Popularity for top 1000", you should use:

soup.find('table', summary="Popularity for top 1000")
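
As a minimal sketch of this lookup, the snippet below parses a small inline HTML string rather than the real baby1990.html. The HTML constant is a hypothetical, simplified stand-in for the nested tables shown in the question, and 'html.parser' is used only so that the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for baby1990.html (not the real file).
HTML = """
<table summary="formatting">
  <tr><td>
    <table summary="Popularity for top 1000">
      <tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The outer layout table and the inner data table carry different summaries.
outer = soup.find("table", summary="formatting")
inner = soup.find("table", summary="Popularity for top 1000")

# The same lookup written with an explicit attribute dictionary.
inner_via_attrs = soup.find("table", attrs={"summary": "Popularity for top 1000"})

print(inner["summary"])
print(inner.find("td").get_text())
```

Searching for summary="formatting" returns the outer layout table, which merely contains the data table, so the search has to name the inner table's own summary value.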

Hope this works for you!

Answer 1: (score: 1)

summary_ = "formatting"
allign_ = "right"

Remove the trailing _; only class needs one (class_).

  

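It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_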
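
The difference between class_ and ordinary attribute names can be sketched as follows. The HTML string is a hypothetical snippet modeled on the markup in the question, not the real file, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the markup in the question.
HTML = """
<table summary="formatting">
  <tr valign="top"><td class="greycell">Background information</td></tr>
  <tr align="right"><td>1</td><td>Michael</td><td>Jessica</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "class" is reserved in Python, so Beautiful Soup accepts class_ instead.
grey_cells = soup.find_all("td", class_="greycell")

# Any other attribute name is passed as-is, without an underscore.
right_rows = soup.find_all("tr", align="right")

# An underscored or misspelled name is treated as a different attribute
# that no tag has, so the search quietly returns nothing.
no_rows = soup.find_all("tr", align_="right")

print(len(grey_cells), len(right_rows), len(no_rows))
```

A wrong keyword name such as align_ or allign raises no error; the search simply matches no tags, which makes this kind of typo easy to miss.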

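import bs4

with open('/home/li/Downloads/google-python-exercises/babynames/baby2006.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
    # Look up the inner data table by its own summary attribute.
    table = soup.find(summary="Popularity for top 1000")
    for tr in table.find_all('tr'):
        # Collect the whitespace-stripped text of every cell in the row.
        tds = list(tr.stripped_strings)
        print(tds)

Out: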

['Rank', 'Male name', 'Female name']
['1', 'Jacob', 'Emily']
['2', 'Michael', 'Emma']
['3', 'Joshua', 'Madison']
['4', 'Ethan', 'Isabella']
['5', 'Matthew', 'Ava']
['6', 'Daniel', 'Abigail']
['7', 'Christopher', 'Olivia']
['8', 'Andrew', 'Hannah']
['9', 'Anthony', 'Sophia']
['10', 'William', 'Samantha']
['11', 'Joseph', 'Elizabeth']
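
If only the data rows are wanted, without the header row, one possible variation is to filter on align="right", as the question intended. This is a sketch under assumptions: extract_names is a hypothetical helper, and the HTML sample is a simplified stand-in that closes every row, unlike the markup shown in the question. It uses the built-in 'html.parser' only so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified sample; every <tr> is closed explicitly.
HTML = """
<table summary="Popularity for top 1000">
  <tr align="center"><th>Rank</th><th>Male name</th><th>Female name</th></tr>
  <tr align="right"><td>1</td><td>Jacob</td><td>Emily</td></tr>
  <tr align="right"><td>2</td><td>Michael</td><td>Emma</td></tr>
</table>
"""


def extract_names(html):
    """Return (rank, male, female) tuples for the data rows only."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", summary="Popularity for top 1000")
    rows = []
    # The header row has align="center", so this filter skips it.
    for tr in table.find_all("tr", align="right"):
        rank, male, female = (td.get_text(strip=True) for td in tr.find_all("td"))
        rows.append((int(rank), male, female))
    return rows


names = extract_names(HTML)
print(names)
```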