在循环中查找漂亮的汤返回TypeError

时间:2012-08-01 09:06:41

标签: python web-scraping beautifulsoup html-table html-parsing

我正在尝试使用Beautiful Soup在ajax页面上刮一张桌子,然后用表格形式用TextTable库打印出来。

import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

...

def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s

...

虽然代码中显示了我正在尝试抓取的HTML的URL,但这里有一个示例:

<table cellspacing="0" cellpadding="0">
    <tbody>
        <tr>
                        <th>Artist - Title</th>
            <th>Album</th>
            <th>Album Type</th>
            <th>Series</th>
            <th>Duration</th>
            <th>Type of Play</th>
            <th>
                <span title="...">Time to play</span>
            </th>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 1</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 1</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                5:43
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:00:00
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song2</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 2</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                6:16
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:05:43
            </td>
                    </tr>
                <tr>
                        <td class="row1">
                <a href="..." class="songinfo">Song 3</a>
            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 3</a>
            </td>
            <td class="row1">...</td>
            <td class="row1">

            </td>
            <td class="row1" style="text-align: center">
                4:13
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:11:59
            </td>
                    </tr>
                <tr>
                        <td class="row2">
                <a href="..." class="songinfo">Song 4</a>
            </td>
            <td class="row2">
                <a href="..." class="album_link">Album 4</a>
            </td>
            <td class="row2">...</td>
            <td class="row2">

            </td>
            <td class="row2" style="text-align: center">
                5:34
            </td>
            <td class="row2" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row2" style="text-align: center">
                ~0:16:12
            </td>
                    </tr>
                <tr>
                        <td class="row1"><a href="..." class="songinfo">Song 5</a>

            </td>
            <td class="row1">
                <a href="..." class="album_link">Album 5</a>
            </td>
            <td class="row1">...</td>
            <td class="row1"></td>
            <td class="row1" style="text-align: center">
                4:23
            </td>
            <td class="row1" style="padding-left: 5px; text-align: center">
                                    S.A.M.
                            </td>
            <td class="row1" style="text-align: center">
                ~0:21:46
            </td>
                    </tr>
                <tr>
            <td style="height: 5px;">
        </td></tr>
        <tr>
            <td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
        </tr>
    </tbody>
</table>

每当我尝试运行此脚本函数时,它就会在行TypeError: find() takes no keyword arguments上的header_append = th.find(text=True)中止。我有点难过,因为看起来我正在做代码示例中显示的内容,它似乎应该可以工作,但事实并非如此。

简而言之,我如何修复代码以便没有TypeError以及我做错了什么?

编辑: 我在编写脚本时提到的文章和文档:

2 个答案:

答案 0 :(得分:3)

您收到错误TypeError: find() takes no keyword arguments的原因是因为您实际上是在字符串上调用find()

string find

find是一个python字符串方法,不带关键字参数。例如:

>>> 'hello'.find('l')
2
>>> 'hello'.find('l', foo='bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: find() takes no keyword arguments

beautifulsoup找到

beautifulsoup的Tag也有find方法,这是您尝试使用的方法。

底线

在您的代码中的某个时刻,当您想要使用Tag时,最终调用了字符串find。

Python使用duck typing,这会在这种情况下造成混淆。

答案 1 :(得分:2)

基本问题

解析器表现正常。您只是使用相同的表达式来解析不同类型的元素。

修订代码

这是一个片段,仅关注返回已删除的列表。获得列表后,您可以轻松地格式化文本表:

import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')

    header = stable.findAll('th')
    headers = [ th.text for th in header ]

    cells = [ ]
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Process the body of the table
        row = []
        td = tr.findAll('td')
        row.append( td[0].find('a').text )
        row.append( td[1].find('a').text )
        row.extend( [ td.text for td in td[2:] ] )
        cells.append( row )

    footer = rows[-1].find('td').text
    return headers, cells, footer

输出

headerscellsfooter,现在可以将单元格输入texttable格式化函数:

import texttable
def show_table(headers, cells, footer):
    retval = ''
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

print show_table(headers, cells, footer)
+----------+----------+----------+----------+----------+----------+----------+
| Artist - |  Album   |  Album   |  Series  | Duration | Type of  | Time to  |
|  Title   |          |   Type   |          |          |   Play   |   play   |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1   | Album 1  | ...      |          | 5:43     | S.A.M.   | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2    | Album 2  | ...      |          | 6:16     | S.A.M.   | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3   | Album 3  | ...      |          | 4:13     | S.A.M.   | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4   | Album 4  | ...      |          | 5:34     | S.A.M.   | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5   | Album 5  | ...      |          | 4:23     | S.A.M.   | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.