I'm trying to use Beautiful Soup to scrape a table from an AJAX page, then print it in tabular form with the texttable library.
import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
...
def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax' : 'true', 'mod' : 'queue'}
    data = urllib.urlencode(values)
    f = opener.open(url, data)
    soup = BeautifulSoup.BeautifulSoup(f)
    stable = soup.find('table')
    table = texttable.Texttable()
    header = stable.findAll('th')
    header_text = []
    for th in header:
        header_append = th.find(text=True)
        header.append(header_append)
    table.header(header_text)
    rows = stable.find('tr')
    for tr in rows:
        cells = []
        cols = tr.find('td')
        for td in cols:
            cells_append = td.find(text=True)
            cells.append(cells_append)
        table.add_row(cells)
    s = table.draw
    print s
...
The URL of the HTML I'm trying to scrape is shown in the code, but here is a sample of it:
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<th>Artist - Title</th>
<th>Album</th>
<th>Album Type</th>
<th>Series</th>
<th>Duration</th>
<th>Type of Play</th>
<th>
<span title="...">Time to play</span>
</th>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 1</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 1</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
5:43
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:00:00
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song2</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 2</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
6:16
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:05:43
</td>
</tr>
<tr>
<td class="row1">
<a href="..." class="songinfo">Song 3</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 3</a>
</td>
<td class="row1">...</td>
<td class="row1">
</td>
<td class="row1" style="text-align: center">
4:13
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:11:59
</td>
</tr>
<tr>
<td class="row2">
<a href="..." class="songinfo">Song 4</a>
</td>
<td class="row2">
<a href="..." class="album_link">Album 4</a>
</td>
<td class="row2">...</td>
<td class="row2">
</td>
<td class="row2" style="text-align: center">
5:34
</td>
<td class="row2" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row2" style="text-align: center">
~0:16:12
</td>
</tr>
<tr>
<td class="row1"><a href="..." class="songinfo">Song 5</a>
</td>
<td class="row1">
<a href="..." class="album_link">Album 5</a>
</td>
<td class="row1">...</td>
<td class="row1"></td>
<td class="row1" style="text-align: center">
4:23
</td>
<td class="row1" style="padding-left: 5px; text-align: center">
S.A.M.
</td>
<td class="row1" style="text-align: center">
~0:21:46
</td>
</tr>
<tr>
<td style="height: 5px;">
</td></tr>
<tr>
<td class="row2" style="font-style: italic; text-align: center;" colspan="5">There are x songs in the queue with a total length of x:y:z.</td>
</tr>
</tbody>
</table>
Whenever I try to run this function, it aborts with TypeError: find() takes no keyword arguments on the line header_append = th.find(text=True). I'm a bit stumped, because it looks like I'm doing exactly what the code samples show, and it seems like it should work, but it doesn't.
In short: what am I doing wrong, and how do I fix the code so there is no TypeError?
Edit: the articles and documentation I referred to while writing the script:
Answer 0 (score: 3)
The reason you are getting the error TypeError: find() takes no keyword arguments is that you are actually calling find() on a string.
find is a Python string method, and it takes no keyword arguments. For example:
>>> 'hello'.find('l')
2
>>> 'hello'.find('l', foo='bar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: find() takes no keyword arguments
BeautifulSoup's Tag also has a find method, which is the one you are trying to use. At some point in your code, you end up calling the string find when you meant to use the Tag one. Python uses duck typing, which can cause confusion in cases like this.
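To see the difference concretely, here is a small self-contained sketch (plain Python, no BeautifulSoup needed) of why the string method blows up when given a keyword argument:

```python
# str.find is substring search: it returns an index and accepts
# only positional arguments, so any keyword argument fails.
s = 'hello'
print(s.find('l'))  # 2

try:
    s.find('l', text=True)
except TypeError:
    print('str.find rejects keyword arguments')
```

In the question's code, appending strings to the header list while iterating over it means the loop eventually hands a string (not a Tag) to find(text=True), triggering exactly this error.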
Answer 1 (score: 2)
The parser is behaving correctly. You are simply using the same expression to parse different kinds of elements.
Here is a snippet that focuses only on returning the scraped data as lists. Once you have the lists, formatting the text table is easy:
import BeautifulSoup

def get_queue(data):
    # Args:
    #   data: string, contains the html to be scraped
    soup = BeautifulSoup.BeautifulSoup(data)
    stable = soup.find('table')
    header = stable.findAll('th')
    headers = [th.text for th in header]
    cells = []
    rows = stable.findAll('tr')
    for tr in rows[1:-2]:
        # Process the body of the table
        row = []
        tds = tr.findAll('td')
        row.append(tds[0].find('a').text)
        row.append(tds[1].find('a').text)
        row.extend([td.text for td in tds[2:]])
        cells.append(row)
    footer = rows[-1].find('td').text
    return headers, cells, footer
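The key trick above is the slice rows[1:-2]: it skips the header row and the two trailing rows (the 5px spacer and the summary line), while rows[-1] grabs the summary row for the footer. A plain-list illustration of the same slicing (the strings are hypothetical stand-ins for the tr elements):

```python
# Stand-ins for the <tr> elements of the queue table.
rows = ['header', 'song 1', 'song 2', 'song 3', 'spacer', 'summary']

body = rows[1:-2]   # the actual queue entries
footer = rows[-1]   # the "There are x songs..." row
print(body)    # ['song 1', 'song 2', 'song 3']
print(footer)  # summary
```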
headers, cells, and footer can now be fed to a texttable formatting function:
import texttable

def show_table(headers, cells, footer):
    table = texttable.Texttable()
    table.header(headers)
    for cell in cells:
        table.add_row(cell)
    retval = table.draw()
    return retval + '\n' + footer

print show_table(headers, cells, footer)
+----------+----------+----------+----------+----------+----------+----------+
| Artist - | Album | Album | Series | Duration | Type of | Time to |
| Title | | Type | | | Play | play |
+==========+==========+==========+==========+==========+==========+==========+
| Song 1 | Album 1 | ... | | 5:43 | S.A.M. | ~0:00:00 |
+----------+----------+----------+----------+----------+----------+----------+
| Song2 | Album 2 | ... | | 6:16 | S.A.M. | ~0:05:43 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 3 | Album 3 | ... | | 4:13 | S.A.M. | ~0:11:59 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 4 | Album 4 | ... | | 5:34 | S.A.M. | ~0:16:12 |
+----------+----------+----------+----------+----------+----------+----------+
| Song 5 | Album 5 | ... | | 4:23 | S.A.M. | ~0:21:46 |
+----------+----------+----------+----------+----------+----------+----------+
There are x songs in the queue with a total length of x:y:z.