Question

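I am scraping this page, http://www.crmz.com/Directory/Industry806.htm, and I am supposed to get all of the following:

#
Company Name
Country
State/Province

But there is an RSS link next to the company name, so I don't get the results and a TypeError is raised instead.

Here is my code: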

#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.crmz.com/Directory/Industry806.htm"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table", {"border":"0", "cellspacing":"1", "cellpadding":"2"})

rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print

Here is my output:

LRI$ python scrape.py

#| Company Name| Country| State/Province|
1.| 1300 Smiles Limited|

Traceback (most recent call last):
  File "scrape.py", line 17, in <module>
    text = ''.join(td.find(text=True))
TypeError

Answer 1

Trying to join the None value returned by the text search causes the exception:

>>> [td.find(text=True) for td in rows[6].findAll('td')]
[u'2.', u'1st Dental Laboratories Plc', None, u'United Kingdom', u'&nbsp;']

The None here is what triggers the exception:

>>> ''.join(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError

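This is because .find() only finds the first text object, or returns None if there is no such object. You probably meant to use td.findAll(text=True) instead, which always returns a list: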

for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.findAll(text=True))
        print text+"|",
    print
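
If the code has to keep calling .find(), one possible workaround is to swap a None result for an empty string before joining. The sketch below is only an illustration: cell_text is a hypothetical helper, and the list copies the values from the interactive session above rather than reading a live page.

```python
def cell_text(found):
    """Return the found text, or an empty string when the search gave None."""
    return found if found is not None else u''

# Values copied from the session above: the third cell has no text node.
found_texts = [u'2.', u'1st Dental Laboratories Plc', None,
               u'United Kingdom', u'&nbsp;']

# str.join needs an iterable of strings, so passing None raises TypeError.
try:
    ''.join(None)
    raised = False
except TypeError:
    raised = True

# With the guard in place, the empty cell simply becomes an empty field.
line = u'|'.join(cell_text(t) for t in found_texts)
print(line)
```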

Or better yet, use the tag.getText() method:

for tr in rows:
    cols = tr.findAll('td')
    if cols:
        print u'|'.join([td.getText() for td in cols])

I strongly recommend that you switch to BeautifulSoup 4; BeautifulSoup 3 has not seen any bug fixes or other maintenance for over 2 years now.
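
For reference, here is a rough sketch of the same loop ported to BeautifulSoup 4. It assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend. The inline HTML is a made-up stand-in for the real page, with one cell that holds only an image link, like the RSS icon.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page: the third cell contains
# only an image link, so it has no text of its own.
html = """
<table border="0" cellspacing="1" cellpadding="2">
  <tr>
    <td>1.</td>
    <td>Example Dental Ltd</td>
    <td><a href="feed.xml"><img src="rss.gif"></a></td>
    <td>Exampleland</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"border": "0", "cellspacing": "1", "cellpadding": "2"})

lines = []
for tr in table.find_all("tr"):      # find_all is the BS4 name for findAll
    cols = tr.find_all("td")
    if cols:
        # get_text() returns '' for a cell without text, never None
        lines.append("|".join(td.get_text(strip=True) for td in cols))

for line in lines:
    print(line)
```

The cell without text shows up as an empty field between two separators instead of raising an exception.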

You may also want to look at the csv module for writing your output.
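
A minimal sketch of what that could look like, assuming Python 3. The rows are invented sample data and the in-memory buffer stands in for a real output file; the point is that csv.writer takes care of quoting when a value happens to contain the separator.

```python
import csv
import io

# Hypothetical rows, shaped like the cell text collected from each <tr>.
rows = [
    ["#", "Company Name", "Country", "State/Province"],
    ["1.", "Example Dental Ltd", "Exampleland", ""],
    ["2.", "A|B Co", "Exampleland", ""],   # name contains the delimiter
]

buffer = io.StringIO()  # stands in for open("out.csv", "w", newline="")
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)

output = buffer.getvalue()
print(output, end="")
```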

Answer 2

You should replace

text = ''.join(td.find(text=True))

with

text = ''.join(td.find(text="True"))

because the input to the text attribute is a string.

TypeError in Python - Beautiful Soup
