Question

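I'm looking to learn Beautiful Soup and am trying to extract all links from the page http://www.popsci.com ... but I am getting a syntax error.

This code should work, but it fails on every page I have tried it on. I am trying to find out why it isn't working.

Here is my code: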

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.popsci.com/"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

sci=soup.findAll('a')

for eachsci in sci:
    print eachsci['href']+","+eachsci.string

...and this is the error I get:

Traceback (most recent call last):
  File "/root/Desktop/3.py", line 12, in <module>
    print eachsci['href']+","+eachsci.string
TypeError: coercing to Unicode: need string or buffer, NoneType found
[Finished in 1.3s with exit code 1]

Answer 1

When an a element contains no text, eachsci.string is None, and you cannot concatenate None with a string using the + operator, as you are trying to do.
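To see the failure in isolation, here is a minimal sketch that needs neither network access nor Beautiful Soup. The names href and link_text are hypothetical stand-ins for eachsci['href'] and eachsci.string, and the exact wording of the TypeError message differs between Python versions (the traceback in the question comes from Python 2).

```python
# Hypothetical stand-ins for eachsci['href'] and eachsci.string.
href = "http://www.popsci.com/"
link_text = None  # what .string gives for an <a> element with no text

try:
    line = href + "," + link_text
except TypeError as exc:
    # A str cannot be concatenated with None; the message wording
    # varies by Python version, so only the exception type is kept.
    line = None
    error_name = type(exc).__name__

# One possible guard: fall back to an empty string when the value is None.
safe_line = href + "," + (link_text or "")
print(error_name, safe_line)
```

The "or" fallback is only one way to guard against None; it is shown here to make the cause of the error concrete.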

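If you replace eachsci.string with eachsci.text, that error goes away, because eachsci.text contains the empty string '' when the a element is empty, and concatenating that with another string is no problem.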

However, you will run into another problem when you hit an a element that has no href attribute: when that happens, you will get a KeyError.

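You can use dict.get() to solve this: it can return a default value if a key is not in the dictionary (the a element pretends to be a dictionary, so this works).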
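The difference between indexing and get() can be illustrated with a plain dictionary standing in for a tag's attributes. This is only a sketch: attrs_without_href is a hypothetical example value, not something taken from the page in the question.

```python
# A plain dict standing in for the attributes of <a name="top"> (hypothetical).
attrs_without_href = {"name": "top"}

try:
    href = attrs_without_href["href"]  # indexing a missing key raises KeyError
except KeyError:
    href = None

# get() returns the supplied default instead of raising.
fallback = attrs_without_href.get("href", "[no href found]")
print(href, fallback)
```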

Putting all of that together, here is a variation on your for loop:

for eachsci in sci:
    print eachsci.get('href', '[no href found]') + "," + eachsci.text
