Question

I'm trying to match links that contain specific text. I'm doing

links = soup.find_all('a',href=lambda x: ".org" in x)

but this raises TypeError: argument of type 'NoneType' is not iterable.

The correct way to do it is

links = soup.find_all('a',href=lambda x: x and ".org" in x)

Why is the extra x and needed here?

Answer 1

The reason is simple: one of the <a> tags in your HTML has no href attribute.

Here is a minimal example that reproduces the exception:

from bs4 import BeautifulSoup

html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# TypeError: argument of type 'NoneType' is not iterable

Now, if we add an href attribute, the exception goes away:

html = '<html><body><a href="foo.org">bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# [<a href="foo.org">bar</a>]

What is happening is that BeautifulSoup tries to access the href attribute of each <a> tag, and it gets None when that attribute does not exist:

html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.a.get('href'))
# output: None

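This is why the lambda has to allow for None values. Since None is falsy, x and ... prevents the right-hand side of the and expression from being evaluated when x is None, as you can see here: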

>>> None and 1/0
>>> 'foo.org' and 1/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

This is called short-circuiting.
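To make the short-circuiting visible without relying on an exception, here is a small sketch. The helper contains_org and the calls list are hypothetical names introduced only for this illustration; the helper records every value it is actually called with.

```python
calls = []

def contains_org(value):
    """Hypothetical helper: record the call, then run the membership test."""
    calls.append(value)
    return ".org" in value

for href in (None, "foo.org"):
    # The right-hand side only runs when href is truthy.
    result = href and contains_org(href)
    print(repr(href), "->", repr(result))

# Only the non-None value ever reached the helper.
print(calls)
```

For None, the and expression evaluates to None itself and contains_org is never called, so the in test never sees a None.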

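However, x and ... checks the truthiness of x, and None is not the only value that is considered falsy. Comparing x against None explicitly is therefore more correct: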

lambda x: x is not None and ".org" in x
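As a rough comparison of the two predicates, the sketch below runs both over a few sample href values. The names hrefs, truthy_check and none_check are made up for this example. For this particular test both select the same values, because an empty string cannot contain ".org"; what differs is the value returned for falsy inputs.

```python
hrefs = [None, "", "foo.org", "example.com"]

# The two predicates discussed above, given illustrative names.
truthy_check = lambda x: x and ".org" in x
none_check = lambda x: x is not None and ".org" in x

for href in hrefs:
    # Show what each predicate returns for the same input.
    print(repr(href), repr(truthy_check(href)), repr(none_check(href)))

# Both predicates keep the same values when used as a filter.
print([h for h in hrefs if truthy_check(h)])
print([h for h in hrefs if none_check(h)])
```

The explicit version returns a real bool whenever the in test runs or x is None, while the truthiness version hands back the falsy input itself (None or the empty string).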

Answer 2

The extra x and avoids the problem you ran into, namely TypeError: argument of type 'NoneType' is not iterable. Try calling the lambda with None as its argument:

>>> f = lambda x: ".org" in x
>>> f
<function <lambda> at 0x7f5dd1215ea0>
>>> f(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
TypeError: argument of type 'NoneType' is not iterable
>>> f('abcd.org/blah')
True

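The first x in x and ".org" in x tests whether x is truthy. If it is, the rest of the expression is evaluated. If it is not truthy, for example because it is None, the second part of the and expression is short-circuited and never executed. That skips the in operation and so avoids the error.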

Answer 3

The problem is that an <a ...> tag may have no href=... part, in which case you get None (which cannot be used with the in operator).
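The same situation can be sketched without BeautifulSoup. Here plain dicts stand in for tag attributes (an assumption made purely for illustration), and dict.get returns None for a missing key in the same way the missing href shows up as None. The names tags and is_org_link are hypothetical.

```python
# Plain dicts standing in for <a> tags; the first one has no href.
tags = [
    {"text": "bar"},
    {"href": "foo.org", "text": "baz"},
    {"href": "example.com", "text": "qux"},
]

def is_org_link(href):
    """Return True only for a present href that contains '.org'."""
    return href is not None and ".org" in href

# .get("href") yields None for the first entry, which the guard handles.
matches = [tag for tag in tags if is_org_link(tag.get("href"))]
print(matches)
```

A named function like this could also replace the lambda if the guard logic grows beyond a single expression.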

Using a lambda function in Beautiful Soup
