Question

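I'm trying to use BeautifulSoup to find which URLs in a list contain a given string or substring inside a <td> tag. It works when the full string is present, but fails for a substring. Here is the code I have written so far: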

import requests
from bs4 import BeautifulSoup

for url in urls:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find_all("td", text=the_word)
    print(words)
    print(url)

I don't quite understand this. Can someone guide me on how to search for a substring?

Answer 1

You can use a custom function to check whether the word is present in the text.

from bs4 import BeautifulSoup

html = '''
<td>the keyword is present in the text</td>
<td>the keyword</td>
<td></td>
<td>the word is not present in the text</td>'''

soup = BeautifulSoup(html, 'lxml')
the_word = 'keyword'
tags = soup.find_all('td', text=lambda t: t and the_word in t)
print(tags)
# [<td>the keyword is present in the text</td>, <td>the keyword</td>]

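Normally, the_word in t alone would be enough. However, if any <td> tag has no text, as in the example (<td></td>), using the_word in t raises TypeError: argument of type 'NoneType' is not iterable. That is why we first check that the text is not None, hence the function lambda t: t and the_word in t.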

If you are not comfortable with lambda, you can use a plain function that does the same thing:

def contains_word(t):
    return t and 'keyword' in t

tags = soup.find_all('td', text=contains_word)
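
As a possible alternative to the custom function, a compiled regular expression can be passed as the filter instead. The sketch below is not part of the original answer. It assumes a Beautiful Soup version that accepts the string argument (the newer name for text), and it uses 'html.parser' only so that it runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

html = '''
<td>the keyword is present in the text</td>
<td>the keyword</td>
<td></td>
<td>the word is not present in the text</td>'''

# 'html.parser' is an assumption made so the sketch has no lxml dependency.
soup = BeautifulSoup(html, 'html.parser')
the_word = 'keyword'

# re.escape keeps characters such as '.' or '+' in the_word literal,
# so the pattern behaves like a plain substring search.
pattern = re.compile(re.escape(the_word))

# Tags whose text is None (the empty <td>) simply do not match.
tags = soup.find_all('td', string=pattern)
print(tags)
```

With a regular expression there is no need for an explicit None check, and flags such as re.IGNORECASE can be added if the match should be case-insensitive.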

Answer 2

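There is no way to do this directly. The only approach I can think of is to put the text of all the 'td' tags into a data structure, such as a list or a dictionary, and test it there.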
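
A minimal sketch of that idea, assuming the cell text is collected with get_text() into a list (the sample HTML and the names cell_texts and matches are made up for illustration, and 'html.parser' is used only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = '''
<td>the <b>keyword</b> is nested in another tag</td>
<td>the keyword</td>
<td>nothing to see here</td>'''

soup = BeautifulSoup(html, 'html.parser')
the_word = 'keyword'

# Collect the full text of every <td>, including text inside child tags.
cell_texts = [td.get_text() for td in soup.find_all('td')]

# Plain substring test on the collected strings.
matches = [text for text in cell_texts if the_word in text]
print(matches)
```

Because get_text() joins the text of nested tags, this variant should also catch a cell such as the first one, where the word sits inside a <b> element.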
