Question

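I'm trying to scrape text from a website, but specifically only the text that is linked to one of two specific links, and then a second text string that follows shortly after it.

The second text string is easy to scrape because it carries a unique class I can target, so I already have that part working. What I have not managed to scrape is the first text (the one tied to one of the two specific links).

I found an SO question (Find specific link w/ beautifulsoup) and tried to implement variations of it, but could not get it to work.

Here is a snippet of the HTML I am trying to scrape. This pattern recurs on every page I scrape: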

<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>

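The two parts I want to scrape and then store in a list are the two Chinese text strings.

The first is 女孩, which means female, and is the one I have not managed to scrape.

It is always linked to one of the following two URLs:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (male)

I have tested a lot of different things, including: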

gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19')

print(gender_containers.get_text())

But with everything I have tried, I keep getting errors like this:

ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

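I think I am not successfully finding these links to get their text, but my basic Python skills have so far kept me from figuring out how to do it.

What I ultimately want is to scrape each page so that the two strings in this snippet (女孩 and 寻找2003年出生2004年失踪贵州省...)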

<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>

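...are scraped as two separate variables, so that I can store them as two items in a list, then move on to the next instance of this pattern, scrape those two pieces of text, store them as another list, and so on. I am building a list of lists in which each row/nested list contains two strings: the gender (女孩 or 男孩), followed by the longer string, which varies much more.

(I already have working code to scrape and store the longer strings; I just cannot get the gender part to work.)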

Answer 1

Try the following code.

from bs4 import BeautifulSoup
data='''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)

Output:

[女孩]
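
As a side note on the error in the question: `find_all` returns a `ResultSet`, which behaves like a list, so `get_text()` has to be called on each element rather than on the result as a whole. The sketch below assumes the `html.parser` backend, which decodes `&amp;` in attribute values to `&`; under that assumption the `href` passed to `find_all` should contain plain ampersands, and a literal containing `&amp;` would probably match nothing. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup = BeautifulSoup(html, 'html.parser')

# Assumption: the parser has already decoded &amp; to &, so match on the decoded href.
female_href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19'

# find_all returns a list-like ResultSet; get the text of each link individually.
gender_links = soup.find_all('a', href=female_href)
genders = [link.get_text() for link in gender_links]
print(genders)
```

Unlike `soup.select_one('em').text`, this returns the link text without the surrounding square brackets.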

Answer 2

It sounds like you can use an attribute = value CSS selector with the $ (ends with) operator.

If it can occur only once per page:

soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text

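This assumes that typeid=19 or typeid=15 occurs only at the end of the href of the target links. The "," between the two selectors means that either one may match.

You can also handle the possibility that no match is present, as follows: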

from bs4 import BeautifulSoup
html ='''<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>'''
soup=BeautifulSoup(html,'html.parser')
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")  # None if no such link exists
gender = match.text if match is not None else 'Not found'
print(gender)

Multiple values (note select rather than select_one, so that every match is returned):

genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
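
To get the list of lists described in the question, one option is to start from each title link and look back for the gender link next to it. The following is only a sketch under a few assumptions: the `<em>` element is a preceding sibling of the `a.xst` link, as in the snippet from the question, and the second row of the sample HTML (thread id and title) is made up for illustration. The names `rows` and `gender_selector` are illustrative too.

```python
from bs4 import BeautifulSoup

# First row: the snippet from the question.
# Second row: a made-up example (hypothetical thread id and title) for the typeid=15 case.
html = '''
<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> <a href="thread-443414-1-1.html" onclick="atarget(this)" class="s xst">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲  黄冬冬289179</a>
<em>[<a href="forum.php?mod=forumdisplay&fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> <a href="thread-000000-1-1.html" onclick="atarget(this)" class="s xst">示例标题</a>
'''
soup = BeautifulSoup(html, 'html.parser')

gender_selector = "a[href$='typeid=19'], a[href$='typeid=15']"

rows = []
for title_link in soup.select('a.xst'):
    # Assumption: the nearest preceding <em> sibling holds the gender link.
    em = title_link.find_previous_sibling('em')
    gender_link = em.select_one(gender_selector) if em is not None else None
    gender = gender_link.get_text() if gender_link is not None else 'Not found'
    rows.append([gender, title_link.get_text()])

print(rows)
```

If the real pages wrap each entry in its own row or cell, searching within that container instead of using `find_previous_sibling` would avoid picking up the gender from a neighbouring entry.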

How do I scrape text based on a specific link with BeautifulSoup?
