Question

I am trying to extract text from the following web page:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

It returned:

<a href="someURL" id="category1">Text1 I want</a>

So I split the text:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        category1 = str(url).split('category1">')[1].split('</a>')[0]
        print(category1)

and extracted "Text1 I want", but I am still missing "Text2 I want". Any ideas? Thanks.

EDIT

There are other <a></a> tags in the source code, so if I remove id= from my code, it returns all the other text that I do not need. For example:

<div class="MYClass"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>

Answer 1

Since the id of an element is unique, you can find the first <a> tag using id="category1". To find the next <a> tag, you can use the find_next() method.

from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')

a_tag1 = soup.find('a', id='category1')
print(a_tag1)    # or use `a_tag1.text` to get the text
a_tag2 = a_tag1.find_next('a')
print(a_tag2)

Output:

<a href="SomeURL" id="category1">Text1 I want</a>
<a href="SomeURL">Text2 I want</a>

(I tested this against the link you provided, and it works there too.)
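If only the link text is needed rather than the whole tags, the same find() and find_next() lookups can be combined with get_text(strip=True). The following is a minimal sketch, assuming the HTML from the question; the helper name extract_category_texts is hypothetical, and the built-in html.parser is swapped in for lxml only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

HTML = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''


def extract_category_texts(html):
    """Return the text of the id="category1" link and of the link after it.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')  # assumption: built-in parser
    first = soup.find('a', id='category1')
    if first is None:
        return []
    texts = [first.get_text(strip=True)]
    # find_next() walks forward in document order from the first link
    second = first.find_next('a')
    if second is not None:
        texts.append(second.get_text(strip=True))
    return texts


print(extract_category_texts(HTML))
```

With the sample HTML above this should print ['Text1 I want', 'Text2 I want']. Note that find_next() does not stop at the end of the div, so it relies on the second link directly following the first one.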

Answer 2

You need a small change to your code:

from bs4 import BeautifulSoup
html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all('div', class_='MYCLASS'):
    # Search inside this div only, so links elsewhere on the page are skipped
    for url in div.find_all('a'):
        print(url.text.strip())

In other words, drop the id filter from the 'a' tag lookup and run the same code.

If you only want the text of links with specific ids, you need to know those ids in advance.

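ids = ['category1']  # the ids you already know; add more as needed
for div in soup.find_all('div', class_='MYCLASS'):
    for tag_id in ids:
        # Search inside this div only, one known id at a time
        for url in div.find_all('a', id=tag_id):
            print(url.text.strip())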
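Since the question's edit mentions other div elements whose links should be skipped, another option is to start from the one known id and collect only the links inside its enclosing div. This is a sketch under stated assumptions: the PAGE string is stitched together from the fragments in the question (with closing tags added so it is well-formed), the helper name texts_in_same_div is hypothetical, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Assumed sample page: one unwanted div plus the div from the question
PAGE = '''
<div class="MYClass"><span class="Class">RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br></span></div>
<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>
'''


def texts_in_same_div(html, anchor_id):
    """Return the text of every link in the div that holds the given id.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    anchor = soup.find('a', id=anchor_id)
    if anchor is None:
        return []
    # Climb to the nearest enclosing div, then search only inside it
    container = anchor.find_parent('div')
    if container is None:
        return []
    return [a.get_text(strip=True) for a in container.find_all('a')]


print(texts_in_same_div(PAGE, 'category1'))
```

For the sample page this should print ['Text1 I want', 'Text2 I want'], leaving out RandomTextExtracted. because that link sits in a different div.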

BeautifulSoup - extracting text within one class

2 answers: