Question

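I want to extract "climate 8/17/2019 2:00 PM" from the HTML 'a' tag shown below. I wrote code that I expected would extract all of the text from the 'a' tag, after which I would use string manipulation to pull out the substring I need.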

<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>

When I run the code below, the result is:

2:00 PM

I also changed the line as follows, but it did not help: bar = topics.find('a') became bar = topics.find('a', {"class": "class_a"})

I checked the type of the bar variable: it is a bs4.element.Tag, not a string.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://tbd.com')
bs = BeautifulSoup(html.read(), 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar.text)

Answer 1

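If you already know the class of the element you want to extract from, you can read a value from its attributes the same way you would from any Python dict: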

from bs4 import BeautifulSoup

h = """<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>"""

soup = BeautifulSoup(h, "lxml")
obj = soup.find('a', class_="class_a")

print(obj.get('topic'))
# Output: climate 8/17/2019 2:00 PM
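
To make the dict comparison concrete, here is a minimal sketch that inspects the attributes of a Tag through its .attrs dictionary and reads one with a fallback. The shortened HTML string and the names link, attrs, and missing are illustrative assumptions, and html.parser is used only so that lxml does not have to be installed.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (illustrative only).
html = (
    '<a class="class_a" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">'
    "<span>2:00 PM</span></a>"
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="class_a")

# .attrs is a plain dict of the tag's attributes.
attrs = link.attrs
print(sorted(attrs))  # attribute names present on the tag

# .get() behaves like dict.get(): it returns the default when the key is absent.
topic = link.get("topic")
missing = link.get("no_such_attr", "n/a")
print(topic)
print(missing)
```

Printing sorted(attrs) is a quick way to check which attribute actually holds the string you are after before writing the extraction code.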

Answer 2

You want the value of the topic attribute, so access it by key, as you would with a dictionary:

print(bar['topic'])
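
One thing to keep in mind with key access: like a dict, it raises KeyError when the attribute is absent, whereas .get() returns None. The sketch below shows both behaviours on a made-up snippet; the second 'a' tag without a topic attribute is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

# Two links: only the first carries a topic attribute (hypothetical markup).
html = (
    '<div class="topic">'
    '<a topic="climate 8/17/2019 2:00 PM">first</a>'
    '<a href="/other">second</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

value = first["topic"]  # key access works when the attribute exists

try:
    second["topic"]  # no such attribute on this tag
    raised = False
except KeyError:
    raised = True

fallback = second.get("topic")  # returns None instead of raising

print(value, raised, fallback)
```

If some links on the real page might lack the attribute, .get() is probably the safer choice inside a loop.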

Answer 3

You should get the value of the topic attribute rather than the anchor text, as shown below:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://tbd.com')
bs = BeautifulSoup(html.read(), 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar.get('topic'))
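
To see why bar.text printed only "2:00 PM", it may help to compare the tag's text with its attribute side by side. This is a small sketch using the markup from the question inline instead of fetching a URL; the names text_only and topic_value are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="topic">
    <a class="class_a" href="/href_1" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
bar = bs.find("div", {"class": "topic"}).find("a")

# .text / get_text() only collects the text nodes nested inside the tag,
# here the contents of the <span>; attribute values are never included.
text_only = bar.get_text(strip=True)

# The string the question asks for lives in the topic attribute.
topic_value = bar.get("topic")

print(repr(text_only))
print(repr(topic_value))
```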

Answer 4

I think your main problem is that you used "topics" (plural) inside the loop when you wanted "topic" (singular).

# python3 bs_test.py

from urllib.request import urlopen
from bs4 import BeautifulSoup
# html = urlopen('https://tbd.com')
html = """
<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>
"""


# bs = BeautifulSoup(html.read(), 'html.parser')
bs = BeautifulSoup(html, 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar['topic'])
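
The plural/singular point can be checked directly: findAll (find_all in current Beautiful Soup) returns a list-like ResultSet, which has no find method, while each element of it is a Tag that does. A minimal sketch, with the inline HTML and the names error and values as illustrative assumptions:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="topic">'
    '<a topic="climate 8/17/2019 2:00 PM"><span>2:00 PM</span></a>'
    "</div>"
)
bs = BeautifulSoup(html, "html.parser")
topics = bs.find_all("div", {"class": "topic"})  # ResultSet (list-like)

# Calling find() on the whole result set fails with AttributeError.
try:
    topics.find("a")
    error = None
except AttributeError as exc:
    error = exc

# Calling find() on each individual Tag works as intended.
values = [topic.find("a")["topic"] for topic in topics]

print(type(error).__name__)
print(values)
```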

Beautiful Soup: problem extracting text from an 'a' tag that contains other tags
