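Beautiful Soup problem extracting text from an 'a' tag that contains other tags

Time: 2019-10-10 18:51:37

Tags: python beautifulsoup

I want to extract "climate 8/17/2019 2:00 PM" from the HTML 'a' tag shown below. I wrote code that I thought would extract all the text from the 'a' tag, planning to then use string operations to pull out the substring I need.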

<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>

When I run the code below, the result is:

2:00 PM

I also changed the line as shown below, but it did not help:
         bar = topics.find('a')
to
         bar = topics.find('a', {"class": "class_a"})

I checked that the type of the bar variable is the bs4.element.Tag class (not a string).

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://tbd.com')
bs = BeautifulSoup(html.read(), 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar.text)

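4 Answers:

Answer 0: (score: 3)

If you already know the class of the element you want to extract the text from, you can read the value from its attributes just as you would from any Python dict: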

from bs4 import BeautifulSoup

h = """<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>"""

soup = BeautifulSoup(h, "lxml")
obj = soup.find('a', class_ = "class_a")

print(obj.get('topic'))
#climate 8/17/2019 2:00 PM
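
To illustrate why the original attempt printed only "2:00 PM", here is a minimal sketch. It assumes a trimmed copy of the question's HTML and uses the built-in 'html.parser' instead of lxml so it runs without extra dependencies; the variable names anchor, visible_text and attributes are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (assumption: fewer attributes).
html = """<div class="topic">
    <a class="class_a" href="/href_1" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
        <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", class_="class_a")

# get_text() joins only the text nodes inside the tag;
# attribute values such as topic="..." are never part of it.
visible_text = anchor.get_text(strip=True)

# .attrs is a plain dict holding the tag's attributes.
attributes = anchor.attrs

print(visible_text)
print(attributes["topic"])
```

The wanted string lives in the topic attribute, not in a text node, so no amount of string manipulation on bar.text can recover it.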

Answer 1: (score: 1)

You want to extract the value of the topic attribute, so you should access it by key, as you would with a dictionary:

print(bar['topic'])
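
One thing worth knowing when choosing between bar['topic'] and bar.get('topic'): the two behave differently when the attribute is absent. The sketch below assumes a made-up two-anchor snippet where the second anchor has no topic attribute; the names with_topic, without_topic and the "no topic" fallback are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor lacks a topic attribute.
html = (
    '<a href="/href_1" topic="climate 8/17/2019 2:00 PM">first</a>'
    '<a href="/href_2">second</a>'
)
soup = BeautifulSoup(html, "html.parser")
with_topic, without_topic = soup.find_all("a")

# Key access works when the attribute exists.
present = with_topic["topic"]

# .get() returns None (or a supplied default) when it does not.
missing_via_get = without_topic.get("topic")
fallback = without_topic.get("topic", "no topic")

# Key access on a missing attribute raises KeyError.
try:
    without_topic["topic"]
    raised = False
except KeyError:
    raised = True

print(present, missing_via_get, fallback, raised)
```

If some anchors on the real page might lack the attribute, .get() is probably the safer choice inside a loop.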

Answer 2: (score: 1)

You should get the value of the topic attribute rather than the anchor text, as shown below:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://tbd.com')
bs = BeautifulSoup(html.read(), 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar.get('topic'))

Answer 3: (score: 1)

I think your main problem is that you used 'topics' (plural) inside the loop, when you want 'topic' (singular).

# python3 bs_test.py

from urllib.request import urlopen
from bs4 import BeautifulSoup
# html = urlopen('https://tbd.com')
html = """
<div class="topic">
    <a class="class_a" href="/href_1" data1="" data2="hello" data3="Hi" date="Monday, August 17" time="2:00 PM" topic="climate 8/17/2019 2:00 PM">
            <span>2:00 PM</span>
        <i class="Afternoon"></i>
    </a>
</div>
"""


# bs = BeautifulSoup(html.read(), 'html.parser')
bs = BeautifulSoup(html, 'html.parser')

topics = bs.findAll("div", {"class": "topic"})
for topic in topics:
    bar = topic.find('a')
    print (bar['topic'])
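
As a rough illustration of the plural versus singular point, the sketch below assumes a one-line version of the question's HTML. Calling find on the whole result list fails, because the list returned by find_all is not itself a tag; calling it on each element works. The names error_message and found are illustrative.

```python
from bs4 import BeautifulSoup

# One-line version of the question's markup (assumption: trimmed attributes).
html = (
    '<div class="topic">'
    '<a topic="climate 8/17/2019 2:00 PM"><span>2:00 PM</span></a>'
    '</div>'
)
bs = BeautifulSoup(html, "html.parser")
topics = bs.find_all("div", {"class": "topic"})

# The result list has no find() method, so this raises AttributeError.
try:
    topics.find("a")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Each element of the list is a Tag, which does support find().
found = [topic.find("a")["topic"] for topic in topics]

print(error_message)
print(found)
```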