Question

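I am using BeautifulSoup4 to extract data about course offerings from a website.

I am trying to extract only the course descriptions from the <p> elements.

When I run: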

course_descriptions = soup.findAll("p")

I get:

<p><b>INFO 101 Social Networking Technologies (5) I&amp;S/NW</b><br/>Explores today's most 
popular social networks, gaming applications, and messaging applications. Examines 
technologies, social implications, and information structure. Focuses on logic, databases, 
networked delivery, identity, access, privacy, ecommerce, organization, and retrieval.
<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO101" target="_blank">
View course details in MyPlan: INFO 101</a></p>,
<p><b>INFO 102 Gender and Information Technology (5) I&amp;S, DIV</b><br/>Explores the social 
construction of gender in relation to the history and contemporary development of 
information technologies. Considers the importance of diversity and difference in the 
design and construction of innovative information technology solutions. Challenges 
prevailing viewpoints about who can and does work in the information technology field. 
Offered: A.<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO102" 
target="_blank">View course details in MyPlan: INFO 102</a></p>,

I want these results, but without the content inside the <b></b> tags. How can I exclude it from the results?

Answer 1

Once you have course_descriptions, you can iterate over the p tags and use decompose() to remove the b tag from each one.

text = []
for item in course_descriptions:
    # Some p tags may not contain a b tag at all; item.b is None in that case.
    if item.b is not None:
        item.b.decompose()
    text.append(item.text)

The list text will then contain only the content of the p tags, without the b tags. Hope this helps.
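
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The SAMPLE_HTML string is a shortened stand-in modeled on the output in the question, and the function name descriptions_without_bold is just a hypothetical name for illustration. It assumes the built-in "html.parser" backend and removes every b tag inside each p, not only the first.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (assumption, not the real page).
SAMPLE_HTML = """
<p><b>INFO 101 Social Networking Technologies (5) I&amp;S/NW</b><br/>Explores today's most
popular social networks.<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO101" target="_blank">
View course details in MyPlan: INFO 101</a></p>
<p>No heading in this paragraph.</p>
"""


def descriptions_without_bold(html):
    """Return the text of each <p>, with any <b> children removed."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for p in soup.find_all("p"):
        # decompose() removes the tag and its contents from the tree in place.
        for b in p.find_all("b"):
            b.decompose()
        # Strip each text fragment and join the fragments with a single space.
        results.append(p.get_text(" ", strip=True))
    return results


descriptions = descriptions_without_bold(SAMPLE_HTML)
for description in descriptions:
    print(description)
```

Note that decompose() modifies the parsed tree, so the b tags are gone from soup afterwards. If you still need the course titles, read them from the b tags before removing them.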
