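I am using BeautifulSoup4 to extract data about course offerings from a website.
I am trying to extract only the course descriptions from the <p> elements.
When I run: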
course_descriptions = soup.findAll("p")
I get:
<p><b>INFO 101 Social Networking Technologies (5) I&S/NW</b><br/>Explores today's most
popular social networks, gaming applications, and messaging applications. Examines
technologies, social implications, and information structure. Focuses on logic, databases,
networked delivery, identity, access, privacy, ecommerce, organization, and retrieval.
<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO101" target="_blank">
View course details in MyPlan: INFO 101</a></p>,
<p><b>INFO 102 Gender and Information Technology (5) I&S, DIV</b><br/>Explores the social
construction of gender in relation to the history and contemporary development of
information technologies. Considers the importance of diversity and difference in the
design and construction of innovative information technology solutions. Challenges
prevailing viewpoints about who can and does work in the information technology field.
Offered: A.<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO102"
target="_blank">View course details in MyPlan: INFO 102</a></p>,
I would like to get these results, but without the content inside the <b></b> tags. How can I exclude it from the results?
Answer 0 (score: 0)
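Once you have course_descriptions, you can iterate over the p tags and call decompose() on each one's <b> tag to remove it, together with its contents, from the tree.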
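text = []
for item in course_descriptions:
    # Some <p> tags may not contain a <b> tag at all;
    # item.b is None in that case, so check before removing.
    if item.b is not None:
        item.b.decompose()
    # .text returns the remaining text of the <p> element.
    text.append(item.text)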
The list text will then contain only the text of each p tag, without the content of its <b> tag. Hope that helps.
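For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same approach. The function name descriptions_without_bold and the SAMPLE_HTML string are made up for illustration (the sample is a shortened version of the markup in the question plus one p tag without a b tag), and it assumes the built-in "html.parser" backend is acceptable for your page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, shortened from the markup in the question.
SAMPLE_HTML = """
<p><b>INFO 101 Social Networking Technologies (5) I&S/NW</b><br/>Explores today's most
popular social networks.<br/><a href="https://uwstudent.washington.edu/course/#/courses/INFO101"
target="_blank">View course details in MyPlan: INFO 101</a></p>
<p>No bold heading here.</p>
"""


def descriptions_without_bold(html):
    """Return the text of every <p> element with its first <b> child removed."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for p in soup.find_all("p"):
        # p.b is None when the paragraph has no <b> tag.
        if p.b is not None:
            # decompose() removes the tag and its contents from the tree.
            p.b.decompose()
        # Join the remaining text fragments with a space and trim whitespace.
        results.append(p.get_text(" ", strip=True))
    return results


if __name__ == "__main__":
    for description in descriptions_without_bold(SAMPLE_HTML):
        print(description)
```

Note that decompose() modifies the parsed tree in place, so the <b> tags are gone from soup afterwards. The text of the "View course details" link is still included in each result; if that is unwanted, the same check-and-decompose step could presumably be applied to p.a as well.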