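I have a BeautifulSoup object (a web page), and I have homed in on an HTML paragraph of interest. It contains several items, and I want to clear out the junk (anything other than text).
After calling the paragraph's contents attribute (e.g. paragraph_name.contents), I have the paragraph's items in a list, but I need help removing the items in the list that carry HTML tags.
This is what the list looks like: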
[u'\n',
<span>Early Education Enrollment: 0</span>,
<br/>,
u'\r\n Elementary Enrollment: 231',
<br/>,
u'\r\n Middle School Enrollment: 118',
<br/>,
u'\r\n High School Enrollment: 121',
<br/>,
u'\r\n Total Enrollment: 470',
<br/>,
u'\n',
<span>I20: True</span>,
<br/>,
u'\r\n Grade Levels: K - 12',
<br/>,
u'\r\n Year Founded: 1999',
<br/>,
u'\n',
<span>Other Accreditation: AdvancED, SACS</span>,
u'\n']
Here is all the code needed to recreate the exact problem I am having on your own machine, so that we are really working on the same problem:
from bs4 import BeautifulSoup as BS
# the sample html as a BeautifulSoup Object:
soup = BS('<p>\n<span>Early Education Enrollment: 0</span><br/>\r\n Elementary Enrollment: 231<br/>\r\n Middle School Enrollment: 118<br/>\r\n High School Enrollment: 121<br/>\r\n Total Enrollment: 470<br/>\n<span>I20: True</span><br/>\r\n Grade Levels: K - 12<br/>\r\n Year Founded: 1999<br/>\n<span>Other Accreditation: AdvancED, SACS</span>\n</p>', 'lxml')
# hone in on the paragraph I want to parse through:
target_p = soup.find('p')
# organize paragraph items into a list, although including junk for now:
dirty_list = target_p.contents
# clean up list using method I need help with:
clean_list =
I think a list comprehension is the way to go, but I can't figure out how to home in on the HTML tags. This does not work:
clean_list = [x for x in dirty_list if x != '<br/>']
Thanks!
Answer 0: (score: 0)
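It depends on the output format you are looking for. Using soup.p.text will give you all of the text, but it will include leading whitespace. Python can then be used to split the text into lines and strip the extra whitespace from each line. If needed, the lines can be joined back together: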
from bs4 import BeautifulSoup as BS
soup = BS('<p>\n<span>Early Education Enrollment: 0</span><br/>\r\n Elementary Enrollment: 231<br/>\r\n Middle School Enrollment: 118<br/>\r\n High School Enrollment: 121<br/>\r\n Total Enrollment: 470<br/>\n<span>I20: True</span><br/>\r\n Grade Levels: K - 12<br/>\r\n Year Founded: 1999<br/>\n<span>Other Accreditation: AdvancED, SACS</span>\n</p>', 'lxml')
# print() with parentheses runs on both Python 2 and Python 3
print('\n'.join(line.strip() for line in soup.p.text.split('\n')))
Giving you:
Early Education Enrollment: 0
Elementary Enrollment: 231
Middle School Enrollment: 118
High School Enrollment: 121
Total Enrollment: 470
I20: True
Grade Levels: K - 12
Year Founded: 1999
Other Accreditation: AdvancED, SACS
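If you would rather end up with a list, as the question asked, here is a minimal sketch of a list-comprehension approach. The attempt in the question fails because the items in .contents are Tag and NavigableString objects, so comparing them with the string '<br/>' never matches and nothing is filtered out. Checking the type of each item works instead. Note the assumptions: the sample HTML is shortened, the built-in 'html.parser' is used in place of 'lxml' only to avoid an extra dependency, and the names bare_strings and all_text are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened version of the sample paragraph from the question (assumption:
# the full paragraph behaves the same way).
html = ('<p>\n<span>Early Education Enrollment: 0</span><br/>'
        '\r\n Elementary Enrollment: 231<br/>'
        '\n<span>I20: True</span>\n</p>')

# 'html.parser' is swapped in for 'lxml' here to avoid the extra dependency.
soup = BeautifulSoup(html, 'html.parser')
target_p = soup.find('p')

# Option 1: keep only the bare text nodes, dropping every tag (<br/> and
# <span> alike) and any whitespace-only strings.
bare_strings = [item.strip() for item in target_p.contents
                if isinstance(item, NavigableString) and item.strip()]

# Option 2: keep all text, including the text inside the <span> tags.
# stripped_strings yields each string with surrounding whitespace removed
# and skips strings that are whitespace only.
all_text = list(target_p.stripped_strings)

print(bare_strings)
print(all_text)
```

Option 1 discards the text inside the span tags, because a span is a Tag and not a string. If you want every label in the paragraph, Option 2 is probably the one you are after.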