Python BeautifulSoup: remove items containing tags from a list

Date: 2017-11-13 06:54:23

Tags: python html list beautifulsoup list-comprehension

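I have a BeautifulSoup object (a web page) and have homed in on one HTML paragraph of interest. It contains several items, and I want to clean out the junk (anything other than text).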

After reading the paragraph's contents attribute (e.g. paragraph_name.contents), I have the paragraph's items in a list, but I need help removing the items that are HTML tags.

Here is what the list looks like:

[u'\n',
 <span>Early Education Enrollment: 0</span>,
 <br/>,
 u'\r\n        Elementary Enrollment: 231',
 <br/>,
 u'\r\n        Middle School Enrollment: 118',
 <br/>,
 u'\r\n        High School Enrollment: 121',
 <br/>,
 u'\r\n        Total Enrollment: 470',
 <br/>,
 u'\n',
 <span>I20: True</span>,
 <br/>,
 u'\r\n                                                Grade Levels: K - 12',
 <br/>,
 u'\r\n        Year Founded: 1999',
 <br/>,
 u'\n',
 <span>Other Accreditation: AdvancED, SACS</span>,
 u'\n']

Here is all the code needed to reproduce the exact problem on your machine, so that we are working on the same thing:

from bs4 import BeautifulSoup as BS
# the sample html as a BeautifulSoup Object:
soup = BS('<p>\n<span>Early Education Enrollment: 0</span><br/>\r\n        Elementary Enrollment: 231<br/>\r\n        Middle School Enrollment: 118<br/>\r\n        High School Enrollment: 121<br/>\r\n        Total Enrollment: 470<br/>\n<span>I20: True</span><br/>\r\n                                                Grade Levels: K - 12<br/>\r\n        Year Founded: 1999<br/>\n<span>Other Accreditation: AdvancED, SACS</span>\n</p>', 'lxml')
# hone in on the paragraph I want to parse through:
target_p = soup.find('p')
# organize paragraph items into a list, although including junk for now:
dirty_list = target_p.contents
# clean up the list (this is the step I need help with):
clean_list = ...  # placeholder, not a solution

I think a list comprehension is the way to go, but I can't figure out how to target the HTML tags. This doesn't work:

clean_list = [x for x in dirty_list if x != '<br/>']

Thanks!

1 Answer:

Answer 0: (score: 0)

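It depends on the output format you are looking for. soup.p.text gives you all of the text, including the text inside the span tags, but it also keeps the leading whitespace and line breaks. You can split that text into lines in Python and strip the extra whitespace from each line. If needed, the lines can then be joined back together: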

from bs4 import BeautifulSoup as BS

soup = BS('<p>\n<span>Early Education Enrollment: 0</span><br/>\r\n        Elementary Enrollment: 231<br/>\r\n        Middle School Enrollment: 118<br/>\r\n        High School Enrollment: 121<br/>\r\n        Total Enrollment: 470<br/>\n<span>I20: True</span><br/>\r\n                                                Grade Levels: K - 12<br/>\r\n        Year Founded: 1999<br/>\n<span>Other Accreditation: AdvancED, SACS</span>\n</p>', 'lxml')
# strip each line and skip blank ones; print() works on Python 2 and 3
print('\n'.join(line.strip() for line in soup.p.text.split('\n') if line.strip()))

This gives you:

Early Education Enrollment: 0
Elementary Enrollment: 231
Middle School Enrollment: 118
High School Enrollment: 121
Total Enrollment: 470
I20: True
Grade Levels: K - 12
Year Founded: 1999
Other Accreditation: AdvancED, SACS
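
If you do want a list rather than one joined string, the following is a minimal sketch of the list-comprehension route the question asks about. It assumes Python 3 and a shortened copy of the sample markup, and it swaps in the standard-library 'html.parser' backend for 'lxml' so that nothing beyond bs4 is required; the helper name item_text is made up for illustration. The comparison x != '<br/>' fails because the items in .contents are Tag and NavigableString objects, and a Tag does not compare equal to a plain string, so filtering on type is one way around it:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened copy of the sample paragraph from the question.
html = (
    '<p>\n<span>Early Education Enrollment: 0</span><br/>\r\n'
    '        Elementary Enrollment: 231<br/>\r\n'
    '        Middle School Enrollment: 118<br/>\n'
    '<span>I20: True</span>\n</p>'
)

# Assumption: 'html.parser' stands in for the 'lxml' parser used above.
soup = BeautifulSoup(html, 'html.parser')
target_p = soup.find('p')
dirty_list = target_p.contents

# The original attempt keeps every item: no element equals the str '<br/>'.
unchanged = [x for x in dirty_list if x != '<br/>']

# Option 1: keep only the bare text nodes, dropping every tag
# (this also drops the text inside the <span> elements).
text_only = [x.strip() for x in dirty_list
             if isinstance(x, NavigableString) and x.strip()]


def item_text(item):
    """Hypothetical helper: return the stripped text of a tag or text node."""
    if isinstance(item, Tag):
        return item.get_text().strip()
    return str(item).strip()


# Option 2: keep the text of every item, dropping only the empty ones
# (the <br/> tags and the whitespace-only strings).
clean_list = [item_text(x) for x in dirty_list if item_text(x)]

# Option 3: let BeautifulSoup do the same walk and stripping.
stripped = list(target_p.stripped_strings)

print(text_only)
print(clean_list)
```

Which option fits depends on whether the span text counts as junk: Option 1 treats every tag as junk, while Options 2 and 3 keep the same lines as the soup.p.text approach.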