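I'm using BeautifulSoup to parse a scraped web page and, as always, the page's regular formatting has odd exceptions.
What I have so far is a table: I put all of its rows into rows and, for each row, all of its columns into cols (which contains all the <td>s), and then I take the plain text from those elements for later use.
It looks like this: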
from bs4 import BeautifulSoup

def parse_table(html):
    soup = BeautifulSoup(html)
    table = soup.find("table", {"class" : "election"})
    rows = table.findAll("tr")
    data = []
    for row in rows:
        cols = row.findAll('td')
        # .text drops every tag inside the cell, including <li>
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # Get rid of empty values
    return data
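The problem is that one of the <td>s contains several <li>s, and I want to preserve them by replacing them with \n. Right now, using the .text attribute of ele strips all the markup, including the <li>s.
My question is: is it possible to use .text in a way that keeps only certain tags? I know I could convert ele to a string first, but then I can't automatically strip out all the other ugly tags.
Here is an example of the HTML for a <td> that contains <li>s: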
<td> November General Election Day.Scheduled Elections:
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li>
<li>County Offices</li>
<li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
</td>
Right now, my code outputs:
u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'
I'd like it to look more like:
u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'
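Answer 0 (score: 2)
I'm still not sure what the motivation behind this is, but here is the idea.
Find all the li tags and insert() a newline at the beginning of their contents.
Working example (I've added a few other tags to the td to demonstrate the behavior):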
from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, 'html.parser')

# Prepend a newline to the contents of every <li> inside the cell
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)
Prints:
November General Election Day.Scheduled Elections:
My Test String
Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable
New Paragraph
Here is how you would apply the solution in your case:
from bs4 import BeautifulSoup

html = """
<table class="election">
<tr>
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")

data = []
for row in rows:
    # Prepend a newline to every <li> found inside this row's cells
    for element in row.select('td li'):
        element.insert(0, '\n')

    # row('td') is shorthand for row.find_all('td')
    data.append([ele.text.strip() for ele in row('td')])

print(data)
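If the goal is exactly the string shown in the question (one item per line, no stray whitespace), the same insert() idea can be wrapped in a small helper. This is only a sketch: the function name cell_text is made up for illustration, and the line-trimming step is an assumption about what the asker wants rather than part of the original answer. Note that the helper modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup


def cell_text(td):
    """Return the text of a cell with each <li> starting on its own line.

    Hypothetical helper, not part of BeautifulSoup. It mutates `td`.
    """
    # Same technique as above: prepend a newline to each <li>'s contents
    for li in td.find_all("li"):
        li.insert(0, "\n")
    # Trim every line and drop the blank ones left by the source indentation
    lines = (line.strip() for line in td.get_text().splitlines())
    return "\n".join(line for line in lines if line)


html = """
<table class="election"><tr>
<td> November General Election Day.Scheduled Elections:
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
result = cell_text(soup.find("td"))
print(result)
```

Because only the <li> tags get a newline, inline tags such as <b> inside a sentence should not split the line, which is the behavior the question asks for.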