Can I modify the text inside a BeautifulSoup tag without converting it to a string?

Asked: 2014-09-16 13:02:34

Tags: python html beautifulsoup

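I am using BeautifulSoup to parse a scraped web page and, as always, there are odd exceptions to the page's usual formatting.

What I have so far is a table: I put all of its rows into rows and all of each row's columns into cols (which holds all the <td> elements), and then I take the plain text from each element for later use.

It looks like this: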

from bs4 import BeautifulSoup

# Excerpt from the body of a function; `html` holds the scraped page source.
soup = BeautifulSoup(html)
table = soup.find("table", {"class" : "election"})
rows = table.findAll("tr")
data = []

for row in rows:
    cols = row.findAll('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

return data

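The problem is that one of the <td> elements contains several <li> items, and I want to preserve them by replacing them with \n. Right now, using the .text attribute of ele strips all markup, including the <li> tags.

My question is: is it possible to use .text in a way that keeps only certain tags? I know I could convert ele to a string first, but then the other ugly tags would not be removed automatically.

Here is an example of the HTML for a <td> that contains <li> items: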

<td> November General Election Day.Scheduled Elections:
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li>
        <li>County Offices</li>
        <li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
</td>

Right now, my code outputs:

u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'

I would like it to look more like:

u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'

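1 answer:

Answer 0 (score: 2)

I am still not sure what the motivation behind the question is, but here is the idea.

Find all the li tags and insert() a newline character at the beginning of each one's contents.

Working example (I have added a few other tags to the td to demonstrate the behavior):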

from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
    <b>My Test String </b>
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, 'html.parser')
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)

Prints:

November General Election Day.Scheduled Elections:
    My Test String 


Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable

New Paragraph
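
As a variation on the example above (not the approach this answer uses), get_text() in bs4 accepts a separator string and a strip flag. The sketch below assumes it is acceptable for every text fragment, not only the li items, to end up on its own line; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
    <b>My Test String </b>
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, "html.parser")

# separator="\n" joins the text fragments with newlines;
# strip=True strips each fragment and skips whitespace-only ones.
text = soup.td.get_text(separator="\n", strip=True)
print(text)
```

Unlike the insert() approach, this does not modify the parsed tree, but it also breaks the line at the b and p tags, so it only fits if that is what you want.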

Here is how you would apply the solution in your case:

from bs4 import BeautifulSoup

html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <b>My Test String </b>
            <ul class="vtips">
                <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
            <p>New Paragraph</p>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")

data = []
for row in rows:
    for element in row.select('td li'):
        element.insert(0, '\n')
    data.append([ele.text.strip() for ele in row('td')])

print(data)
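
With the indented HTML from the question, the text produced this way still contains the whitespace-only lines left by the source indentation. A minimal sketch of one way to get the exact string the question asks for, assuming blank lines inside a cell can be discarded; cell_text is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def cell_text(td):
    """Return the text of a <td>, with each <li> starting on its own line."""
    # Prepend a newline to every <li>, as in the answer above.
    # Note: this modifies the parsed tree in place.
    for li in td.find_all("li"):
        li.insert(0, "\n")
    # Strip each line and drop the ones left empty by source indentation.
    lines = (line.strip() for line in td.get_text().splitlines())
    return "\n".join(line for line in lines if line)


html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <ul class="vtips">
                <li>Federal, Statewide, Legislative and Judicial Offices</li>
                <li>County Offices</li>
                <li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "election"})
data = [[cell_text(td) for td in row.find_all("td")]
        for row in table.find_all("tr")]
print(repr(data[0][0]))
```

The helper keeps inline text such as a b tag on the same line as its neighbors, because only the li elements receive an inserted newline.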