Question

I have a column that looks like this:

2014 Estimate
<td>1,968</td>
<td>185</td>
<td>845</td>
<td>439</td>
<td>107</td>
<td>2,735</td>
<td>1,312</td>
<td>1,285<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup></td>

It needs some cleanup, and the output should look like this:

2014 Estimate
    1968
    185
    845
    439
    107
    2735
    1312
    1285

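I'm guessing the solution is to iterate over the rows and apply a regular expression to each one? I'm just not sure how to go about it. Any hints would be appreciated.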

Answer 1

Using BeautifulSoup:

from bs4 import BeautifulSoup

s = """
2014 Estimate
<td>1,968</td>
<td>185</td>
<td>845</td>
<td>439</td>
<td>107</td>
<td>2,735</td>
<td>1,312</td>
<td>1,285<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup></td>
"""

soup = BeautifulSoup(s, "html.parser")
# Drop the footnote link, which removes the "[4]" marker in the example
for a in soup("a"):
    a.extract()
# Replace each cell with its own text, minus the thousands separators
for td in soup("td"):
    td.replace_with(td.text.replace(",", ""))

print(soup.text)

Output:

2014 Estimate
1968
185
845
439
107
2735
1312
1285
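
If installing BeautifulSoup is not an option, the same "parse it, don't pattern-match it" idea can probably be sketched with the standard library's html.parser module. This is only a rough illustration, not part of the answer above: the class name CellTextParser and the helper parse_cells are made up for this example, and it assumes every <td> holds a single integer, with any footnote marker wrapped in <sup>.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td>, ignoring anything inside <sup>."""

    def __init__(self):
        super().__init__()
        self.cells = []      # text of every finished <td>
        self._buf = None     # pieces of the <td> being read, None outside a cell
        self._sup_depth = 0  # greater than 0 while inside a <sup> footnote marker

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []
        elif tag == "sup":
            self._sup_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf))
            self._buf = None
        elif tag == "sup" and self._sup_depth:
            self._sup_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a cell and outside a footnote marker.
        if self._buf is not None and self._sup_depth == 0:
            self._buf.append(data)


def parse_cells(html):
    """Hypothetical helper: return the <td> values in html as integers."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    # Assumes each cell is a number with optional thousands separators.
    return [int(text.strip().replace(",", "")) for text in parser.cells]


html = (
    "<td>1,968</td>\n"
    "<td>185</td>\n"
    '<td>1,285<sup id="cite_ref-4" class="reference">'
    '<a href="#cite_note-4">[4]</a></sup></td>'
)
values = parse_cells(html)
print(values)
```

A cell that is empty or not numeric would raise ValueError in int(), so real data may need an extra check before converting.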

Answer 2

Remove everything inside angle brackets and square brackets (the brackets included), plus the commas.

import re
data = '''2014 Estimate
<td>1,968</td>
<td>185</td>
<td>845</td>
<td>439</td>
<td>107</td>
<td>2,735</td>
<td>1,312</td>
<td>1,285<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup></td>'''
print(re.sub(r'<.*?>|\[.*?\]|,', '', data, flags=re.DOTALL))

This outputs:

2014 Estimate
1968
185
845
439
107
2735
1312
1285
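
Since the question asks about going through the column row by row, here is a minimal sketch of the same substitution applied to one cell at a time, converting each result to an integer. The names clean_cell and column are made up for this example, and it assumes every row after the header reduces to plain digits once the tags, bracketed markers and commas are gone.

```python
import re

# Same idea as the pattern above: tags, bracketed footnote markers, commas.
MARKUP = re.compile(r"<[^>]*>|\[[^\]]*\]|,")


def clean_cell(cell):
    """Hypothetical helper: strip markup from one cell and return an int."""
    return int(MARKUP.sub("", cell).strip())


# Data rows only; the "2014 Estimate" header is not numeric and is left out.
column = [
    "<td>1,968</td>",
    "<td>185</td>",
    "<td>2,735</td>",
    '<td>1,285<sup id="cite_ref-4" class="reference">'
    '<a href="#cite_note-4">[4]</a></sup></td>',
]

values = [clean_cell(cell) for cell in column]
print(values)
```

If the column actually lives in a pandas DataFrame (the question does not say), the same pattern could presumably be passed to Series.str.replace with regex=True instead of looping by hand.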
