Question

I have the following problem: when there is whitespace between HTML tags, my code does not output the text I expect.

Instead of this output:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000

I get this:

 |salary|bonus
2005|100,000|50,000
2006|120,000|80,000

The text "year" is not output.

Here is my code:

from BeautifulSoup import BeautifulSoup
import re


html = '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td><td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

store=[]

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        try:
            row.append(''.join(td.find(text=True)))
        except Exception:
            row.append('')
    store.append('|'.join(filter(None, row)))
print '\n'.join(store)

The problem comes from:

"<td> <p>year</p></td>"

Is there a way to get rid of that space when I pull HTML from the web?

Answer 1

Instead of:

row.append(''.join(td.find(text=True)))

use:

row.append(''.join(td.text))

Output:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000

Answer 2

As @Herman suggested, you should use Tag.text to get the text of the tag you are parsing.

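In more detail, here is why Tag.find() does not do what you want. BeautifulSoup's Tag.find() is very similar to Tag.findAll(); in fact, Tag.find() just calls Tag.findAll() with the keyword argument limit set to 1. Tag.findAll() then descends the tag tree recursively and returns as soon as it finds text that satisfies the text argument. Since you set text to True, the string u' ' (the single space between <td> and <p>) technically satisfies that condition, so that is what Tag.find() returns.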

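Indeed, if you print td.findAll(text=True, limit=2), you can see that year is returned as the second result. You can also set text to a regular expression that ignores whitespace, so you could write td.find(text=re.compile('[\S\w]')).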
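To see why the first text node is a lone space, it can help to list the text nodes of the cell in document order. The sketch below does not use BeautifulSoup: it uses Python 3's standard-library html.parser as a stand-in, and the class name TextNodeCollector is made up for illustration. It only assumes that the parser reports text nodes in the order they appear.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: records every text node in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


collector = TextNodeCollector()
collector.feed('<td> <p>year</p></td>')
collector.close()

# All text nodes, whitespace-only ones included.
text_nodes = collector.nodes
# First text node that is not purely whitespace.
first_real = next(n for n in text_nodes if n.strip())

print(text_nodes)   # [' ', 'year']
print(first_real)   # year
```

Taking only the first text node yields the space, which is the same situation td.find(text=True) runs into. Skipping whitespace-only nodes is what the regular expression above achieves; note that \S on its own would already be enough, since every character matched by \w is also matched by \S.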

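I also noticed that you are using store.append('|'.join(filter(None, row))). I think you should use the csv module, in particular csv.writer. The csv module takes care of the problems you could run into if the parsed HTML itself contains pipe characters, and it makes the code cleaner.

Here is an example: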

import csv
import re
from cStringIO import StringIO

from BeautifulSoup import BeautifulSoup


html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</html>')
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

output = StringIO()
writer = csv.writer(output, delimiter='|')

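for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        # Tag.text returns the cell's text rather than only its first text node.
        row.append(td.text)

    # Drop empty cells; csv.writer joins the rest with '|' and quotes as needed.
    writer.writerow(filter(None, row))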

print output.getvalue()

The output is:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
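The point about pipes inside the parsed HTML can be illustrated with a small sketch. It is written for Python 3 (io.StringIO instead of cStringIO), and the sample rows, including the cell 'base|bonus', are made up for illustration. It compares a plain '|'.join with csv.writer when a cell contains the delimiter.

```python
import csv
from io import StringIO

rows = [
    ['year', 'salary', 'note'],
    ['2005', '100,000', 'base|bonus'],  # made-up cell that contains a pipe
]

# Plain join: the pipe inside the cell looks like an extra column separator.
naive = '\n'.join('|'.join(row) for row in rows)

# csv.writer quotes any field that contains the delimiter.
output = StringIO()
writer = csv.writer(output, delimiter='|', lineterminator='\n')
writer.writerows(rows)
quoted = output.getvalue()

# Reading the csv output back recovers the original cells.
parsed = list(csv.reader(StringIO(quoted), delimiter='|'))

print(naive)
print(quoted)
print(parsed == rows)
```

With the plain join, the second line appears to have four columns. With csv.writer, the cell is written as "base|bonus" in quotes, so a csv.reader using the same delimiter gets back three columns.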

Answer 3

Use:

html = re.sub(r'\s\s+', '', html)
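One caveat is that \s\s+ only matches runs of two or more whitespace characters, so a single space such as the one in <td> <p> is left in place. The sketch below compares that pattern with an alternative that targets whitespace between a closing > and an opening <. The alternative pattern is an assumption for illustration, not part of the answer, and it would also remove whitespace between inline tags where the space may be meaningful.

```python
import re

html = '<tr><td> <p>year</p></td>  <td>salary</td></tr>'

# Pattern from the answer: removes only runs of two or more whitespace characters.
two_or_more = re.sub(r'\s\s+', '', html)

# Alternative (an assumption, not from the answer): any whitespace between tags.
between_tags = re.sub(r'>\s+<', '><', html)

print(two_or_more)   # <tr><td> <p>year</p></td><td>salary</td></tr>
print(between_tags)  # <tr><td><p>year</p></td><td>salary</td></tr>
```

The first result still contains the single space before <p>, while the second does not.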

How do I remove whitespace between HTML tags using BeautifulSoup in Python?
