Question

Suppose we are looking at the first table in a page, so:

table = BeautifulSoup(...).table

The rows can be scanned with a clean for loop:

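# find_all('tr') yields only the row tags; iterating the table directly
# would also yield the whitespace strings between them
for row in table.find_all('tr'):
    f(row)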

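But getting a single column is where things get messy.

My question: is there an elegant way to extract a single column, either by its position or by its "name" (i.e. the text that appears in the first row of that column)?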

Answer 1

lxml is many times faster than BeautifulSoup, so you may want to use it instead.

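from lxml.html import parse

# .cssselect() needs the separate "cssselect" package to be installed
doc = parse('http://python.org').getroot()
for row in doc.cssselect('table > tr'):
    # td:nth-child(3) matches the third cell of the row (1-based)
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())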

Or, instead of looping:

rows = doc.cssselect('table > tr')
# cssselect() must be called on each row element, not on the list of rows
cells = [cell.text_content() for row in rows for cell in row.cssselect('td:nth-child(3)')]
print(cells)
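
If you would rather stay with BeautifulSoup and also pick a column by its header text, something along these lines might work. This is only a rough sketch: get_column and the sample HTML are made up for illustration, and it assumes a simple table with no colspan/rowspan, no nested tables, and the column names in the first row.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table, used only for illustration.
HTML = """
<table>
  <tr><th>Name</th><th>Lang</th><th>Year</th></tr>
  <tr><td>Guido</td><td>Python</td><td>1991</td></tr>
  <tr><td>Larry</td><td>Perl</td><td>1987</td></tr>
</table>
"""

def get_column(table, key):
    """Return the cell texts of one column.

    key is either a 0-based position (int) or the header text (str).
    Assumes no colspan/rowspan and that the first row holds the names.
    """
    # One list of cells (th or td) per table row.
    rows = [row.find_all(['th', 'td']) for row in table.find_all('tr')]
    if not rows:
        return []
    if isinstance(key, str):
        headers = [cell.get_text(strip=True) for cell in rows[0]]
        index = headers.index(key)  # raises ValueError if the name is absent
        rows = rows[1:]             # drop the header row when selecting by name
    else:
        index = key
    # Skip rows that are too short to contain this column.
    return [cells[index].get_text(strip=True) for cells in rows if len(cells) > index]

table = BeautifulSoup(HTML, 'html.parser').table
print(get_column(table, 'Lang'))  # ['Python', 'Perl']
print(get_column(table, 2))       # ['Year', '1991', '1987']
```

Selecting by position keeps the header row in the result, while selecting by name drops it; adjust to taste.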
