Question

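I'm slowly learning Python and BeautifulSoup, but this one has me stumped.

I'm trying to extract the data in columns 1 and 4 from the following layout (cut down in size): http://pastebin.com/bTruubrn

The file is stored locally. So far I have some code cobbled together from other similar questions, but I can't get it to work: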

for row in soup.find('table')[0]body.findall('tr'):
    first_column = row.findAll('td')[0].contents
    third_column = row.findAll('td')[3].contents
    print (first_column, third_column)

Answer 1

Use Beautiful Soup's support for CSS selectors:

first_column = soup.select('table tr td:nth-of-type(1)')
fourth_column = soup.select('table tr td:nth-of-type(4)')
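
For a fuller picture, here is a minimal sketch of how those two selectors might be used end to end. The sample table and its values are hypothetical (the real layout is in the linked pastebin), and the 'html.parser' choice is an assumption. Each select call returns a list of matching td elements, so the two lists are paired up with zip.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table, used only for illustration.
html = """
<table>
  <tr><th>Name</th><th>A</th><th>B</th><th>Score</th></tr>
  <tr><td>alice</td><td>1</td><td>2</td><td>90</td></tr>
  <tr><td>bob</td><td>3</td><td>4</td><td>85</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Each call returns a list of <td> elements, one per matching row.
first_column = soup.select("table tr td:nth-of-type(1)")
fourth_column = soup.select("table tr td:nth-of-type(4)")

# Pair the cells row by row and keep only their text.
pairs = [(a.get_text(strip=True), b.get_text(strip=True))
         for a, b in zip(first_column, fourth_column)]
print(pairs)  # [('alice', '90'), ('bob', '85')]
```

The header row is skipped automatically here because its cells are th, not td. Note that pairing with zip assumes every data row has both a first and a fourth cell; if some rows are shorter, the two lists can drift out of step.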

Answer 2

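Your code has several problems. This line:

soup.find('table')[0]body.findall('tr'):

makes no sense. When you use find, it returns a single BS object, and you can't access elements by index on a single object. When you use findAll, on the other hand, it returns a list of BS objects, which means you have to loop over it (or index into it) to get an individual element. That is why the body of your for loop doesn't work as expected.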
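
The difference can be seen in a small sketch. The two-row table is hypothetical, and find_all is the newer spelling of findAll in bs4:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, used only for illustration.
html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")    # a single Tag (or None if nothing matches)
rows = table.find_all("tr")   # a list-like ResultSet of Tags

print(type(table).__name__)                  # Tag
print(len(rows))                             # 2
print(rows[0].find_all("td")[1].get_text())  # b
```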

The following code should help you:

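from bs4 import BeautifulSoup

# Parse the locally stored file; naming the parser explicitly avoids bs4's warning
with open('html_file') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

table = soup.findAll('table')[0]
rows = table.findAll('tr')

first_columns = []
fourth_columns = []
for row in rows[1:]:  # skip the first row (assumed to be the header)
    cells = row.findAll('td')
    first_columns.append(cells[0])
    fourth_columns.append(cells[3])  # index 3 is the fourth column asked for in the question

for first, fourth in zip(first_columns, fourth_columns):
    print(first.text, fourth.text)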

Answer 3

You may find htql easier:

import htql

# html_data is assumed to hold the page's HTML as a string
results = htql.query(html_data, "<table>1.<tr> {c1=<td>1:tx; c4=<td>4:tx } ")

Beautiful Soup: extracting specific columns
