Question

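I'm trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column is marked up with <th> tags, and the other column I'm interested in uses <td> tags. Either way, all I manage to get is a list of the column's cells with the tags still attached. I only want the text.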

table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the first column as a list in some other form.

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

Answer 1

You can try the following code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

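As you can see, the code simply connects to the URL and fetches the HTML. BeautifulSoup then finds the first table, iterates over all of its 'tr' rows, and for each row picks the first column, which is the 'th', and the third column, which is a 'td'.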
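One caveat: .contents returns a list of child nodes, so the printed values still come wrapped in list brackets rather than as bare text. Below is a minimal sketch of the same row-by-row idea that returns plain strings via get_text(). It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used above; SAMPLE_HTML and the extract_columns helper are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one <th> and three <td> cells per row.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def extract_columns(html, td_index=2):
    """Return (th text, td text) pairs for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[0]
    pairs = []
    for row in table.find_all("tr"):
        headers = row.find_all("th")
        cells = row.find_all("td")
        # Skip rows (for example header rows) that lack the cells we need.
        if not headers or len(cells) <= td_index:
            continue
        pairs.append((headers[0].get_text(strip=True),
                      cells[td_index].get_text(strip=True)))
    return pairs


if __name__ == "__main__":
    for first, other in extract_columns(SAMPLE_HTML):
        print(first, other)
```

The td_index default of 2 mirrors the findAll('td')[2] lookup in the answer; adjust it if the column you want sits at a different position among the <td> cells.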

Answer 2

In addition to @jonhkr's answer, I thought I'd post an alternative solution I came up with.

 #!/usr/bin/python

 from BeautifulSoup import BeautifulSoup
 from sys import argv

 filename = argv[1]
 #get HTML file as a string
 html_doc = ''.join(open(filename,'r').readlines())
 soup = BeautifulSoup(html_doc)
 table = soup.findAll('table')[0].tbody

 data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
 print data

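Unlike jonhkr's answer, which fetches the web page directly, mine assumes you have saved the page on your computer and pass its filename as a command-line argument. For example: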

python file.py table.html
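A note on the indices 1 and 5 in the lambda: a text search returns every text node in the row, including whitespace-only nodes between cells, so the right positions depend on how the HTML happens to be laid out. The sketch below illustrates this on a made-up row (ROW_HTML is hypothetical, not copied from the real page) and shows stripped_strings as one way to skip the whitespace. It assumes bs4 on Python 3, where string=True is the current spelling of text=True.

```python
from bs4 import BeautifulSoup

# Hypothetical row with a newline between each cell, as in many saved pages.
ROW_HTML = "<table><tr>\n<th>Alabama</th>\n<td>1.0</td>\n<td>2.0</td>\n</tr></table>"

soup = BeautifulSoup(ROW_HTML, "html.parser")
row = soup.find("tr")

# Every text node, including the "\n" nodes between the cells.
all_text = [str(s) for s in row.find_all(string=True)]

# Only the non-empty strings, with surrounding whitespace removed.
stripped = list(row.stripped_strings)

print(all_text)
print(stripped)
```

In this sample the cell text sits at indices 1, 3 and 5 of all_text but at 0, 1 and 2 of stripped. A cell containing nested tags would contribute more than one string, so selecting cells by tag and calling get_text() on each is the sturdier option when the markup is irregular.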

Answer 3

You can also try this code:

import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm")
soup = BeautifulSoup(page.content, 'html.parser')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print(first_column, third_column)

Extracting selected columns from a table using BeautifulSoup
