I recently saw another user ask a question about extracting information from a webpage (Extracting information from a webpage with python). The answer from ekhumoro worked well on the page that user asked about. See below.
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print
My question is about using this code as a guide for parsing this page: http://www.uscho.com/rankings/d-i-mens-poll/ . With the changes below, I can only print the h1 and h3.
Input
url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "rankings")]'):
    print section.xpath('h1[1]/text()')[0]
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print
Output
USCHO.com Division I Men's Poll
December 12, 2011
The structure of the table seems to be the same, so I don't know why I can't use similar code. I'm just a mechanical engineer; any help is appreciated.
Answer 0 (score: 4)
lxml is great, but if you aren't familiar with xpath, I'd suggest BeautifulSoup:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
soup = BeautifulSoup(urlopen(url).read())

section = soup.find('section', id='rankings')
h1 = section.find('h1')
print h1.text
h3 = section.find('h3')
print h3.text
print

rows = section.find('table').findAll('tr')[1:-1]
for row in rows:
    columns = [data.text for data in row.findAll('td')[1:]]
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)
The output of this script is:
USCHO.com Division I Men's Poll
December 12, 2011
Minnesota-Duluth (49) 12-3-3 999
Minnesota 14-5-1 901
Boston College 12-6-0 875
Ohio State ( 1) 13-4-1 848
Merrimack 10-2-2 844
Notre Dame 11-6-3 667
Colorado College 9-5-0 650
Western Michigan 9-4-5 647
Boston University 10-5-1 581
Ferris State 11-6-1 521
Union 8-3-5 510
Colgate 11-4-2 495
Cornell 7-3-1 347
Denver 7-6-3 329
Michigan State 10-6-2 306
Lake Superior 11-7-2 258
Massachusetts-Lowell 10-5-0 251
North Dakota 9-8-1 88
Yale 6-5-1 69
Michigan 9-8-3 62
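A side note on the `format` call above: `{0:20}` pads a string field to 20 characters (left-aligned by default), while `{2:>6}` and `{3:>4}` right-align their fields. A tiny standalone illustration in Python 3 syntax, with a row of made-up data mirroring the answer's columns:

```python
# Hypothetical row: team (with first-place votes), record, points, previous rank.
columns = ['Minnesota-Duluth (49)', '12-3-3', '999', '1']

# Team name left-aligned in 20 chars, points and previous rank right-aligned.
line = '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)
print(line)
```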
Answer 1 (score: 2)
The structure of the table is slightly different, and there are blank entries in the columns. A possible lxml solution:
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
Output:
USCHO.com Division I Men's Poll December 12, 2011
1 Minnesota-Duluth (49) 12-3-3 999 1
2 Minnesota 14-5-1 901 2
3 Boston College 12-6-0 875 3
4 Ohio State ( 1) 13-4-1 848 4
5 Merrimack 10-2-2 844 5
6 Notre Dame 11-6-3 667 7
7 Colorado College 9-5-0 650 6
8 Western Michigan 9-4-5 647 8
9 Boston University 10-5-1 581 11
10 Ferris State 11-6-1 521 9
11 Union 8-3-5 510 10
12 Colgate 11-4-2 495 12
13 Cornell 7-3-1 347 16
14 Denver 7-6-3 329 13
15 Michigan State 10-6-2 306 14
16 Lake Superior 11-7-2 258 15
17 Massachusetts-Lowell 10-5-0 251 18
18 North Dakota 9-8-1 88 19
19 Yale 6-5-1 69 17
20 Michigan 9-8-3 62 NR
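The `tr[@class="even" or @class="odd"]` predicate is what skips the header and footer rows. A minimal offline sketch of the same idea, using a made-up table snippet rather than the live page:

```python
from lxml import etree

# Hypothetical HTML mirroring the ranking table's row classes.
html_source = '''<table>
  <tr class="header"><th>Team</th></tr>
  <tr class="odd"><td>Minnesota</td></tr>
  <tr class="even"><td>Yale</td></tr>
</table>'''
tree = etree.HTML(html_source)

# The predicate keeps only the data rows; the header row is excluded.
rows = tree.xpath('//table/tr[@class="even" or @class="odd"]')
print([row.xpath('td/text()')[0] for row in rows])  # ['Minnesota', 'Yale']
```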
Answer 2 (score: 0)
Replace 'table/tbody/tr' with 'table/tr'.
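This works because browsers insert a `<tbody>` automatically when rendering, so it shows up in the inspector, but the raw HTML served by the site has none. A minimal sketch with a made-up snippet showing the difference:

```python
from lxml import etree

# Raw HTML as served: the row sits directly under <table>, no <tbody>.
tree = etree.HTML('<table><tr><td>Yale</td></tr></table>')

# lxml's HTML parser keeps the source structure, so the tbody path finds nothing.
print(len(tree.xpath('//table/tbody/tr')))  # 0
print(len(tree.xpath('//table/tr')))        # 1
```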
Answer 3 (score: 0)
Although this question is old, it still shows up on the web, so I'd like to offer another (more direct and up-to-date) option, even though it adds more dependencies (pandas, and tabulate, which the to_markdown method depends on)...
Unfortunately, the page at the url used in this question has changed considerably since then (the table is now generated by javascript and no longer exists in the page source). So for practical purposes I'll switch to this url.
from lxml import etree, html
import pandas as pd
import requests

url = 'https://www.w3schools.com/html/html_tables.asp'
r = requests.get(url)

# If you want to get a specific table, proceed as follows:
tree_html = html.fromstring(r.content)
first_table = tree_html.xpath(".//table")[0]
df = pd.read_html(etree.tostring(first_table))[0]
print(df.to_markdown())
Output:
| | Tag | Description |
|---:|:-----------|:------------------------------------------------------------------------|
| 0 | <table> | Defines a table |
| 1 | <th> | Defines a header cell in a table |
| 2 | <tr> | Defines a row in a table |
| 3 | <td> | Defines a cell in a table |
| 4 | <caption> | Defines a table caption |
| 5 | <colgroup> | Specifies a group of one or more columns in a table for formatting |
| 6 | <col> | Specifies column properties for each column within a <colgroup> element |
| 7 | <thead> | Groups the header content in a table |
| 8 | <tbody> | Groups the body content in a table |
| 9 | <tfoot> | Groups the footer content in a table |
But you can also get all the tables in one go, like this:
list_tables = pd.read_html(r.content)
for table in list_tables:
    print(table.to_markdown() + '\n')
Output:
| | Company | Contact | Country |
|---:|:-----------------------------|:-----------------|:----------|
| 0 | Alfreds Futterkiste | Maria Anders | Germany |
| 1 | Centro comercial Moctezuma | Francisco Chang | Mexico |
| 2 | Ernst Handel | Roland Mendel | Austria |
| 3 | Island Trading | Helen Bennett | UK |
| 4 | Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
| 5 | Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy |
| | Tag | Description |
|---:|:-----------|:------------------------------------------------------------------------|
| 0 | <table> | Defines a table |
| 1 | <th> | Defines a header cell in a table |
| 2 | <tr> | Defines a row in a table |
| 3 | <td> | Defines a cell in a table |
| 4 | <caption> | Defines a table caption |
| 5 | <colgroup> | Specifies a group of one or more columns in a table for formatting |
| 6 | <col> | Specifies column properties for each column within a <colgroup> element |
| 7 | <thead> | Groups the header content in a table |
| 8 | <tbody> | Groups the body content in a table |
| 9 | <tfoot> | Groups the footer content in a table |
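To see what `read_html` returns without a network call, you can feed it an HTML string directly; newer pandas versions expect literal markup to be wrapped in `StringIO`. A self-contained sketch with a made-up two-row table:

```python
from io import StringIO
import pandas as pd

# Hypothetical HTML; the first row's <th> cells become the column headers.
html = """<table>
  <tr><th>Tag</th><th>Description</th></tr>
  <tr><td>table</td><td>Defines a table</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
print(tables[0])
```

Printing the DataFrame directly avoids the tabulate dependency that `to_markdown` needs.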