Web scraping tables from an HTML file

Time: 2017-07-18 17:39:50

Tags: python html web-scraping

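Hi all, I'm hoping to get some help extracting the tables from an HTML file and importing them into a CSV file. I'm very new to web scraping, so please bear with me if my code is completely wrong. The HTML file contains three separate tables that I want to extract: the estimate, the sampling error, and the number of non-zero plots in the estimate.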

My code looks like this:

#import necessary libraries
import urllib2
import pandas as pd

#specify URL
table = "file:///C:/Users/TMccw/Anaconda2/FiaAPI/outFArea18.html"

#Query the website & return the html to the variable 'page'
page = urllib2.urlopen(table)

#import the bs4 functions to parse the data returned from the website
from bs4 import BeautifulSoup

#Parse the html in the 'page' variable & store it in bs4 format
soup = BeautifulSoup(page, 'html.parser')

#Print out the html code with the function prettify
print soup.prettify()

#Find the tables & check type
table2 = soup.find_all('table')
print(table2)
print type(table2)

#Create new table as a dataframe
new_table = pd.DataFrame(columns=range(0,4))

#Extract the info from the HTML code 
soup.find('table').find_all('td'),{'align':'right'}

#Remove the tags and extract table info into CSV
???

Here is the HTML for the first table, "Estimate":

 ` Estimate:
     </b>
     </caption>
     <tr>
     <td>
     </td>
    <td align="center" colspan="5">
     <b>
      Ownership group
     </b>
    </td>
   </tr>
   <tr>
    <th>
     <b>
      Forest type group
     </b>
    </th>
    <td>
     <b>
      Total
     </b>
    </td>
    <td>
     <b>
      National Forest
     </b>
    </td>
    <td>
     <b>
      Other federal
     </b>
    </td>
    <td>
     <b>
      State and local
     </b>
    </td>
    <td>
     <b>
      Private
     </b>
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Total
     </b>
    </td>
    <td align="right">
     4,875,993
    </td>
    <td align="right">
     195,438
    </td>
    <td align="right">
     169,500
    </td>
    <td align="right">
     392,030
    </td>
    <td align="right">
     4,119,025
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      White / red / jack pine group
     </b>
    </td>
    <td align="right">
     40,492
    </td>
    <td align="right">
     3,426
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     10,850
    </td>
    <td align="right">
     26,217
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Loblolly / shortleaf pine group
     </b>
    </td>
    <td align="right">
     38,267
    </td>
    <td align="right">
     11,262
    </td>
    <td align="right">
     997
    </td>
    <td align="right">
     4,015
    </td>
    <td align="right">
     21,993
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Other eastern softwoods group
     </b>
    </td>
    <td align="right">
     25,181
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     25,181
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Exotic softwoods group
     </b>
    </td>
    <td align="right">
     5,868
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     662
    </td>
    <td align="right">
     5,206
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / pine group
     </b>
    </td>
    <td align="right">
     144,238
    </td>
    <td align="right">
     9,592
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     21,475
    </td>
    <td align="right">
     113,171
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / hickory group
     </b>
    </td>
    <td align="right">
     3,480,272
    </td>
    <td align="right">
     152,598
    </td>
    <td align="right">
     123,900
    </td>
    <td align="right">
     285,305
    </td>
    <td align="right">
     2,918,470
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Oak / gum / cypress group
     </b>
    </td>
    <td align="right">
     76,302
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     12,209
    </td>
    <td align="right">
     9,311
    </td>
    <td align="right">
     54,782
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Elm / ash / cottonwood group
     </b>
    </td>
    <td align="right">
     652,001
    </td>
    <td align="right">
     7,105
    </td>
    <td align="right">
     25,431
    </td>
    <td align="right">
     46,096
    </td>
    <td align="right">
     573,369
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Maple / beech / birch group
     </b>
    </td>
    <td align="right">
     346,718
    </td>
    <td align="right">
     10,871
    </td>
    <td align="right">
     818
    </td>
    <td align="right">
     12,748
    </td>
    <td align="right">
     322,281
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Other hardwoods group
     </b>
    </td>
    <td align="right">
     21,238
    </td>
    <td align="right">
     585
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     20,653
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Exotic hardwoods group
     </b>
    </td>
    <td align="right">
     2,441
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     2,441
    </td>
   </tr>
   <tr>
    <td nowrap="">
     <b>
      Nonstocked
     </b>
    </td>
    <td align="right">
     42,975
    </td>
    <td align="right">
     -
    </td>
    <td align="right">
     6,144
    </td>
    <td align="right">
     1,570
    </td>
    <td align="right">
     35,261
    </td>
   </tr>
  </table>
  <br/>
  <table border="4" cellpadding="4" cellspacing="4">
   <caption>
    <b>`

2 answers:

Answer 0 (score: 0)

I'm not sure what the specific problem is here, but one error that will trip you up is visible right away.

new_table = pd.DataFrame(columns=range(0-4))

needs to be

new_table = pd.DataFrame(columns=range(0,4))

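range(0-4) is actually range(-4), which evaluates to range(0, -4) and is therefore empty, whereas you want range(0, 4). You can simply pass either range(4) or range(0, 4) as the argument.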
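
A quick check in the interpreter makes the difference visible. This is only a minimal sketch; the names wrong and right are made up for illustration.

```python
# 0 - 4 is evaluated first, so range(0-4) is the same call as range(-4),
# which counts up from 0 and stops immediately because -4 is below 0.
wrong = list(range(0 - 4))
right = list(range(0, 4))

print(wrong)  # []
print(right)  # [0, 1, 2, 3]
```

With the empty range, the DataFrame would be created with no columns at all.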

Answer 1 (score: 0)

I made up several tables almost identical to yours and put them into a reasonably well-formed HTML page. Then I ran this code.

>>> import bs4
>>> import pandas as pd
>>> soup = bs4.BeautifulSoup(open('temp.htm').read(), 'html.parser')
>>> tables = soup.findAll('table')
>>> for t, table in enumerate(tables):
...     df = pd.read_html(str(table), skiprows=2)
...     df[0].to_csv('table%s.csv' % t)

The result is four files like the one below, named table0.csv through table3.csv.

,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
2,Loblolly / shortleaf pine group,38267,11262,997,4015,21993
3,Other eastern softwoods group,25181,-,-,-,25181
4,Exotic softwoods group,5868,-,-,662,5206
5,Oak / pine group,144238,9592,-,21475,113171
6,Oak / hickory group,3480272,152598,123900,285305,2918470
7,Oak / gum / cypress group,76302,-,12209,9311,54782
8,Elm / ash / cottonwood group,652001,7105,25431,46096,573369
9,Maple / beech / birch group,346718,10871,818,12748,322281
10,Other hardwoods group,21238,585,-,-,20653
11,Exotic hardwoods group,2441,-,-,-,2441
12,Nonstocked,42975,-,6144,1570,35261
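
Note that the "-" entries come through as text, so any column containing one will not be numeric when the file is read back. If it is safe to assume that "-" simply means "no value" (an assumption about the data, not something the page states), one possible way to handle it is to treat the dash as missing when loading the CSV. The sketch below uses two rows copied from the output above in place of the real file; the name csv_text is only for illustration.

```python
import io

import pandas as pd

# Two rows copied from the output above, standing in for table0.csv.
csv_text = """,0,1,2,3,4,5
0,Total,4875993,195438,169500,392030,4119025
1,White / red / jack pine group,40492,3426,-,10850,26217
"""

# na_values=["-"] turns the dash placeholders into NaN,
# so the affected columns are parsed as numbers instead of strings.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["-"])

print(df.dtypes)
print(df["3"].isna().tolist())
```

With a real file, the io.StringIO(csv_text) argument would be replaced by the file name, for example 'table0.csv'.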

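The main thing I should perhaps mention is that I skipped the same number of rows in each of the tables that BeautifulSoup returned. If the number of header rows differs between tables, you will have to do something cleverer, or just omit the skiprows parameter and discard the unwanted rows from the output files.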
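
One possible "cleverer" approach, sketched here under assumptions rather than taken from the answer above: instead of counting header rows, keep only the rows that contain at least one td with align="right", since in the sample HTML those are exactly the data rows. The sketch assumes Python 3 and swaps in the standard library's html.parser and csv modules in place of BeautifulSoup and pandas; the names TableRows and table_to_csv are hypothetical, and the sample string is a shortened copy of the table in the question.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the data rows of every <table>, one list of rows per table."""

    def __init__(self):
        super().__init__()
        self.tables = []          # each table is a list of rows
        self._row = None          # cells of the current row, or None
        self._cell = None         # text fragments of the current cell, or None
        self._cell_right = False  # is the current cell align="right"?
        self._row_has_data = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row, self._row_has_data = [], False
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._cell_right = dict(attrs).get("align") == "right"
            self._row_has_data = self._row_has_data or self._cell_right

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            if self._cell_right:
                text = text.replace(",", "")  # drop thousands separators
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Header rows have no right-aligned cells, so they are dropped.
            if self._row_has_data:
                self.tables[-1].append(self._row)
            self._row = self._cell = None


def table_to_csv(rows):
    """Return the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


sample = """
<table>
 <caption><b>Estimate:</b></caption>
 <tr><td></td><td align="center" colspan="5"><b>Ownership group</b></td></tr>
 <tr><th><b>Forest type group</b></th><td><b>Total</b></td>
     <td><b>National Forest</b></td></tr>
 <tr><td nowrap=""><b>Total</b></td>
     <td align="right">4,875,993</td><td align="right">195,438</td></tr>
 <tr><td nowrap=""><b>Exotic softwoods group</b></td>
     <td align="right">5,868</td><td align="right">-</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
print(table_to_csv(parser.tables[0]))
```

This relies entirely on the align="right" convention seen in the question's HTML; a page that marks its numeric cells differently would need a different test for what counts as a data row.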