Python - Web scraping an HTML table and printing it to CSV

Time: 2017-09-15 15:08:17

Tags: python html csv web-scraping beautifulsoup

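I'm pretty much brand new to Python, but I'm looking to build a web-scraping tool that will scrape data from an HTML table online and print it to a CSV file in the same format.

Here is a sample of the HTML table (it is enormous, so I am only providing a few rows).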

<div class="col-xs-12 tab-content">
        <div id="historical-data" class="tab-pane active">
          <div class="tab-header">
            <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>

            <div class="clear"></div>

            <div class="row">
              <div class="col-md-12">
                <div class="pull-left">
                  <small>Currency in USD</small>
                </div>
                <div id="reportrange" class="pull-right">
                    <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                    <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                </div>
              </div>
            </div>

            <table class="table">
              <thead>
              <tr>
                <th class="text-left">Date</th>
                <th class="text-right">Open</th>
                <th class="text-right">High</th>
                <th class="text-right">Low</th>
                <th class="text-right">Close</th>
                <th class="text-right">Volume</th>
                <th class="text-right">Market Cap</th>
              </tr>
              </thead>
              <tbody>

                <tr class="text-right">
                  <td class="text-left">Sep 14, 2017</td>
                  <td>3875.37</td>     
                  <td>3920.60</td>
                  <td>3153.86</td>
                  <td>3154.95</td>
                  <td>2,716,310,000</td>
                  <td>64,191,600,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 13, 2017</td>
                  <td>4131.98</td>     
                  <td>4131.98</td>
                  <td>3789.92</td>
                  <td>3882.59</td>
                  <td>2,219,410,000</td>
                  <td>68,432,200,000</td>
                </tr>

                <tr class="text-right">
                  <td class="text-left">Sep 12, 2017</td>
                  <td>4168.88</td>     
                  <td>4344.65</td>
                  <td>4085.22</td>
                  <td>4130.81</td>
                  <td>1,864,530,000</td>
                  <td>69,033,400,000</td>
                </tr>                
              </tbody>
            </table>
          </div>

        </div>
    </div>

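I'm specifically interested in recreating the table with the same column headers: "Date," "Open," "High," "Low," "Close," "Volume," "Market Cap." So far I have been able to write a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses 'for' statements to get the td elements. Below is a sample of my code (URL omitted) and the result: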

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv

url = "enterURLhere"
page = requests.get(url)
pagetext = page.text

pricetable = {
    "Date" : [],
    "Open" : [],
    "High" : [],
    "Low" : [],
    "Close" : [],
    "Volume" : [],
    "Market Cap" : []
}

soup = BeautifulSoup(pagetext, 'html.parser')

file = open("test.csv", 'w')

for row in soup.find_all('tr'):
    for col in row.find_all('td'):
        print(col.text)

sample output

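Does anyone have any pointers on how to at least reformat the data into a table? Thanks.

1 Answer:

Answer 0 (score: 2)

Run the code below and you will get the desired data from that table. To try it out against the element you posted, all you need to do is wrap the whole HTML element you pasted above in html=''' '''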

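import csv
from bs4 import BeautifulSoup

# `html` holds the markup from the question, wrapped as html = ''' ... '''
tree = BeautifulSoup(html,"lxml")
table_tag = tree.select("table")[0]
# One list per <tr>, holding the text of its header (<th>) and data (<td>) cells
tab_data = [[item.text for item in row_data.select("th,td")]
                for row_data in table_tag.select("tr")]

# The with-block closes the file when done, so every row is flushed to disk;
# newline='' keeps the csv module from adding blank lines between rows on Windows
with open("table_data.csv","w",newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))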
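
The snippet above assumes a variable named `html` that already holds the markup as a string. A minimal sketch of that setup follows, using a copy of the question's table trimmed to three columns and one data row for brevity. It passes "html.parser" (the parser bundled with Python) instead of "lxml" only so that it runs without the extra lxml install; that swap and the helper name `table_rows` are assumptions made for this illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (three columns, one data row).
html = '''
<table class="table">
  <thead>
    <tr>
      <th class="text-left">Date</th>
      <th class="text-right">Open</th>
      <th class="text-right">Close</th>
    </tr>
  </thead>
  <tbody>
    <tr class="text-right">
      <td class="text-left">Sep 14, 2017</td>
      <td>3875.37</td>
      <td>3154.95</td>
    </tr>
  </tbody>
</table>
'''


def table_rows(markup):
    """Hypothetical helper: return the first table as a list of rows of cell text."""
    # "html.parser" ships with Python; "lxml" also works if it is installed.
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.select("table")[0]
    # Outer comprehension walks the rows, inner one walks each row's cells.
    return [[cell.get_text(strip=True) for cell in row.select("th,td")]
            for row in table.select("tr")]


rows = table_rows(html)
for row in rows:
    print(row)
```

For a live page, the same function could presumably be given `requests.get(url).text` in place of the hard-coded string, as in the question's script.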

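I've tried to break the code into pieces to make it easier to follow. The list comprehension in my first snippet is really a nested for loop. Here is how it looks when written out: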

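from bs4 import BeautifulSoup

# `html` holds the table markup as a string, as before
soup = BeautifulSoup(html,"lxml")
table = soup.find('table')

list_of_rows = []
# Outer loop: one pass per table row
for row in table.find_all('tr'):
    list_of_cells = []
    # Inner loop: collect the text of each header or data cell in the row
    for cell in row.find_all(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))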

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
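
One thing the space-joined printout hides is that the dates and the Volume and Market Cap figures contain commas themselves. `csv.writer` quotes such fields by default, so they should survive a round trip through the CSV file. The sketch below illustrates this with an in-memory buffer and two rows copied from the result above. The helper `to_number` is a hypothetical addition for turning the comma-grouped strings into numbers; it is not part of the answer.

```python
import csv
import io

rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
]

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same cells: commas inside quoted
# fields are not treated as column separators.
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))


def to_number(text):
    """Hypothetical helper: drop thousands separators, then parse as int or float."""
    cleaned = text.replace(",", "")
    return float(cleaned) if "." in cleaned else int(cleaned)


volume = to_number(read_back[1][5])
print(volume)
```

Keeping the cells as strings in the CSV preserves the table exactly as the page shows it; a conversion like `to_number` is only needed if the values are going to be used in calculations.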