Extracting data from a Python response body into an Excel file

Date: 2019-01-08 00:36:00

Tags: python html python-2.7 parsing beautifulsoup

I'm using mechanize to pull some data from a password-protected site that I subscribe to.

I can access the site's .txt file with the following code:

import mechanize
from bs4 import BeautifulSoup

username = ''
password = ''

login_post_url = "http://www.naturalgasintel.com/user/login"
internal_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt"

browser = mechanize.Browser()
browser.open(login_post_url)
browser.select_form(nr = 1)
browser.form['user[email]'] = username
browser.form['user[password]'] = password
browser.submit()

response = browser.open(internal_url)
print response.read().decode('utf-8').encode('utf-8')

This prints out the data in the format I want (minus the extra whitespace between the data points):

Point Code      Issue Date      Trade Date      Region  Pricing Point   Low     High    Average Volume  Deals   Delivery Start Date     Delivery End Date
STXAGUAD        2018-12-21      2018-12-20      South Texas     Agua Dulce                                              2018-12-21      2018-12-21
STXFGTZ1        2018-12-21      2018-12-20      South Texas     Florida Gas Zone 1      3.580   3.690   3.660   30      7       2018-12-21      2018-12-21
STXNGPL 2018-12-21      2018-12-20      South Texas     NGPL S. TX                                              2018-12-21      2018-12-21
STXTENN 2018-12-21      2018-12-20      South Texas     Tennessee Zone 0 South  3.460   3.580   3.525   230     42      2018-12-21      2018-12-21
STXTETCO        2018-12-21      2018-12-20      South Texas     Texas Eastern S. TX     3.510   3.575   3.530   120     28      2018-12-21      2018-12-21
STXST30 2018-12-21      2018-12-20      South Texas     Transco Zone 1  3.505   3.505   3.505   9       2       2018-12-21      2018-12-21
STX3PAL 2018-12-21      2018-12-20      South Texas     Tres Palacios   3.535   3.720   3.630   196     24      2018-12-21      2018-12-21
STXRAVG 2018-12-21      2018-12-20      South Texas     S. TX Regional Avg.     3.460   3.720   3.570   584     103     2018-12-21      2018-12-21

However, I want to read and write all of this data to an Excel file.

I tried breaking it down into the actual text using soup = BeautifulSoup(response.read().decode('utf-8').encode('utf-8')), but it gives me the same thing, just in HTML form:

<html><body><p>Point Code\tIssue Date\tTrade Date\tRegion\tPricing Point\tLow\tHigh\tAverage\tVolume\tDeals\tDelivery Start Date\tDelivery End Date\nSTXAGUAD\t2018-12-21\t2018-12-20\tSouth Texas\tAgua Dulce\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXFGTZ1\t2018-12-21\t2018-12-20\tSouth Texas\tFlorida Gas Zone 1\t3.580\t3.690\t3.660\t30\t7\t2018-12-21\t2018-12-21\nSTXNGPL\t2018-12-21\t2018-12-20\tSouth Texas\tNGPL S. TX\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXTENN\t2018-12-21\t2018-12-20\tSouth Texas\tTennessee Zone 0 South\t3.460\t3.580\t3.525\t230\t42\t2018-12-21\t2018-12-21\nSTXTETCO\t2018-12-21\t2018-12-20\tSouth Texas\tTexas Eastern S. TX\t3.510\t3.575\t3.530\t120\t28\t2018-12-21\t2018-12-21\

I could start working out how to strip the HTML tags from this soup variable, but is there an easier way to pull this data out?

1 Answer:

Answer 0 (score: 1):

Since you've indicated that you're able to use Python 3, I'd suggest the following steps:

Download Anaconda

Download Anaconda Python for your OS.

In the bigger picture, Anaconda has the best native support for data science and data retrieval. You'll be downloading Python 3.7, which gives you everything Python 2.7 does (with a few changes) without the headaches. What matters for your case is that Python 2.7 will cause you pain when working with utf-8. This alone will solve a lot of problems.

Install your libraries

After installing Anaconda (and, if you opted out during installation, after adding conda.exe to your system PATH variable, which takes 2 minutes), you'll need to install your packages. Based on your script, that looks like this:

conda install mechanize bs4 requests lxml -y

Be patient: conda can take anywhere from 2 to 10 minutes to "solve your environment" before it installs anything.

Parse your data with pandas

There are two options you can try here, depending on how lucky you get with the format of the HTML you're scraping:

import pandas as pd # This can go at the top with the other imports.

Using pandas.read_html()

response = browser.open(internal_url)
html = response.read().decode('utf-8')
tables = pd.read_html(html)  # read_html returns a list of DataFrames, one per <table> found
df = tables[0]
print(df)  # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv")
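Since the question asks for an Excel file rather than a CSV: Excel will open a CSV directly, but if you want a true .xlsx you could swap the last line for DataFrame.to_excel (note this needs the openpyxl or xlsxwriter package, which isn't in the install list above):

df.to_excel("naturalgasintel.xlsx", index=False)  # writes a real Excel file; requires openpyxl or xlsxwriter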

Using pandas.DataFrame.from_records()

response = browser.open(internal_url)
html = response.read().decode('utf-8').strip()
soup = BeautifulSoup(html, 'lxml')
# If your data is embedded within a nested table, you may need to run soup.find() here
df = pd.DataFrame.from_records(soup)
print(df)  # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv")
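If neither option lines up the columns cleanly, note that the response body shown in the question is really just tab-separated text rather than an HTML table. Here is a minimal sketch under that assumption (io.StringIO and to_excel are additions that aren't in the original answer; browser and internal_url come from the question's script) that skips BeautifulSoup entirely and lets pandas split on tabs:

import io
import pandas as pd

response = browser.open(internal_url)
text = response.read().decode('utf-8')

# The feed appears to be tab-delimited, so parse it as a flat file instead of HTML.
df = pd.read_csv(io.StringIO(text), sep='\t')

print(df.head())  # quick sanity check on the parsed columns
df.to_excel("naturalgasintel.xlsx", index=False)  # needs openpyxl or xlsxwriter installed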

Hope that helps! Pandas is a fantastic library for parsing your data intuitively.