I'm using mechanize to grab some data from a password-protected website that I subscribe to. I can access the site's .txt with the following code:
import mechanize
from bs4 import BeautifulSoup

username = ''  # subscription credentials go here
password = ''

login_post_url = "http://www.naturalgasintel.com/user/login"
internal_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt"

browser = mechanize.Browser()
browser.open(login_post_url)
browser.select_form(nr = 1)  # the login form is the second form on the page
browser.form['user[email]'] = username
browser.form['user[password]'] = password
browser.submit()

response = browser.open(internal_url)
print response.read().decode('utf-8').encode('utf-8')  # Python 2 print statement
This prints out the format I want (minus the extra whitespace between the data points):
Point Code Issue Date Trade Date Region Pricing Point Low High Average Volume Deals Delivery Start Date Delivery End Date
STXAGUAD 2018-12-21 2018-12-20 South Texas Agua Dulce 2018-12-21 2018-12-21
STXFGTZ1 2018-12-21 2018-12-20 South Texas Florida Gas Zone 1 3.580 3.690 3.660 30 7 2018-12-21 2018-12-21
STXNGPL 2018-12-21 2018-12-20 South Texas NGPL S. TX 2018-12-21 2018-12-21
STXTENN 2018-12-21 2018-12-20 South Texas Tennessee Zone 0 South 3.460 3.580 3.525 230 42 2018-12-21 2018-12-21
STXTETCO 2018-12-21 2018-12-20 South Texas Texas Eastern S. TX 3.510 3.575 3.530 120 28 2018-12-21 2018-12-21
STXST30 2018-12-21 2018-12-20 South Texas Transco Zone 1 3.505 3.505 3.505 9 2 2018-12-21 2018-12-21
STX3PAL 2018-12-21 2018-12-20 South Texas Tres Palacios 3.535 3.720 3.630 196 24 2018-12-21 2018-12-21
STXRAVG 2018-12-21 2018-12-20 South Texas S. TX Regional Avg. 3.460 3.720 3.570 584 103 2018-12-21 2018-12-21
But I want to read and write all of this data to an Excel file. I tried soup = BeautifulSoup(response.read().decode('utf-8').encode('utf-8')) to break it down into the actual text, but it gave me the same thing, just in html form:
<html><body><p>Point Code\tIssue Date\tTrade Date\tRegion\tPricing Point\tLow\tHigh\tAverage\tVolume\tDeals\tDelivery Start Date\tDelivery End Date\nSTXAGUAD\t2018-12-21\t2018-12-20\tSouth Texas\tAgua Dulce\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXFGTZ1\t2018-12-21\t2018-12-20\tSouth Texas\tFlorida Gas Zone 1\t3.580\t3.690\t3.660\t30\t7\t2018-12-21\t2018-12-21\nSTXNGPL\t2018-12-21\t2018-12-20\tSouth Texas\tNGPL S. TX\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXTENN\t2018-12-21\t2018-12-20\tSouth Texas\tTennessee Zone 0 South\t3.460\t3.580\t3.525\t230\t42\t2018-12-21\t2018-12-21\nSTXTETCO\t2018-12-21\t2018-12-20\tSouth Texas\tTexas Eastern S. TX\t3.510\t3.575\t3.530\t120\t28\t2018-12-21\t2018-12-21\
I could start thinking about stripping the html tags out of this soup variable, but is there an easier way to strip out this data?
Answer (score: 1):
Since you've indicated that you're able to use Python 3, I'd suggest the following steps:
Download Anaconda Python for your OS
In the broader view, Anaconda has the best native support for data science and data retrieval. You'll be downloading Python 3.7, which gives you all the functionality of Python 2.7 (with a few changes) without the headaches. Important for your case: Python 2.7 will cause you pain when working with utf-8. This will solve a lot of problems:
After installing Anaconda (and, if you opted out during installation, after adding conda.exe to your system PATH variable, which takes 2 minutes), you'll need to install your packages. Judging from your script, that looks like this:
conda install beautifulsoup4 requests lxml pandas -y
pip install mechanize  # mechanize typically comes from PyPI rather than the default conda channels
Be patient here - conda can take anywhere from 2 to 10 minutes to "solve your environment" before it installs anything.
You have 2 options to try here, and they depend on how lucky you get with the format of the html you're scraping:
import pandas as pd  # This can go at the top with the other imports.

# Option 1: let pandas parse the html directly.
response = browser.open(internal_url)
html = response.read().decode('utf-8')
# read_html returns a list of DataFrames, one per <table> it finds, so take the first.
df = pd.read_html(html)[0]
print(df)  # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv", index=False)  # to_csv is a method on the DataFrame, not on pandas itself
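One caveat here: pd.read_html only succeeds if the response actually contains a <table> element, and your sample shows tab-separated text wrapped in a <p> tag, so it may raise a "No tables found" error. Since the .txt is really just tab-separated values, a minimal sketch like the following (my assumption, not part of the original answer) may be more direct:

import io
import pandas as pd

response = browser.open(internal_url)
text = response.read().decode('utf-8')
# Treat the payload as tab-separated values instead of hunting for html tables.
df = pd.read_csv(io.StringIO(text), sep='\t')
print(df)
df.to_csv("naturalgasintel.csv", index=False)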
# Option 2: pull the text out of the html yourself, then hand the rows to pandas.
response = browser.open(internal_url)
soup = BeautifulSoup(response.read().decode('utf-8').strip(), 'lxml')
# If your data is embedded within a nested table, you may need to run soup.find() here.
rows = [line.split('\t') for line in soup.get_text().splitlines() if line.strip()]
df = pd.DataFrame(rows[1:], columns=rows[0])  # the first row holds the column headers
print(df)  # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv", index=False)
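And since your goal was an Excel file specifically, the DataFrame from either option can be written out with to_excel instead of to_csv. A minimal sketch, assuming you have an Excel writer engine such as openpyxl installed (conda install openpyxl):

# Write the parsed DataFrame to an Excel workbook instead of a csv.
df.to_excel("naturalgasintel.xlsx", sheet_name="Daily-GPI", index=False)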
Hope that helps! Pandas is a fantastic library for parsing your data intuitively.