Loading a generic Google Spreadsheet in Pandas

Time: 2014-06-05 15:00:24

Tags: python pandas gdata

When I try to load a Google Spreadsheet in pandas:
from StringIO import StringIO
import pandas as pd
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)

I get the following:

CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386

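Why? I would have thought that pandas could identify the set of spreadsheet rows and columns that contain data, and use the spreadsheet rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?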
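One way to investigate an error like this is to count the fields in each row of the downloaded text before handing it to pandas. The following is only a sketch using the standard `csv` module; `ragged_rows` is a hypothetical helper name, not something from pandas or the original post, and it reports row numbers as the `csv` reader sees them (Python 3).

```python
import csv
from io import StringIO


def ragged_rows(text):
    """Return (row_number, field_count) for every row whose width
    differs from the first row. Row numbers start at 1."""
    rows = csv.reader(StringIO(text))
    first = next(rows, None)
    if first is None:
        # Empty input: nothing to compare against.
        return []
    expected = len(first)
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=2)
        if len(row) != expected
    ]
```

If nearly every row is reported, the response is probably not the CSV you expected (for example, an HTML page), rather than a sheet with a few stray cells.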

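3 answers:

Answer 0: (score: 7)

This question of mine shows how to do it: Getting Google Spreadsheet CSV into A Pandas Dataframe

As one of the commenters noted, you have not asked for the data in CSV format; you have the "edit" request at the end of the URL. You can use this code and see it working on the spreadsheet (which, by the way, needs to be public). It is possible to do this with private sheets as well, but that is another topic.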

from StringIO import StringIO  # got moved around in python3 if you're using that.

import pandas as pd
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content

In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])

In [11]: df.head()
Out[11]: 
          City                                            region     Res_Comm  \
0       Dothan  South_Central-Montgomery-Auburn-Wiregrass-Dothan  Residential   
10       Foley                              South_Mobile-Baldwin  Residential   
12  Birmingham      North_Central-Birmingham-Tuscaloosa-Anniston   Commercial   
38       Brent      North_Central-Birmingham-Tuscaloosa-Anniston  Residential   
44      Athens                 North_Huntsville-Decatur-Florence  Residential   

          mkt_type            Quradate  National_exp  Alabama_exp  Sales_exp  \
0            Rural 2010-01-15 00:00:00             2            2          3   
10  Suburban_Urban 2010-01-15 00:00:00             4            4          4   
12  Suburban_Urban 2010-01-15 00:00:00             2            2          3   
38           Rural 2010-01-15 00:00:00             3            3          3   
44  Suburban_Urban 2010-01-15 00:00:00             4            5          4   
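Since a sheet that is not public, or a URL that does not request CSV, may come back as an HTML page instead of CSV, it can be worth checking the response before parsing it. This is a rough sketch, not part of the original answer; `looks_like_csv` is a hypothetical helper, and the checks are simple heuristics rather than anything Google documents.

```python
def looks_like_csv(content_type, body):
    """Heuristic: return False when the response is clearly an HTML page.

    content_type: value of the Content-Type response header (str).
    body: raw response bytes, e.g. r.content from requests.
    """
    if "text/html" in content_type.lower():
        return False
    head = body.lstrip()[:15].lower()
    return not (head.startswith(b"<!doctype") or head.startswith(b"<html"))


# Possible usage with the requests call shown above:
#   r = requests.get(url)
#   if not looks_like_csv(r.headers.get("Content-Type", ""), r.content):
#       raise ValueError("Got HTML, not CSV; check the URL and sharing settings")
```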

The new Google Spreadsheet URL format for getting CSV output is:

https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id

They have since changed the URL format slightly again, so you now need:

https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
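If you load several sheets or tabs, building this URL by hand gets error-prone. A small sketch for Python 3, assuming the `export?format=csv&gid=...` pattern shown above still applies; `csv_export_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def csv_export_url(sheet_id, gid=0):
    """Build a CSV export URL for one tab of a Google Sheet.

    sheet_id: the long key from the sheet's URL.
    gid: the tab identifier (0 for the first sheet).
    """
    query = urlencode({"format": "csv", "gid": gid})
    return "https://docs.google.com/spreadsheets/d/%s/export?%s" % (sheet_id, query)
```

For example, `csv_export_url('177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ')` gives the URL above without the trailing comment.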

I also found I needed to do the following to deal with Python 3, a slight revision to the above:

from io import StringIO 

and to get the file:

guid = 0  # for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
dataact = act.content.decode('utf-8')  # convert bytes to str for StringIO
# DataFrame.sort() was removed in later pandas versions; sort_index() sorts by the index
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=[0], thousands=',').sort_index()

actdf is now a full pandas dataframe with headers (column names).
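The decode step is needed because `requests` returns `content` as bytes on Python 3, while `io.StringIO` only accepts text. A minimal sketch of that conversion as a reusable function; `to_text_buffer` is a hypothetical name, and UTF-8 is an assumption about the encoding of the export.

```python
from io import StringIO


def to_text_buffer(content, encoding="utf-8"):
    """Wrap downloaded content in a StringIO, decoding bytes first if needed."""
    if isinstance(content, bytes):
        content = content.decode(encoding)
    return StringIO(content)
```

The result can be passed straight to `pd.read_csv` in place of `StringIO(dataact)`.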

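Answer 1: (score: 1)

In Google Sheets, click File > Publish to the web. Then select what you need to publish and choose .csv as the export format. You will get a link like this: https://docs.google.com/spreadsheets/d/<your sheets key here>/pub?gid=1317664180&single=true&output=csv

Then simply: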

import pandas as pd
pathtoCsv = r'https://docs.google.com/spreadsheets/d/<sheets key>/pub?gid=1317664180&single=true&output=csv'
dev = pd.read_csv(pathtoCsv)
print(dev)

Answer 2: (score: 0)

The current Google Drive URL for exporting as CSV is:

https://drive.google.com/uc?export=download&id=EnterIDHere

So:

import pandas as pd
pathtocsv = r'https://drive.google.com/uc?export=download&id=EnterIDHere'
df = pd.read_csv(pathtocsv)