Python: How to download an Excel file from a web page

Time: 2017-10-21 22:13:28

Tags: python excel

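  1. Go to this URL: https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15 (username = TrickyBen | password = TrickyBen123)
  2. Note that there is a (red) "Download Excel" button.
  3. I want to download the Excel file and convert it to a pandas DataFrame. I want to do this programmatically (i.e. from a script, not by clicking through the website manually). How do I do that?
  4. This code logs in as TrickyBen and makes a request to the site's API...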

    import requests
    from lxml import html
    from requests import Session
    import pandas as pd
    import shutil

    raceSession = Session()
    
    LoginDetails = {'login': 'TrickyBen', 'password': 'TrickyBen123'}
    
    LoginUrl = 'https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15/horsebase1.php'
    LoginPost = raceSession.post(LoginUrl, data=LoginDetails)
    
    RaceUrl = 'https://www.horseracebase.com/excelresults.php'
    RaceDataDetails =  {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}
    
    PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
    Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)
    
    Table = pd.read_table(Response.text)
    
    Table.to_csv('blahblah.csv')
    

If you inspect the page, you will see that the relevant element looks like this...

    <form action="excelresults.php" method="post">
        <input type="hidden" name="user" value="41495">
        <input type="hidden" name="racedate" value="2005-3-15">
        <input type="submit" class="downloadbutton" value="Excel">
    </form>
    

I get this error message...

    Traceback (most recent call last):
      File "/Users/Alex/Desktop/DateTest/hrpull.py", line 20, in <module>
        Table = pd.read_table(Response.text)
      File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 315, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
        self._make_engine(self.engine)
      File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 799, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1213, in __init__
        self._reader = _parser.TextReader(src, **kwds)
      File "pandas/parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)
      File "pandas/parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6861)
    IOError: File race_date race_time   track   race_name       race_restrictions_age   race_class  major   race_distance   prize_money     going_description   number_of_runners   place   distbt  horse_name  stall       trainer horse_age   jockey_name jockeys_claim   pounds  odds    fav     official_rating comptime    TotalDstBt  MedianOR    Dist_Furlongs       placing_numerical   RCode   BFSP    BFSP_Place  PlcsPaid    BFPlcsPaid      Yards   RailMove    RaceType    
    "2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "1st"       "Arcalis"   "0" "Johnson, J Howard" "5" "Lee, G"    "0" "161"   "21"        "136"   "3 mins 53.00s"     "121.5" "16.5"  "1" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
    "2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "2nd"   "6" "Wild Passion (GER)"    "0" "Meade, Noel"   "5" "Carberry, P"   "0" "161"   "11"        "0" "3 mins 53.00s" "6" "121.5" "16.5"  "2" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
    

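2 answers:

Answer 0 (score: 0)

I think you can see the data you want to download on another page of the site, for example by clicking "My Systems (v4)". If so, you can download that page with urllib.request.urlretrieve, then parse the data with html.parser.HTMLParser and process it however you like.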
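The parsing step suggested above could look roughly like the sketch below. It only assumes the data sits in an ordinary HTML table; the `TableParser` class, the `parse_table` helper, and the sample markup are made up for illustration, and the real page's structure may differ.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html_text):
    """Return the table rows found in html_text as lists of strings."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up sample markup; the real page is not shown in this thread.
sample_html = """
<table>
  <tr><th>place</th><th>horse_name</th></tr>
  <tr><td>1st</td><td>Arcalis</td></tr>
  <tr><td>2nd</td><td>Wild Passion (GER)</td></tr>
</table>
"""
rows = parse_table(sample_html)
```

The first row holds the headers, so something like `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a DataFrame.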

Answer 1 (score: 0)

If you look at the API called by the form's action, you will see that you need to send a POST request to this URL:

https://www.horseracebase.com/excelresults.php

with the following parameters:

data = {
    "user": "41495", # looks like this varies with login, so update in case you change your login id
    "racedate": "2005-3-15",
    "downloadbutton": "Excel"
}
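Since the `user` value seems to depend on the login, one option is to read the fields from the form instead of hard-coding them. The sketch below does that with the standard library, using the form markup from the question as sample input. `FormFieldParser` and `extract_form_fields` are hypothetical names. Note that the submit input in that markup has a `class` but no `name`, so this sketch does not produce a `downloadbutton` field; whether the server needs one is not confirmed here.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs of <input> tags inside one specific form."""

    def __init__(self, action):
        super().__init__()
        self.action = action   # value of the form's action attribute to match
        self.fields = {}
        self._inside = False   # True while we are inside the matching form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._inside = attrs.get("action") == self.action
        elif tag == "input" and self._inside and attrs.get("name"):
            # Inputs without a name attribute are not submitted by browsers.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def extract_form_fields(html_text, action):
    """Return a dict of named input fields for the form with this action."""
    parser = FormFieldParser(action)
    parser.feed(html_text)
    parser.close()
    return parser.fields


# The form markup quoted in the question.
page = """
<form action="excelresults.php" method="post">
    <input type="hidden" name="user" value="41495">
    <input type="hidden" name="racedate" value="2005-3-15">
    <input type="submit" class="downloadbutton" value="Excel">
</form>
"""
fields = extract_form_fields(page, "excelresults.php")
```

In a real script, `page` would be the text of the results page fetched with the logged-in session.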

You can do it like this:

response = raceSession.post(reqUrl, data=data)  # data= sends a form-encoded body; json= would send JSON

If that does not work, try adding headers to the request, e.g. headers=postHeaders. In this case you should set the Content-Type header, because you are sending form-encoded data:

headers = {"Content-Type": "application/x-www-form-urlencoded"} 

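Read this for more information on how to save the Excel file to disk.

Here is the response Postman shows for this request, so it looks like you do not need any headers other than Content-Type: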
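As a rough sketch of the saving step: the raw bytes of a response (with requests, that would be `Response.content`) can be written to disk in binary mode. The `save_bytes` helper, the file name, and the sample bytes below are made up for illustration. Judging by the traceback in the question, the body appears to be tab-separated text rather than a binary workbook.

```python
import os
import tempfile


def save_bytes(content, path):
    """Write raw bytes to path and return the number of bytes written."""
    with open(path, "wb") as fh:
        return fh.write(content)


# Stand-in for Response.content; abbreviated, made-up sample data.
sample = b'race_date\ttrack\n"2005-03-15"\t"Cheltenham"\n'

# Write into a temporary directory so the example leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "results.xls")
written = save_bytes(sample, out_path)
```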

[Image: screenshot of the Postman response]

Edit

Here is what you need to do:

raceSession = Session()

RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails =  {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)
# from StringIO import StringIO  # for Python 2.x
from io import StringIO  # for Python 3.x
Table = pd.read_table(StringIO(Response.text))
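The original error arises because pandas treats a plain string argument as a file path, so the text has to be wrapped in a file-like object first. Here is a small self-contained sketch of that idea. It uses `pd.read_csv` with `sep="\t"` as an equivalent of `read_table`, and the sample text is a shortened, made-up version of the data visible in the traceback.

```python
from io import StringIO

import pandas as pd

# Stand-in for Response.text: tab-separated, with quoted values.
text = (
    'race_date\ttrack\tplace\thorse_name\n'
    '"2005-03-15"\t"Cheltenham"\t"1st"\t"Arcalis"\n'
    '"2005-03-15"\t"Cheltenham"\t"2nd"\t"Wild Passion (GER)"\n'
)

# StringIO makes the string behave like an open file, so pandas parses
# its contents instead of looking for a file with that name.
df = pd.read_csv(StringIO(text), sep="\t")
```

From there, `df.to_csv('blahblah.csv')` would write the file the question was aiming for.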