Scraping an Excel file and reading it on the fly

Date: 2021-07-16 14:29:48

Tags: python dataframe web-scraping beautifulsoup

I am trying to get an Excel file from a website (https://rigcount.bakerhughes.com/na-rig-count), download it and keep it in memory so that I can read it with Pandas. The file is an .xlsb with more than 700,000 rows.

With the code I am using, I only get 1457 rows... I tried using chunksize, but it didn't help.

Here is my code:

[code screenshot from the original question, not reproduced here]

I tried saving it locally and reopening it, but I could not get past the encoding problem.

Thanks for your help! :)
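For reference, a minimal synchronous sketch of the intended approach (download the .xlsb and read it entirely in memory with the pyxlsb engine). The direct file URL below is a hypothetical placeholder, since the real link has to be scraped from the page:

import io

import pandas as pd
import requests

# Hypothetical direct link to the .xlsb; on the real site it must be scraped from the page first.
XLSB_URL = "https://example.com/rig_count_table.xlsb"

resp = requests.get(XLSB_URL)
resp.raise_for_status()

# .xlsb is a binary format, so work with raw bytes (no text decoding) and use the pyxlsb engine.
df = pd.read_excel(io.BytesIO(resp.content), engine="pyxlsb")
print(len(df))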

1 Answer:

Answer 0 (score: 2)

import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # Locate the "Table" download link on the rig-count page.
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # Stream the .xlsb to disk in binary mode, using the filename from the Content-Disposition header.
        async with client.stream('GET', tfile) as r:
            fname = r.headers.get('content-disposition').split('=')[-1]
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)

        # Run the blocking pandas read in a worker thread so the Trio event loop stays responsive.
        df = await trio.to_thread.run_sync(partial(pd.read_excel, fname, sheet_name=3, engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')

Output:

              Country      County        Basin DrillFor  ... Week RigCount State/Province  PublishDate
0       UNITED STATES      SABINE  Haynesville      Gas  ...   13        1      LOUISIANA        40634    
1       UNITED STATES  TERREBONNE        Other      Oil  ...   13        1      LOUISIANA        40634    
2       UNITED STATES   VERMILION        Other      Gas  ...   13        1      LOUISIANA        40634    
3       UNITED STATES   VERMILION        Other      Gas  ...   13        1      LOUISIANA        40634    
4       UNITED STATES        EDDY      Permian      Oil  ...   13        1     NEW MEXICO        40634    
...               ...         ...          ...      ...  ...  ...      ...            ...          ...    
769390  UNITED STATES        KERN        Other      Oil  ...   29        1     CALIFORNIA        44393    
769391  UNITED STATES        KERN        Other      Oil  ...   29        1     CALIFORNIA        44393    
769392  UNITED STATES        KERN        Other      Oil  ...   29        1     CALIFORNIA        44393    
769393  UNITED STATES        KERN        Other      Oil  ...   29        1     CALIFORNIA        44393    
769394  UNITED STATES        KERN        Other      Oil  ...   29        1     CALIFORNIA        44393    

[769395 rows x 13 columns]
Note: there appears to be a bug in the pyxlsb reader; reading the sheet by index triggers it, but using sheet_name='Master Data' works correctly.
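
For comparison, the by-name read mentioned in the note only changes the sheet_name argument (a small sketch; the filename is a hypothetical placeholder for the downloaded workbook):

import pandas as pd

# Hypothetical local filename for the downloaded .xlsb.
fname = "rig_count_table.xlsb"

# Select the sheet by name instead of by positional index.
df = pd.read_excel(fname, sheet_name="Master Data", engine="pyxlsb")
print(len(df))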

Update


The problem is that the Excel file has 2 hidden worksheets, and the second sheet really does contain only 1457 rows. The Master Data is actually the fourth sheet, so sheet_name=3 is correct.
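
One way to confirm that layout is to list the workbook's sheets with pandas' ExcelFile (a sketch; the filename is again a hypothetical placeholder):

import pandas as pd

# Hypothetical local filename for the downloaded .xlsb.
fname = "rig_count_table.xlsb"

with pd.ExcelFile(fname, engine="pyxlsb") as xl:
    # Should list every sheet in the workbook, hidden ones included; 'Master Data' sits at index 3 here.
    print(xl.sheet_names)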

Final update

Following the Python DRY Principle, I noticed that we don't actually need to save the file locally, or even write it out and read it back, before loading it into Pandas.

In fact, the response content itself is already held in memory, so we can load everything in one go by passing r.content directly to pandas!

Use the following code:

import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        r = await client.get(tfile)
        # The downloaded bytes (r.content) never touch the disk, so hand them straight to pandas.
        df = await trio.to_thread.run_sync(partial(pd.read_excel, r.content, sheet_name=3, engine="pyxlsb"))
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
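
One last note: some pandas versions warn about passing raw bytes to read_excel, and wrapping r.content in a BytesIO object is an equivalent form. A sketch of just the reading step, assuming the bytes have already been fetched:

import io

import pandas as pd


def read_rig_count(content: bytes) -> pd.DataFrame:
    # Wrap the in-memory bytes in a file-like object before handing them to pandas.
    return pd.read_excel(io.BytesIO(content), sheet_name=3, engine="pyxlsb")

The same trio.to_thread.run_sync(partial(...)) wrapper used above can call this helper, so the blocking read still runs outside the event loop.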