I'm trying to fetch an Excel file from a website (https://rigcount.bakerhughes.com/na-rig-count), download it, and keep it in memory so I can read it with Pandas. The file is an .xlsb with over 700,000 rows.
With the code I'm using I only get 1,457 rows... I tried using chunksize, but it didn't help.
Here is my code:
(code was posted as an image and is not shown here)
I also tried saving the file locally and reopening it, but I couldn't get past the encoding problems.
Thanks for your help! :)
Answer 0 (score: 2)
```python
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # Grab the "Table" file link from the page
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # Stream the .xlsb to disk in chunks
        async with client.stream('GET', tfile) as r:
            fname = r.headers.get('content-disposition').split('=')[-1]
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)
        # pd.read_excel is blocking, so run it in a worker thread
        df = await trio.to_thread.run_sync(
            partial(pd.read_excel, fname, sheet_name=3, engine="pyxlsb"))
        print(df)


if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
```
Output:
```
Country County Basin DrillFor ... Week RigCount State/Province PublishDate
0 UNITED STATES SABINE Haynesville Gas ... 13 1 LOUISIANA 40634
1 UNITED STATES TERREBONNE Other Oil ... 13 1 LOUISIANA 40634
2 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
3 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
4 UNITED STATES EDDY Permian Oil ... 13 1 NEW MEXICO 40634
... ... ... ... ... ... ... ... ... ...
769390 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769391 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769392 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769393 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769394 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
[769395 rows x 13 columns]
```
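A side note (my addition, not part of the original answer): pyxlsb hands dates through as raw Excel serial numbers, which is why `PublishDate` shows values like 40634 instead of dates. They can be converted with `pd.to_datetime` using the Excel epoch:

```python
import pandas as pd

# Excel stores dates as day counts from its epoch (1899-12-30 on Windows);
# pyxlsb passes these through as plain numbers.
serials = pd.Series([40634, 44393], name="PublishDate")
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)
# 0   2011-04-01
# 1   2021-07-16
```

The same call applied to the whole column (`pd.to_datetime(df["PublishDate"], unit="D", origin="1899-12-30")`) gives proper timestamps for the full frame.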
Update:
> The problem is that the Excel file has 2 hidden worksheets. The second sheet really does contain 1,457 rows; the main data is actually on the fourth sheet, so `sheet_name=3` does the trick.
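If you run into a similar issue, a quick way to list every sheet (hidden ones included) before committing to a `sheet_name` index is `pd.ExcelFile`. A minimal sketch, using a small in-memory xlsx built with openpyxl to stand in for the downloaded file:

```python
import io
import pandas as pd
from openpyxl import Workbook

# Build a workbook with a hidden sheet to mimic the downloaded file
wb = Workbook()
wb.active.title = "Cover"
hidden = wb.create_sheet("Hidden_Lookup")
hidden.sheet_state = "hidden"  # hidden sheets still count in sheet_names
wb.create_sheet("Master Data")
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# ExcelFile lists all sheets, so you can pick the right index for sheet_name=
xl = pd.ExcelFile(buf)
print(xl.sheet_names)  # ['Cover', 'Hidden_Lookup', 'Master Data']
```

For the .xlsb file in the answer, the equivalent would be `pd.ExcelFile(fname, engine="pyxlsb").sheet_names`.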
Last update:
To follow the Python DRY principle, I noticed that we don't actually need to save the file locally, or even copy it into a separate in-memory buffer before loading it into Pandas. The `response` content is itself held in memory, so we can pass `r.content` directly to pandas and load everything in one go!
Using the following code:
```python
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # Download the whole file into memory and feed it to pandas directly
        r = await client.get(tfile)
        df = await trio.to_thread.run_sync(
            partial(pd.read_excel, r.content, sheet_name=3, engine="pyxlsb"))
        print(df)


if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
```
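One caveat worth noting (my addition): some newer pandas releases warn when raw bytes are passed to `read_excel`, so it can be safer to wrap the bytes in `io.BytesIO` yourself. A minimal helper sketching that, kept engine-agnostic so it works for any Excel format:

```python
import io
import pandas as pd


def read_excel_bytes(content: bytes, **kwargs) -> pd.DataFrame:
    """Read an Excel workbook from raw bytes (e.g. r.content from httpx).

    Wrapping the bytes in a BytesIO file-like object avoids the raw-bytes
    input warning emitted by newer pandas versions.
    """
    return pd.read_excel(io.BytesIO(content), **kwargs)
```

For the rig-count file above the call would be `read_excel_bytes(r.content, sheet_name=3, engine="pyxlsb")`.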