I'm trying to fetch an Excel file from a website (https://rigcount.bakerhughes.com/na-rig-count), download it, and keep it in memory so I can read it with Pandas. The file is an .xlsb with over 700,000 rows.
With the code I'm using I only get 1,457 rows... I tried using chunksize, but it didn't help.
Here is my code:
(code was posted as an image and is not shown here)
I also tried saving the file locally and reopening it, but I couldn't get past the encoding problems.
Thanks for your help! :)
Answer 0 (score: 2)
```python
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        # Grab the "Table" file link from the page
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # Stream the .xlsb to disk in chunks
        async with client.stream('GET', tfile) as r:
            fname = r.headers.get('content-disposition').split('=')[-1]
            async with await trio.open_file(fname, 'wb') as f:
                async for chunk in r.aiter_bytes():
                    await f.write(chunk)
        # pd.read_excel is blocking, so run it in a worker thread
        df = await trio.to_thread.run_sync(
            partial(pd.read_excel, fname, sheet_name=3, engine="pyxlsb"))
        print(df)


if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
```
Output:
```
Country County Basin DrillFor ... Week RigCount State/Province PublishDate
0 UNITED STATES SABINE Haynesville Gas ... 13 1 LOUISIANA 40634
1 UNITED STATES TERREBONNE Other Oil ... 13 1 LOUISIANA 40634
2 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
3 UNITED STATES VERMILION Other Gas ... 13 1 LOUISIANA 40634
4 UNITED STATES EDDY Permian Oil ... 13 1 NEW MEXICO 40634
... ... ... ... ... ... ... ... ... ...
769390 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769391 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769392 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769393 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
769394 UNITED STATES KERN Other Oil ... 29 1 CALIFORNIA 44393
[769395 rows x 13 columns]
```
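A side note (my addition, not part of the original answer): pyxlsb hands dates through as raw Excel serial numbers, which is why `PublishDate` shows values like 40634 instead of dates. They can be converted with `pd.to_datetime` using the Excel epoch:

```python
import pandas as pd

# Excel stores dates as day counts from its epoch (1899-12-30 on Windows);
# pyxlsb passes these through as plain numbers.
serials = pd.Series([40634, 44393], name="PublishDate")
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)
# 0   2011-04-01
# 1   2021-07-16
```

The same call applied to the whole column (`pd.to_datetime(df["PublishDate"], unit="D", origin="1899-12-30")`) gives proper timestamps for the full frame.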
Update:
> The problem is that the Excel file has 2 hidden worksheets. The second sheet really does contain 1,457 rows; the main data is actually on the fourth sheet, so `sheet_name=3` does the trick.
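If you run into a similar issue, a quick way to list every sheet (hidden ones included) before committing to a `sheet_name` index is `pd.ExcelFile`. A minimal sketch, using a small in-memory xlsx built with openpyxl to stand in for the downloaded file:

```python
import io
import pandas as pd
from openpyxl import Workbook

# Build a workbook with a hidden sheet to mimic the downloaded file
wb = Workbook()
wb.active.title = "Cover"
hidden = wb.create_sheet("Hidden_Lookup")
hidden.sheet_state = "hidden"  # hidden sheets still count in sheet_names
wb.create_sheet("Master Data")
buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

# ExcelFile lists all sheets, so you can pick the right index for sheet_name=
xl = pd.ExcelFile(buf)
print(xl.sheet_names)  # ['Cover', 'Hidden_Lookup', 'Master Data']
```

For the .xlsb file in the answer, the equivalent would be `pd.ExcelFile(fname, engine="pyxlsb").sheet_names`.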
Last update:
To follow the Python DRY principle, I noticed that we don't actually need to save the file locally, or even copy it into a separate in-memory buffer before loading it into Pandas. The `response` content is itself held in memory, so we can pass `r.content` directly to pandas and load everything in one go!
Using the following code:
```python
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
from functools import partial


async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        tfile = soup.select_one('.file-link:-soup-contains(Table)').a['href']
        # Download the whole file into memory and feed it to pandas directly
        r = await client.get(tfile)
        df = await trio.to_thread.run_sync(
            partial(pd.read_excel, r.content, sheet_name=3, engine="pyxlsb"))
        print(df)


if __name__ == "__main__":
    trio.run(main, 'https://rigcount.bakerhughes.com/na-rig-count')
```
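One caveat worth noting (my addition): some newer pandas releases warn when raw bytes are passed to `read_excel`, so it can be safer to wrap the bytes in `io.BytesIO` yourself. A minimal helper sketching that, kept engine-agnostic so it works for any Excel format:

```python
import io
import pandas as pd


def read_excel_bytes(content: bytes, **kwargs) -> pd.DataFrame:
    """Read an Excel workbook from raw bytes (e.g. r.content from httpx).

    Wrapping the bytes in a BytesIO file-like object avoids the raw-bytes
    input warning emitted by newer pandas versions.
    """
    return pd.read_excel(io.BytesIO(content), **kwargs)
```

For the rig-count file above the call would be `read_excel_bytes(r.content, sheet_name=3, engine="pyxlsb")`.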