Question

My goal is to access some data from a website and keep it in memory (rather than downloading it to disk) so that I can process it further. Here is my Python code:

import pandas as pd 
import requests
from requests.auth import HTTPBasicAuth

year = 2019
month_str = 'Jan'
date = 2
month = 1

user = XXXX
password = XXXX

response = requests.get('http_some_url/%i/%s/%02d/%i%02d%02d.gz' % (year,month_str,date,year,month,date), auth = HTTPBasicAuth(user, password))
x = pd.read_csv(response.text, compression='gzip', sep = '|')
print(x.head())

The data is located in the folder "year" => "month_str" => "date", in a file named "year + month + date.gz". When I run this code, it returns

"ValueError: embedded null byte".

What is the correct way to do this?

Update:

print(response)
<Response [200]>

When I print the response it shows 200, which means the request succeeded.

Update:

response = requests.get('http_some_url/%i/%s/%02d/%i%02d%02d.gz' % (year,month_str,date,year,month,date), auth = HTTPBasicAuth(user, password))
print(response)
x = pd.read_csv(response.content, compression='gzip', sep = '|')
print(x)

After replacing response.text with response.content and printing, it returns:

AttributeError: 'bytes' object has no attribute 'read'

Here is some sample content from the gzip file:

093013399690000|310001|C|A|59.85|73.15|A||
093030000913000|353701|C|A|59.85|73.15|B||
093100000411000|460501|C|A|59.85|73.15|B||
093130000630000|697401|C|A|59.85|73.15|B||
093200000464000|841501|C|A|59.85|73.15|B||
093230000508000|1013801|C|A|59.85|73.15|B||
093300000550000|1148701|C|A|59.85|73.15|B||
093330000394000|1313701|C|A|59.85|73.15|B||
093400000590000|1485801|C|A|59.85|73.15|B||
093430000495000|1652601|C|A|59.85|73.15|B||
093500000593000|1856201|C|A|59.85|73.15|B||

Answer 1

It looks like your string formatting is wrong.

f'http_some_url/{year}/{month_str}/{date}/{year}{month}{date}.gz'
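For comparison, here is a small sketch of how the question's %-style formatting maps to an f-string. It builds only the path part (the host prefix is omitted), and the variable names old_style, new_style and unpadded are illustrative. The `:02d` format specs are what reproduce the zero padding; the unpadded variant is included only to show how the path changes without them.

```python
year = 2019
month_str = 'Jan'
date = 2
month = 1

# Original %-style formatting from the question (path part only).
old_style = '%i/%s/%02d/%i%02d%02d.gz' % (year, month_str, date, year, month, date)

# Equivalent f-string; the :02d specs keep the zero padding.
new_style = f'{year}/{month_str}/{date:02d}/{year}{month:02d}{date:02d}.gz'

# Without the format specs the padding is lost and the path differs.
unpadded = f'{year}/{month_str}/{date}/{year}{month}{date}.gz'

print(old_style)
print(new_style)
print(unpadded)
```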

Answer 2

You only need pandas:

Here is a proof of concept:

import pandas as pd

year = 2019
month_str = 'Jan'
date = 2
month = 1

user = XXXX
password = XXXX

gzip_url = f'http://{user}:{password}@some_url/{year}/{month_str}/{date:02d}/{year}{month:02d}{date:02d}.gz'

x = pd.read_csv(gzip_url, compression='gzip', sep = '|')
print(x.head())
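The sample rows in the question do not appear to include a header line, so by default read_csv would use the first data row as column names. The sketch below assumes that is the case and parses three of the sample rows from an in-memory string; header=None keeps the first row as data, and dtype=str is an optional choice to preserve the leading zeros in the first field.

```python
import io

import pandas as pd

# First rows of the sample data shown in the question.
sample = (
    "093013399690000|310001|C|A|59.85|73.15|A||\n"
    "093030000913000|353701|C|A|59.85|73.15|B||\n"
    "093100000411000|460501|C|A|59.85|73.15|B||\n"
)

# header=None treats the first line as data (assumption: the file has no header row).
# dtype=str keeps values such as 093013399690000 from losing their leading zero.
df = pd.read_csv(io.StringIO(sample), sep='|', header=None, dtype=str)

print(df.shape)
print(df[0].tolist())
```

The two trailing delimiters on each line produce two empty columns, which is why the frame has nine columns rather than seven.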

Python 3.7.5 (default, Oct 17 2019, 12:16:48)
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> gzip_file = 'http://127.0.0.1:8000/testfile.gz'
>>> df = pd.read_csv(gzip_file, compression='gzip', sep='|')
>>> df.head()
   093013399690000   310001  C  A  59.85  73.15  A.1  Unnamed: 7  Unnamed: 8
0   93030000913000   353701  C  A  59.85  73.15    B         NaN         NaN
1   93100000411000   460501  C  A  59.85  73.15    B         NaN         NaN
2   93130000630000   697401  C  A  59.85  73.15    B         NaN         NaN
3   93200000464000   841501  C  A  59.85  73.15    B         NaN         NaN
4   93230000508000  1013801  C  A  59.85  73.15    B         NaN         NaN
>>>

As we discussed in chat, an alternative is to use requests.
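The code for the requests-based alternative is not shown here, so the following is only a minimal sketch of one way it could look, not the answerer's actual code. read_csv treats a string argument as a path or URL and otherwise expects a file-like object, which is consistent with the two errors in the question, so the sketch wraps response.content in io.BytesIO. The function names gzip_bytes_to_frame and fetch_frame are hypothetical, and header=None assumes the file has no header row.

```python
import gzip
import io

import pandas as pd
import requests
from requests.auth import HTTPBasicAuth


def gzip_bytes_to_frame(raw: bytes) -> pd.DataFrame:
    """Parse gzip-compressed, pipe-delimited bytes held in memory."""
    # Wrap the bytes in a file-like object so read_csv can read from it.
    # header=None is an assumption: the sample data has no header row.
    return pd.read_csv(io.BytesIO(raw), compression='gzip', sep='|', header=None)


def fetch_frame(url: str, user: str, password: str) -> pd.DataFrame:
    """Download a .gz file with basic auth and parse it without writing to disk."""
    response = requests.get(url, auth=HTTPBasicAuth(user, password))
    response.raise_for_status()
    # response.content is the raw body as bytes; response.text would decode it as text.
    return gzip_bytes_to_frame(response.content)


# Offline check: build gzip data in memory instead of calling a real server.
sample = (
    b"093013399690000|310001|C|A|59.85|73.15|A||\n"
    b"093030000913000|353701|C|A|59.85|73.15|B||\n"
)
df = gzip_bytes_to_frame(gzip.compress(sample))
print(df.shape)
```

If the server sends the file with a Content-Encoding: gzip header, requests decompresses the body itself, and in that case compression='gzip' would need to be dropped.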

How to process data without saving it locally
