Pandas: how to work with really big data?

Date: 2019-01-07 13:15:18

Tags: python pandas

My dataset is really big (1.2 million documents), and I need to create and analyse the data in a single pandas DataFrame. For now my code looks like this:

conn = psycopg2.connect("dbname=monty user=postgres host=localhost password=postgres")
cur = conn.cursor('aggre')
cur.execute("SELECT * FROM binance.zrxeth_ob_indicators;")
row = cur.fetchall()
df = pd.DataFrame(row,columns = ['timestamp', 'topAsk', 'topBid', 'CPA', 'midprice', 'CPB', 'spread', 'CPA%', 'CPB%'])

But it takes ages to load everything locally into the variable df. What I tried so far was this:

for row in cur:
      dfsub = pd.DataFrame(row,columns=['timestamp', 'topAsk', 'topBid', 'CPA', 'midprice', 'CPB', 'spread', 'CPA%', 'CPB%'])
      df = df.concat([df,dfsub])

but it gives me the following error: DataFrame constructor not properly called!

Any ideas? Thanks!

3 Answers:

Answer 0 (score: 0)

You can do something like this:

import getpass

import pandas as pd
import psycopg2


class Postgres:
    def __init__(self, host, database, user=None, password='', schema='public'):
        self.user = user or getpass.getuser()
        self.database = database
        self.host = host
        self.engine = self.create_engine(self.host, self.database, self.user, password)
        self.schema = schema

    @staticmethod
    def create_engine(host, database, user, password):
        # returns a psycopg2 connection built from a libpq connection URI
        return psycopg2.connect("postgresql://{user}:{password}@{host}/{database}".format(
            host=host,
            database=database,
            user=user,
            password=password
        ))

    def execute(self, query: object) -> object:
        """
        Run a query and return the result set as a DataFrame.
        :param query: SQL string
        :return: pd.DataFrame
        """
        result_df = pd.read_sql(query, self.engine)
        self.engine.commit()
        return result_df

With this you use pandas' optimized DataFrame creation directly from a Postgres result.

But given the size of your dataset, it will still take some time to read all the data into memory.
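
A quick usage sketch of the class above, reusing the host, database and credentials from the question (only an illustration, adapt to your setup):

# hypothetical usage of the Postgres helper defined above
db = Postgres(host="localhost", database="monty", user="postgres", password="postgres")
df = db.execute("SELECT * FROM binance.zrxeth_ob_indicators")
print(df.shape)  # one DataFrame holding the full result set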

Answer 1 (score: 0)

Pandas has a nice built-in read_sql method which should be very efficient

i.e. just do:

df = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn)

and it should just work...

1.2 million rows on its own is not a lot; given your column count/names it is probably < 300MB of RAM (30 bytes per value * 9 columns * 1.2e6 rows) and should take < 10 seconds on a recent machine.
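
If you want to double-check the footprint once the frame is loaded, pandas can report it directly (not part of the original answer, just a quick sanity check, reusing conn from the question):

df = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # actual in-memory size of the DataFrame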

Answer 2 (score: -1)

I think that since your document set is so large, it will take a long time to load into memory no matter how you go about it. I'd suggest that, if you don't need the whole dataset in memory at once, you use pandas' built-in chunked loading. It lets you load and process the data in sequential chunks, which is exactly the use case it was designed for.

For example, see this question: How to read a 6 GB csv file with pandas
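
A minimal sketch of what chunked loading straight from Postgres could look like, reusing conn and the table from the question (the chunk size is an arbitrary example value, and process() stands in for your own analysis step):

chunks = pd.read_sql("SELECT * FROM binance.zrxeth_ob_indicators", conn, chunksize=100000)
for chunk in chunks:
    # each chunk is a DataFrame of up to 100k rows, so the full 1.2M rows never sit in memory at once
    process(chunk)  # placeholder for whatever aggregation/analysis you need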