Question

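This is the reverse of what most people want to do when converting between lists and dataframes.

I am looking to convert a large dataframe (10M+ rows, 20+ columns) into a list of strings, where each entry is the string representation of one row of the dataframe. I can do this with pandas' to_csv() method, but I am wondering whether there is a faster way, since this is proving to be a bottleneck in my code.

Minimal working example: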

import numpy as np
import pandas as pd

# Create the initial dataframe.
size = 10000000
cols = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)

# Convert to the required list structure
ret_val = df.to_csv(index=False, header=False).split("\n")[:-1]

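For a dataframe of 10,000,000 rows, the conversion above takes around 90 seconds on a single thread of my Core i9 and is heavily CPU bound. I would like to reduce that by an order of magnitude if at all possible.

EDIT: I am not trying to save the data to a .csv or to a file. I just want to convert the dataframe to an array of strings.

EDIT: Input/output example with only 5 columns: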

In  [1]: df.head(10)
Out [1]:    a       b       c       d       e
         0  a_0     b_0     c_0     d_0     e_0
         1  a_1     b_1     c_1     d_1     e_1
         2  a_2     b_2     c_2     d_2     e_2
         3  a_3     b_3     c_3     d_3     e_3
         4  a_4     b_4     c_4     d_4     e_4
         5  a_5     b_5     c_5     d_5     e_5
         6  a_6     b_6     c_6     d_6     e_6
         7  a_7     b_7     c_7     d_7     e_7
         8  a_8     b_8     c_8     d_8     e_8
         9  a_9     b_9     c_9     d_9     e_9

In  [2]: ret_val[:10]
Out [2]: ['a_0,b_0,c_0,d_0,e_0',
          'a_1,b_1,c_1,d_1,e_1',
          'a_2,b_2,c_2,d_2,e_2',
          'a_3,b_3,c_3,d_3,e_3',
          'a_4,b_4,c_4,d_4,e_4',
          'a_5,b_5,c_5,d_5,e_5',
          'a_6,b_6,c_6,d_6,e_6',
          'a_7,b_7,c_7,d_7,e_7',
          'a_8,b_8,c_8,d_8,e_8',
          'a_9,b_9,c_9,d_9,e_9']

Answer 1

multiprocessing gives me a speed-up of about 2.5x...

import multiprocessing

# df from the OP's code above must be available in global scope

def fn(i):
    return df[i:i+1000].to_csv(index=False, header=False).split('\n')[:-1]

with multiprocessing.Pool() as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), 1000)):
        result.extend(a)

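This reduces the overall time for 1M rows from 6.8 s to 2.8 s on my laptop, so it should hopefully scale to the larger number of cores in an i9 CPU.

This relies on Unix fork semantics to share the dataframe with the child processes, and it obviously does a bit more work overall, but it might help...

Using Massifox's numpy.savetxt suggestion together with multiprocessing takes this down to 2.0 seconds; just map the following function: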
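To make the dependence on fork explicit, here is a minimal sketch that requests the "fork" start method by name instead of relying on the platform default. The names `_shared_df`, `CHUNK`, `_format_chunk` and `rows_as_strings` are made up for illustration, and this is just one way to arrange it, assuming a Unix-like system:

```python
import multiprocessing

import numpy as np
import pandas as pd

_shared_df = None  # hypothetical module-level slot; forked workers inherit it
CHUNK = 1000       # rows per task, same chunk size as in the example above


def _format_chunk(start):
    # Runs in a worker process: format one slice of the inherited dataframe.
    chunk = _shared_df.iloc[start:start + CHUNK]
    return chunk.to_csv(index=False, header=False).split("\n")[:-1]


def rows_as_strings(df, processes=2):
    """Return one comma-joined string per row, formatted in forked workers."""
    global _shared_df
    _shared_df = df  # must be set before the pool forks its workers
    ctx = multiprocessing.get_context("fork")  # not available on Windows
    with ctx.Pool(processes) as pool:
        parts = pool.map(_format_chunk, range(0, len(df), CHUNK))
    # pool.map preserves task order, so the rows come back in order.
    return [line for part in parts for line in part]
```

With the "fork" start method the workers see the parent's memory at fork time, so only the chunk offsets and the resulting strings are pickled, not the dataframe itself.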

from io import StringIO

N = 1000  # rows per chunk; assumed to match the chunk size used above

def fn2(i):
    with StringIO() as fd:
        np.savetxt(fd, df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

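The result is essentially the same.

Your comment that "the dataframe is a variable in a class" can be dealt with in a variety of ways. A simple one is to pass the dataframe to the Pool initializer, at which point it won't be pickled (under Unix anyway), and stash a reference to it in a global variable somewhere. Each worker process can then use this reference, e.g.: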

def stash_df(df):
    global the_df
    the_df = df

def fn(i):
    with StringIO() as fd:
        np.savetxt(fd, the_df[i:i+N], fmt='%s', delimiter=',')
        return fd.getvalue().split('\n')[:-1]

with multiprocessing.Pool(initializer=stash_df, initargs=(df,)) as pool:
    result = []
    for a in pool.map(fn, range(0, len(df), N)):
        result.extend(a)

This is fine as long as each Pool is used with a single dataframe.

Answer 2

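You can try a few different approaches to speed up writing the data to disk:

Writing to a compressed file can speed up writing by up to 10x:

df.to_csv('output.csv.gz', header=True, index=False, chunksize=100000, compression='gzip', encoding='utf-8')
Choose the chunk size that works best for you.
Switch to HDF format:

df.to_hdf(r'output.h5', key='df', mode='w')  # key names the dataset inside the file; 'df' is an arbitrary choice
Use numpy, following krassowski's answer. For example, with the following df: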

df = pd.DataFrame({'A': range(1000000)})
df['B'] = df.A + 1.0
df['C'] = df.A + 2.0
df['D'] = df.A + 3.0

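pandas to_csv:

df.to_csv('pandas_to_csv', index=False)
On my machine this takes 6.45 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each).

numpy savetxt to csv:

np.savetxt('numpy_savetxt', df.values, fmt='%d,%.1f,%.1f,%.1f', header=','.join(df.columns), comments='')
On my machine this takes 3.38 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).
Use Pandarallel.
It is a simple and efficient tool for parallelizing pandas computations across all your CPUs (Linux and macOS only), and it can significantly speed up pandas computations with a one-line change. Pretty cool!
You could also consider replacing the pandas dataframe with a Dask dataframe. Its CSV API is very similar to that of pandas.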
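Since the question asks for strings in memory rather than a file on disk, the savetxt idea can also be pointed at an in-memory buffer. The following is only a sketch of that variation (the helper name `rows_via_savetxt` is made up, and no timing is claimed for it):

```python
from io import StringIO

import numpy as np
import pandas as pd


def rows_via_savetxt(df):
    """Format every row with numpy.savetxt into memory and split into lines."""
    with StringIO() as buf:
        # fmt='%s' formats each cell with str(); no quoting or escaping is
        # applied, so cells containing commas or newlines would break rows.
        np.savetxt(buf, df.to_numpy(), fmt="%s", delimiter=",")
        # savetxt ends every row with '\n', so drop the trailing empty item.
        return buf.getvalue().split("\n")[:-1]
```

Unlike to_csv, this does no CSV quoting, so it is only equivalent when the cell values contain neither the delimiter nor line breaks.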

Answer 3

Using a dictionary improves performance slightly:

size = 100000
cols = list('abcdefghijklmnopqrstuvwxyz')

Dictionary version:

%%timeit
dict_res = {}
for col in cols:
    dict_res[col] = ["%s_%d" % (col, n) for n in np.arange(size)]
df2 = pd.DataFrame(dict_res)
# 1.56 s ± 99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Your example:

%%timeit
df = pd.DataFrame()
for col in cols:
    df[col] = np.arange(size)
    df[col] = "%s_" % col + df[col].astype(str)
# 1.91 s ± 84.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Multiprocessing

Multiprocessing could be applied here as well. I can't run it on my PC right now, but performance would probably improve.

Answer 4

Try this solution:

list_of_string = df.head(5).set_index(cols[0]).to_string(header=False).split('\n')[1:]

# output:
['a_0  b_1  c_1  d_1  e_1  f_1  g_1  h_1  i_1  j_1  k_1  l_1  m_1  n_1  o_1  p_1  q_1  r_1  s_1  t_1  u_1  v_1  w_1  x_1  y_1  z_1',
     'a_1  b_2  c_2  d_2  e_2  f_2  g_2  h_2  i_2  j_2  k_2  l_2  m_2  n_2  o_2  p_2  q_2  r_2  s_2  t_2  u_2  v_2  w_2  x_2  y_2  z_2',
     'a_2  b_3  c_3  d_3  e_3  f_3  g_3  h_3  i_3  j_3  k_3  l_3  m_3  n_3  o_3  p_3  q_3  r_3  s_3  t_3  u_3  v_3  w_3  x_3  y_3  z_3',
     'a_3  b_4  c_4  d_4  e_4  f_4  g_4  h_4  i_4  j_4  k_4  l_4  m_4  n_4  o_4  p_4  q_4  r_4  s_4  t_4  u_4  v_4  w_4  x_4  y_4  z_4',
     'a_4  b_5  c_5  d_5  e_5  f_5  g_5  h_5  i_5  j_5  k_5  l_5  m_5  n_5  o_5  p_5  q_5  r_5  s_5  t_5  u_5  v_5  w_5  x_5  y_5  z_5']

If you want to replace the spaces with commas:

[s.replace('  ', ',') for s in list_of_string]
# output:
['a_0,b_1,c_1,d_1,e_1,f_1,g_1,h_1,i_1,j_1,k_1,l_1,m_1,n_1,o_1,p_1,q_1,r_1,s_1,t_1,u_1,v_1,w_1,x_1,y_1,z_1',
 'a_1,b_2,c_2,d_2,e_2,f_2,g_2,h_2,i_2,j_2,k_2,l_2,m_2,n_2,o_2,p_2,q_2,r_2,s_2,t_2,u_2,v_2,w_2,x_2,y_2,z_2',
 'a_2,b_3,c_3,d_3,e_3,f_3,g_3,h_3,i_3,j_3,k_3,l_3,m_3,n_3,o_3,p_3,q_3,r_3,s_3,t_3,u_3,v_3,w_3,x_3,y_3,z_3',
 'a_3,b_4,c_4,d_4,e_4,f_4,g_4,h_4,i_4,j_4,k_4,l_4,m_4,n_4,o_4,p_4,q_4,r_4,s_4,t_4,u_4,v_4,w_4,x_4,y_4,z_4',
 'a_4,b_5,c_5,d_5,e_5,f_5,g_5,h_5,i_5,j_5,k_5,l_5,m_5,n_5,o_5,p_5,q_5,r_5,s_5,t_5,u_5,v_5,w_5,x_5,y_5,z_5']
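One caveat with the replace step: to_string() pads columns so that they line up, so replacing a fixed run of two spaces only works cleanly when every cell in a column has the same width. As a hedged alternative (the helper name `rows_joined` is made up for illustration), the cells can be joined directly:

```python
import pandas as pd


def rows_joined(df, sep=","):
    """Join the cells of each row with `sep`, without going through to_string()."""
    # Convert each column to a plain list of str once, then walk rows with zip.
    columns = [df[c].astype(str).tolist() for c in df.columns]
    return [sep.join(cells) for cells in zip(*columns)]
```

This keeps the first column in the output as well, so no set_index() call is needed.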

You can speed this code up by following the tips I gave you in my previous answers.

Hint: Dask, Pandarallel and multiprocessing are your friends!

Is there a fast way to convert a pandas dataframe of columns into a list of strings?
