Question

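I have a DataFrame, and I want to compute how much time each id spends on each app_id. Because the numbers of distinct id and app_id values are both large, I want to store the result in a sparse.csr_matrix.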

Input:

import pandas as pd 
import numpy as np
import random, string
def randomword(length):
    # Build `length` random app ids: one lowercase letter followed by a number in [0, 1000)
    letters = string.ascii_lowercase
    nums = np.arange(1000)
    appList = []
    for i in range(length):
        appList.append(''.join([random.choice(letters),
                                str(random.choice(nums))]))
    return appList
appList= list(randomword(300000))
timeList= [random.randrange(0, 10000, 1) for _ in range(300000)]
idList= [random.randrange(0, 70000, 1) for _ in range(300000)]

df= pd.DataFrame({'id':idList, 'app_id': appList, 'time': timeList})
print(df.head())
print('idList length:',len(set(idList)))
print('appList length:',len(set(appList)))

Output:

      id app_id  time
0  64365   c789  7366
1  54623   a391  3080
2  58511   m570  9091
3  37657   m108  4707
4   1343   m771   973

idList length: 69062
appList length: 26000

Expected:

For convenience, I use df.head() as the example. The DataFrame below is what I want, and I would like it to be stored as a csr_matrix.

      id   c789  a391  m570  m108  m771
0  64365   7366    0     0     0     0
1  54623      0  3080    0     0     0
2  58511      0    0   9091    0     0
3  37657      0    0     0  4707     0
4   1343      0    0     0     0   973

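As you can see, there are 69062 distinct id values and 26000 distinct app_id values, so I want the result as a csr_matrix with one row per id and one column per app_id.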
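One possible way to get there, offered only as a sketch and not as a confirmed answer: map each id and each app_id to an integer position with pd.factorize, then pass the time values together with those positions to the csr_matrix constructor. The function name to_sparse_usage and the variable names are hypothetical. The sketch assumes that repeated (id, app_id) pairs should have their times added together, which is how csr_matrix treats duplicate coordinates.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def to_sparse_usage(df):
    """Hypothetical helper: build an (ids x app_ids) sparse matrix of time values."""
    # factorize assigns each distinct value an integer code, in order of first appearance
    row_codes, row_labels = pd.factorize(df['id'])
    col_codes, col_labels = pd.factorize(df['app_id'])
    # Coordinate-style constructor; duplicate (row, col) entries are summed
    mat = csr_matrix(
        (df['time'].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    )
    return mat, row_labels, col_labels


# The five rows shown in df.head() above
sample = pd.DataFrame({
    'id': [64365, 54623, 58511, 37657, 1343],
    'app_id': ['c789', 'a391', 'm570', 'm108', 'm771'],
    'time': [7366, 3080, 9091, 4707, 973],
})
mat, ids, apps = to_sparse_usage(sample)
print(mat.shape)      # one row per id, one column per app_id
print(mat.toarray())  # dense view, only sensible for small samples
```

Here ids[i] and apps[j] give the id and app_id that row i and column j stand for, so the labels dropped from the matrix can still be looked up. Calling toarray() on the full data would materialize the dense matrix and is best avoided.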

How to expand each element of a column into its own column and store the result as a sparse matrix

0 answers: