I have about 11 million rows with 21 columns, like this:
area_id_number, c000, c001, c002 ...
01293091302390, 2, 2, 0 ...
01293091302391, 2, 0, 0 ...
01293091302392, 3, 1, 1 ...
I want to end up with something like this:
value_id, area_id_number, value_type
1, 01293091302390, c000
2, 01293091302390, c000
3, 01293091302390, c001
4, 01293091302390, c001
5, 01293091302391, c000
6, 01293091302391, c000
7, 01293091302392, c000
8, 01293091302392, c000
9, 01293091302392, c000
10, 01293091302392, c001
11, 01293091302392, c002
...
I haven't found a way to do this yet. I've searched for unpack / pivot / deaggregate (and couldn't find a suitable solution under those terms...).
The second part of this question: will I run into memory problems? What should I keep in mind for efficiency? I should end up with about 140 million rows.
Answer 0 (score: 1)
The core of the process is ndarray.repeat(). I don't have enough memory to test with 11M rows, but here is the code:
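As a minimal sketch of the repeat() mechanism the answer relies on: each element of an array is repeated according to a matching per-element count, which is exactly how the counts below are expanded into one row per occurrence.

```python
import numpy as np

# Each element of `values` is repeated by the matching count;
# a count of 0 drops the element entirely.
values = np.array(["a", "b", "c"])
counts = np.array([2, 0, 3])
print(values.repeat(counts))  # → ['a' 'a' 'c' 'c' 'c']
```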
First, create the test data:
import numpy as np
import pandas as pd

# Create sample data: scatter ones, twos and threes at random
# positions in a counts matrix (later writes overwrite earlier
# ones, so the actual totals are somewhat lower than requested).
nrows = 500000
ncols = 21
nones = int(70e6)
ntwos = int(20e6)
nthrees = int(10e6)
rint = np.random.randint
counts = np.zeros((nrows, ncols), dtype=np.int8)
counts[rint(0, nrows, nones), rint(0, ncols, nones)] = 1
counts[rint(0, nrows, ntwos), rint(0, ncols, ntwos)] = 2
counts[rint(0, nrows, nthrees), rint(0, ncols, nthrees)] = 3
columns = ["c%03d" % i for i in range(ncols)]
index = ["%014d" % i for i in range(nrows)]
df = pd.DataFrame(counts, index=index, columns=columns)
Here is the processing code:
# Row and column indices of every nonzero cell
idx, col = np.where(df.values)
# The count stored in each of those cells
n = df.values[idx, col]
# Repeat each row label and column label by its count
idx2 = df.index.values[idx.repeat(n)]
col2 = df.columns.values[col.repeat(n)]
df2 = pd.DataFrame({"id": idx2, "type": col2})
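To check the approach, the same steps can be run end-to-end on the small sample from the question (a sketch with just three count columns instead of 21); a 1-based value_id column is added with a plain range, matching the desired output:

```python
import numpy as np
import pandas as pd

# The three sample rows from the question
df = pd.DataFrame(
    {"c000": [2, 2, 3], "c001": [2, 0, 1], "c002": [0, 0, 1]},
    index=["01293091302390", "01293091302391", "01293091302392"],
)

# Nonzero cells and their counts
idx, col = np.where(df.values)
n = df.values[idx, col]

# One output row per counted occurrence
df2 = pd.DataFrame({
    "area_id_number": df.index.values[idx.repeat(n)],
    "value_type": df.columns.values[col.repeat(n)],
})
# 1-based running id, as in the desired output
df2.insert(0, "value_id", np.arange(1, len(df2) + 1))
print(df2)  # 11 rows, matching the expected table in the question
```

On the memory question: the output holds one row per counted unit, so at ~140M rows the two object (string) columns will dominate memory; converting value_type to a pandas categorical and keeping area_id_number as an integer, if the leading zeros can be reconstructed, would shrink it considerably.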