Question

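I have a very large DataFrame in which one column (COL) contains ranges of values (i.e. lists). I want to turn this COL into separate columns, one per number, where each column holds 1 if that number is in COL and 0 otherwise.

Below is my current approach. However, it is slow because both the number of observations and MAX_VALUE are high.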

import pandas as pd
import numpy as np

OBSERVATIONS = 100000 # number of values 600000
MAX_VALUE = 400 # 400

_ = pd.DataFrame({
    'a':np.random.randint(2,20,OBSERVATIONS),
    'b':np.random.randint(30,MAX_VALUE,OBSERVATIONS)
})


_['res'] = _.apply(lambda x: range(x['a'],x['b']),axis=1)

for i in range(MAX_VALUE):
    _[f'{i}'] = _['res'].apply(lambda x: 1 if i in x else 0)
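
To make the desired result concrete, here is a minimal sketch on a two-row frame. The tiny values and the name `df` are illustrative assumptions, not part of the real data; the membership test `a <= i < b` is equivalent to `i in range(a, b)` for integers.

```python
import pandas as pd

MAX_VALUE = 6  # illustrative; the real data uses 400

# Two hand-picked rows: range(2, 5) and range(3, 4)
df = pd.DataFrame({'a': [2, 3], 'b': [5, 4]})

# One indicator column per number, 1 if the number lies in [a, b)
for i in range(MAX_VALUE):
    df[str(i)] = [1 if a <= i < b else 0 for a, b in zip(df['a'], df['b'])]

# Indicator columns '0'..'5':
#   row 0 (a=2, b=5): 0 0 1 1 1 0
#   row 1 (a=3, b=4): 0 0 0 1 0 0
print(df)
```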

Answer 1

You can try doing the computation in numpy and then inserting the numpy array into the DataFrame. This is about 5 times faster:

import pandas as pd
import numpy as np
import time

OBSERVATIONS = 100_000 # number of values 600000
MAX_VALUE = 400 # 400

_ = pd.DataFrame({
    'a':np.random.randint(2,20,OBSERVATIONS),
    'b':np.random.randint(30,MAX_VALUE,OBSERVATIONS)
})
_['res'] = _.apply(lambda x: range(x['a'],x['b']),axis=1)

res1 = _.copy()

start = time.time()
for i in range(MAX_VALUE):
    res1[f'{i}'] = res1['res'].apply(lambda x: 1 if i in x else 0)
print(f'original: {time.time() - start}')

start = time.time()
z = np.zeros((len(_), MAX_VALUE), dtype=np.int64)
for i,r in enumerate(_.res):
    z[i,range(r.start,r.stop)]=1
res2 = pd.concat([_, pd.DataFrame(z)], axis=1)
res2.columns = list(map(str, res2.columns))
print(f'new     : {time.time() - start}')

assert res1.equals(res2)

Output:

original: 23.649751663208008
new     : 4.586429595947266
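
Since every range here is fully described by its `a` and `b` endpoints, the remaining per-row Python loop could probably be avoided as well by using numpy broadcasting. The following is a sketch under that assumption (it would not apply if COL held arbitrary lists rather than contiguous ranges). The names `df`, `mask` and `indicators`, the smaller `OBSERVATIONS`, and the `int8` dtype are choices made for this illustration; it has not been timed against the versions above.

```python
import numpy as np
import pandas as pd

OBSERVATIONS = 1_000  # kept small for illustration
MAX_VALUE = 400

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'a': rng.integers(2, 20, OBSERVATIONS),
    'b': rng.integers(30, MAX_VALUE, OBSERVATIONS),
})

# Column vector of starts/stops, shape (OBSERVATIONS, 1)
a = df['a'].to_numpy()[:, None]
b = df['b'].to_numpy()[:, None]

# Row vector of candidate numbers, shape (MAX_VALUE,)
numbers = np.arange(MAX_VALUE)

# Broadcasting yields an (OBSERVATIONS, MAX_VALUE) boolean matrix:
# True where a <= number < b, i.e. where number is in range(a, b)
mask = (numbers >= a) & (numbers < b)

# int8 keeps memory low; use np.int64 to match the dtype used above
indicators = pd.DataFrame(
    mask.astype(np.int8),
    index=df.index,
    columns=[str(n) for n in numbers],
)
result = pd.concat([df, indicators], axis=1)
```

Note that the full matrix is held in memory at once, so with 600000 rows and 400 columns a compact dtype matters.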
