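Suppose I have a custom class in Python with an attribute val. If I have a pandas DataFrame with a column of these objects, how can I access this attribute and create a new column from its value?
Sample data: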
df
Out[46]:
row custom_object
1 foo1
2 foo2
3 foo3
4 foo4
Name: book, dtype: object
The custom objects are instances of the class Foo:
class Foo:
    def __init__(self, val):
        self.val = val
The only way I know to create a new column from an instance attribute is to combine apply with a lambda, which is slow on large datasets:
df['custom_val'] = df['custom_object'].apply(lambda x: x.val)
Is there a more efficient way?
Answer 0 (score: 0)
You can use a list comprehension:
df['custom_val'] = [foo.val for foo in df['custom_object']]
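As a small self-contained check, the sketch below builds a four-row frame like the one in the question and confirms that the list comprehension gives the same values as apply. The concrete values 1 to 4 and the index labels are assumptions made for illustration; they are not from the original post.

```python
import pandas as pd


class Foo:
    def __init__(self, val):
        self.val = val


# Hypothetical sample data: four Foo objects indexed 1..4, as in the question.
df = pd.DataFrame(
    {"custom_object": [Foo(1), Foo(2), Foo(3), Foo(4)]},
    index=[1, 2, 3, 4],
)

# A list assigned to a column is matched to rows by position, so it lines up
# with the rows even though the index does not start at 0.
df["custom_val"] = [foo.val for foo in df["custom_object"]]

# The apply-based version from the question, kept for comparison.
via_apply = df["custom_object"].apply(lambda x: x.val)

print(df["custom_val"].tolist())
```

Because the list is assigned positionally, its length must equal the number of rows in the frame.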
Timings
import numpy as np
import pandas as pd
# Set up 100k Foo objects (Foo is the class defined in the question).
vals = [np.random.randn() for _ in range(100000)]
foos = [Foo(val) for val in vals]
df = pd.DataFrame(foos, columns=['custom_object'])
# 1) OP's apply method.
%timeit df['custom_object'].apply(lambda x: x.val)
# 10 loops, best of 3: 26.7 ms per loop
# 2) Using a list comprehension instead.
%timeit [foo.val for foo in df['custom_object']]
# 100 loops, best of 3: 11.7 ms per loop
# 3) For reference with the original list of objects (slightly faster than 2) above).
%timeit [foo.val for foo in foos]
# 100 loops, best of 3: 9.79 ms per loop
# 4) And just on the original list of raw values themselves.
%timeit [val for val in vals]
# 100 loops, best of 3: 4.91 ms per loop
If you still have the original list of values, you can assign it directly:
# 5) Direct assignment to list of values.
%timeit df['v'] = vals
# 100 loops, best of 3: 5.88 ms per loop
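%timeit is an IPython magic, so the measurements above cannot be run as a plain script. A rough equivalent using the standard timeit module might look like the sketch below. The object count (10,000), the repeat settings, and the helper names via_apply and via_comprehension are assumptions chosen to keep the run short. Absolute numbers will differ by machine and pandas version.

```python
import random
import timeit

import pandas as pd


class Foo:
    def __init__(self, val):
        self.val = val


# Smaller than the 100k objects used above, purely to keep the run short.
foos = [Foo(random.random()) for _ in range(10_000)]
df = pd.DataFrame({"custom_object": foos})


def via_apply():
    # The approach from the question.
    return df["custom_object"].apply(lambda x: x.val)


def via_comprehension():
    # The list comprehension from this answer.
    return [foo.val for foo in df["custom_object"]]


# Best of 3 repeats, 5 calls per repeat, similar in spirit to %timeit.
t_apply = min(timeit.repeat(via_apply, number=5, repeat=3))
t_comp = min(timeit.repeat(via_comprehension, number=5, repeat=3))
print(f"apply: {t_apply:.4f} s, comprehension: {t_comp:.4f} s (5 calls each)")
```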
Answer 1 (score: 0)
Setup code:
import operator
import random
from dataclasses import dataclass
import numpy as np
import pandas as pd
@dataclass
class SomeObj:
    val: int
df = pd.DataFrame(data={"col_1": [SomeObj(random.randint(0, 10000)) for _ in range(10000000)]})
df['col_1'].map(lambda elem: elem.val)
Time: ~3.2 s
df['col_1'].map(operator.attrgetter('val'))
Time: ~2.7 s
[elem.val for elem in df['col_1']]
Time: ~1.4 s
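Note: the list comprehension returns a plain Python list rather than a pandas Series, so its result type differs from the two map variants. This can be a problem in some cases.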
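To make the result-type difference concrete, here is a minimal sketch on a three-row frame. The values and the string index are assumptions for illustration only. Wrapping the list in pd.Series with the frame's index is one possible way to get a Series back; it is not something the answer itself prescribes.

```python
import operator
from dataclasses import dataclass

import pandas as pd


@dataclass
class SomeObj:
    val: int


# Hypothetical tiny frame with a non-default index.
df = pd.DataFrame(
    {"col_1": [SomeObj(v) for v in (5, 7, 9)]},
    index=["a", "b", "c"],
)

# map returns a pandas Series that keeps the original index.
mapped = df["col_1"].map(operator.attrgetter("val"))

# The list comprehension returns a plain list with no index attached.
as_list = [elem.val for elem in df["col_1"]]

# If a Series is needed, the list can be wrapped and given the original index.
as_series = pd.Series(as_list, index=df.index, name="col_1")

print(type(mapped).__name__, type(as_list).__name__, type(as_series).__name__)
```

When the list is only assigned back as a new column of the same frame, the missing index usually does not matter, since column assignment from a list is positional.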