Say we build a DataFrame:
import pandas as pd
import random as randy
import numpy as np
df_size = int(1e6)
# np.repeat on a mixed list yields a string array, so missing entries end up as the string 'nan'
df = pd.DataFrame({'first': randy.sample(list(np.repeat([np.nan,'Cat','Dog','Bear','Fish'],df_size)),df_size),
                   'second': randy.sample(list(np.repeat([np.nan,np.nan,'Cat','Dog'],df_size)),df_size),
                   'value': range(df_size)},
                  index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',periods=int(1e6),freq='us')),df_size)).sort_index()
It looks like this:
first second value
2013-02-01 09:00:00 Fish Cat 95409
2013-02-01 09:00:00.000001 Dog Dog 323089
2013-02-01 09:00:00.000002 Fish Cat 785925
2013-02-01 09:00:00.000003 Dog Cat 866171
2013-02-01 09:00:00.000004 nan nan 665702
2013-02-01 09:00:00.000005 Cat nan 104257
2013-02-01 09:00:00.000006 nan nan 152926
2013-02-01 09:00:00.000007 Bear Cat 707747
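What I want is this: for each value in the 'second' column, the most recent 'value' recorded for that same symbol in the 'first' column.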
first second value new_value
2013-02-01 09:00:00 Fish Cat 95409 NaN
2013-02-01 09:00:00.000001 Dog Dog 323089 323089
2013-02-01 09:00:00.000002 Fish Cat 785925 NaN
2013-02-01 09:00:00.000003 Dog Cat 866171 NaN
2013-02-01 09:00:00.000004 nan nan 665702 NaN
2013-02-01 09:00:00.000005 Cat nan 104257 NaN
2013-02-01 09:00:00.000006 nan nan 152926 NaN
2013-02-01 09:00:00.000007 Bear Cat 707747 104257
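Perhaps this is not the best possible example, but look at the bottom row: where 'second' is 'Cat', I want the most recent value for which 'first' was 'Cat'.
The real data set has over 1,000 categories, so looping over the symbols and doing an asof() for each seems too expensive. I have never had any luck passing strings into Cython, but I suppose mapping the symbols to integers and doing a brute-force loop would work. I am hoping for something more Pythonic (that is still reasonably fast).
For reference, a somewhat fragile Cython hack would be: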
%%cython
import numpy as np
cimport cython
cimport numpy as np
ctypedef np.double_t DTYPE_t
def last_of(np.ndarray[DTYPE_t, ndim=1] some_values, np.ndarray[np.int64_t, ndim=1] first_sym, np.ndarray[np.int64_t, ndim=1] second_sym):
    cdef long val_len = some_values.shape[0], sym1_len = first_sym.shape[0], sym2_len = second_sym.shape[0], i = 0
    assert sym1_len == sym2_len
    assert val_len == sym1_len
    cdef int enum_space_size = max(first_sym) + 1
    # last_values[k] holds the most recent value seen so far for symbol code k in 'first'
    cdef np.ndarray[DTYPE_t, ndim=1] last_values = np.full(enum_space_size, np.nan)
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.full(val_len, np.nan)
    for i in range(val_len):
        if first_sym[i] >= 0:
            last_values[first_sym[i]] = some_values[i]
        # negative codes mean "missing"; codes beyond the table were never seen in 'first'
        if second_sym[i] < 0 or second_sym[i] >= enum_space_size:
            res[i] = np.nan
        else:
            res[i] = last_values[second_sym[i]]
    return res
Then some dictionary-replacement nonsense:
syms = np.unique(df['first'].values)
enum_dict = dict(zip(syms, range(len(syms))))
enum_dict['nan'] = -1  # the string 'nan' marks a missing symbol
df['enum_first'] = df['first'].replace(enum_dict)
df['enum_second'] = df['second'].replace(enum_dict)
df['last_value'] = last_of(df.value.values*1.0, df.enum_first.values.astype(np.int64), df.enum_second.values.astype(np.int64))
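The catch is that if the 'second' column contains any value that is not in the 'first' column, this breaks. (I am not sure of a quick way to fix that, for example if you added 'Donkey' to 'second' there.)
For the whole mess, the silly Cython version takes about 21 seconds per 10 million rows, but only ~2 seconds of that is the Cython part. (It could probably be made a fair amount faster.)
@HYRY - I think that is a pretty solid solution; on a DataFrame with 10 million rows it takes about 30 seconds on my laptop.
Given that I do not know a simple way to handle entries in the second list that are missing from the first, other than a fairly expensive isin, I think HYRY's Python version is very good.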
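One possible way around the "value in 'second' but not in 'first'" problem is to build the integer codes from both columns at once, so that every symbol gets a code no matter which column it appears in. The sketch below assumes pd.factorize is acceptable for the encoding; encode_symbols and last_of_py are hypothetical names, and the plain Python loop only stands in for the Cython last_of so the example runs without a compiler.

```python
import numpy as np
import pandas as pd

def encode_symbols(first, second):
    """Map both columns onto one shared integer code space (hypothetical helper)."""
    combined = pd.concat([first, second], ignore_index=True)
    # Treat the string 'nan' as missing; real NaN values stay NaN.
    combined = combined.where(combined != 'nan')
    # factorize assigns -1 to missing values and 0..k-1 to the distinct symbols.
    codes, uniques = pd.factorize(combined)
    n = len(first)
    return codes[:n], codes[n:], uniques

def last_of_py(values, first_codes, second_codes, n_symbols):
    """Pure-Python stand-in for the Cython loop, sized for all symbols."""
    last_values = np.full(n_symbols, np.nan)
    res = np.full(len(values), np.nan)
    for i in range(len(values)):
        if first_codes[i] >= 0:
            last_values[first_codes[i]] = values[i]
        if second_codes[i] >= 0:
            res[i] = last_values[second_codes[i]]
    return res

# 'Donkey' occurs only in 'second'; it still gets a code and simply maps to NaN.
first = pd.Series(['Cat', 'Dog', 'nan'])
second = pd.Series(['Donkey', 'Cat', 'Dog'])
first_codes, second_codes, uniques = encode_symbols(first, second)
result = last_of_py(np.array([1.0, 2.0, 3.0]), first_codes, second_codes, len(uniques))
```

Because the lookup table is sized from the combined set of symbols, no separate isin check is needed; a symbol never seen in 'first' just keeps its initial NaN.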
Answer 0 (score: 3)
How about using a dict to keep the last value seen for each category, and walking through all the rows of the DataFrame:
import pandas as pd
import random as randy
import numpy as np
randy.seed(1)  # the sampling below uses the random module, so seed that one
df_size = int(1e2)
df = pd.DataFrame({'first': randy.sample(list(np.repeat([None,'Cat','Dog','Bear','Fish'],df_size)),df_size),
                   'second': randy.sample(list(np.repeat([None,None,'Cat','Dog'],df_size)),df_size),
                   'value': range(df_size)},
                  index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',periods=int(1e6),freq='us')),df_size)).sort_index()
last_values = {}  # most recent 'value' seen for each symbol in 'first'
new_values = []
for row in df.itertuples():
    t, f, s, v = row
    last_values[f] = v
    if s is None:
        new_values.append(None)
    else:
        new_values.append(last_values.get(s, None))
df["new_value"] = new_values
The result looks like this (the exact rows depend on the random draw):
first second value new_value
2013-02-01 09:00:00.010373 Cat None 87 None
2013-02-01 09:00:00.013015 Cat Dog 69 None
2013-02-01 09:00:00.024910 Fish Cat 1 69
2013-02-01 09:00:00.025943 Cat None 98 None
2013-02-01 09:00:00.041318 Fish Dog 66 None
2013-02-01 09:00:00.057894 None None 36 None
2013-02-01 09:00:00.059678 None None 50 None
2013-02-01 09:00:00.067228 Bear None 38 None
2013-02-01 09:00:00.095867 Bear Cat 84 98
2013-02-01 09:00:00.096867 Dog Cat 97 98
2013-02-01 09:00:00.101540 Dog Dog 76 76
2013-02-01 09:00:00.106753 Dog None 22 None
2013-02-01 09:00:00.138936 None None 8 None
2013-02-01 09:00:00.139273 Bear Cat 2 98
2013-02-01 09:00:00.143180 Fish None 94 None
2013-02-01 09:00:00.184757 None Cat 73 98
2013-02-01 09:00:00.193063 None None 5 None
2013-02-01 09:00:00.231056 Fish Cat 62 98
2013-02-01 09:00:00.237658 None None 64 None
2013-02-01 09:00:00.240178 Bear Dog 80 22
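To check the dict approach against the small table from the question, here is a minimal sketch that wraps the same loop in a function and runs it on those eight rows. The function name last_first_value and the use of real NaN (rather than None or the string 'nan') for missing symbols are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def last_first_value(df):
    """For each row, return the latest 'value' whose 'first' equals this row's 'second'."""
    last_values = {}
    out = []
    for f, s, v in zip(df['first'], df['second'], df['value']):
        if pd.notna(f):
            last_values[f] = v  # update before the lookup so a same-row match counts
        out.append(last_values.get(s, np.nan) if pd.notna(s) else np.nan)
    return pd.Series(out, index=df.index, dtype=float)

# The eight sample rows from the question, with real NaN for missing symbols.
sample = pd.DataFrame(
    {'first': ['Fish', 'Dog', 'Fish', 'Dog', np.nan, 'Cat', np.nan, 'Bear'],
     'second': ['Cat', 'Dog', 'Cat', 'Cat', np.nan, np.nan, np.nan, 'Cat'],
     'value': [95409, 323089, 785925, 866171, 665702, 104257, 152926, 707747]},
    index=pd.date_range('2013-02-01 09:00:00', periods=8, freq='us'))
sample['new_value'] = last_first_value(sample)
```

Only the second row (323089) and the last row (104257) receive a value, which matches the desired output shown in the question.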
Answer 1 (score: 0)
An old question, I know, but here is a solution that avoids any Python loops.
The first step is to get a time series of 'value' for each category. You can do this by unstacking:
first_values = df.dropna(subset=['first']).set_index('first', append=True).value.unstack()
second_values = df.dropna(subset=['second']).set_index('second', append=True).value.unstack()
Note that this only works if the columns contain real NaN values rather than 'nan' strings (do df = df.replace('nan', np.nan) first if necessary).
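You can then get the last 'first' value by forward-filling first_values, reindexing it like second_values, stacking again, and indexing into the result with the original ('time', 'second') pairs: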
ix = pd.MultiIndex.from_arrays([df.index, df.second])
new_value = first_values.ffill().reindex_like(second_values).stack().reindex(ix)
df['new_value'] = new_value.values
In [1649]: df
Out[1649]:
first second value new_value
2013-02-01 09:00:00.000000 Fish Cat 95409 NaN
2013-02-01 09:00:00.000001 Dog Dog 323089 323089
2013-02-01 09:00:00.000002 Fish Cat 785925 NaN
2013-02-01 09:00:00.000003 Dog Cat 866171 NaN
2013-02-01 09:00:00.000004 NaN NaN 665702 NaN
2013-02-01 09:00:00.000005 Cat NaN 104257 NaN
2013-02-01 09:00:00.000006 NaN NaN 152926 NaN
2013-02-01 09:00:00.000007 Bear Cat 707747 104257
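Another loop-free route, which is not what this answer uses, is pd.merge_asof with per-symbol grouping. The sketch below assumes the index is sorted, that missing symbols are real NaN, and that an exact timestamp match should count (the default); last_first_value_asof is a hypothetical name. Rows with a missing 'second' are left out of the merge and filled with NaN afterwards.

```python
import numpy as np
import pandas as pd

def last_first_value_asof(df):
    """As-of join: match each row's 'second' to the latest row with the same 'first'."""
    mask = df['second'].notna().to_numpy()
    # Left side: rows that ask for a symbol; right side: rows that supply one.
    left = df.loc[mask, ['second']].rename_axis('time').reset_index()
    right = (df.loc[df['first'].notna(), ['first', 'value']]
               .rename_axis('time').reset_index())
    merged = pd.merge_asof(left, right, on='time',
                           left_by='second', right_by='first')
    # merge_asof keeps the left row order, so results line up with the mask.
    result = np.full(len(df), np.nan)
    result[mask] = merged['value'].to_numpy(dtype=float)
    return pd.Series(result, index=df.index)

# The eight sample rows from the question, with real NaN for missing symbols.
sample = pd.DataFrame(
    {'first': ['Fish', 'Dog', 'Fish', 'Dog', np.nan, 'Cat', np.nan, 'Bear'],
     'second': ['Cat', 'Dog', 'Cat', 'Cat', np.nan, np.nan, np.nan, 'Cat'],
     'value': [95409, 323089, 785925, 866171, 665702, 104257, 152926, 707747]},
    index=pd.date_range('2013-02-01 09:00:00', periods=8, freq='us'))
sample['new_value'] = last_first_value_asof(sample)
```

A symbol that appears in 'second' but never in 'first' simply finds no match and stays NaN, so this variant does not need the integer encoding step at all.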