I am trying to process data from three (csv) files, say p, c and f,
loaded into df_p, df_c and df_f respectively:
>>> df_p
p1 p2 p3 p4 p5
2614 104 104 102 102 102
3735 100 103 101 100 104
1450 100 102 100 102 102
>>> df_c
c1 c2 c3 c4 c5
2614 0.338295 0.190882 0.157231 0.135776 0.177816
3735 0.097800 0.124296 0.268475 0.265111 0.244319
1450 0.160922 0.403703 0.122390 0.130612 0.182373
>>> df_f
c
100 0.183946
101 0.290311
102 0.192049
103 0.725704
104 0.143359
ALGO
For each row in df_p, df_c:
1. update each score in df_c row with df_c * df_f[label] where label is from p
2. reorder elements of df_c in descending scores
3. reorder elements in df_p with order from df_c
For example, the first computed cell of df_c would be 0.338295*0.143359.
Here is the code I have; it works, but it is very slow:
import numpy as np

np_p = []
np_c = []
for i in range(len(df_p)):
    # Step 1. Revise scores
    r_conf = df_c.iloc[[i]].values[0]      # scores for row
    r_place_id = df_p.iloc[[i]].values[0]  # labels for row
    # class conf for labels (.loc is used here because .ix was removed from pandas)
    p_c = df_f.loc[r_place_id].c.values
    t_conf = r_conf*p_c                    # total score
    # Step 2. reorder by revised score
    c = np.sort(t_conf)[::-1]
    c_sort = np.argsort(t_conf)[::-1]
    # Step 3. reorder labels with revised score order
    p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values
    np_c.append(c)
    np_p.append(p_sort)
Ideally, I would like to create data frames like df_p and df_c, but with the reordered and revised values (np_p and np_c).
Any ideas on how to make this faster?
Thanks!
Answer 0 (score: 2)
You can use the DataFrame.replace method to replace the values in df_p with the corresponding values from df_f:
In [124]: df_pf = df_p.replace(df_f['c']); df_pf
Out[124]:
p1 p2 p3 p4 p5
2614 0.143359 0.143359 0.192049 0.192049 0.192049
3735 0.183946 0.725704 0.290311 0.183946 0.143359
1450 0.183946 0.192049 0.183946 0.192049 0.192049
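As a cross-check on the lookup above, the same table can be built by mapping each column of df_p through df_f['c']. This is only a sketch using the question's sample data; the name df_pf_alt is introduced here for illustration and is not part of the original answer.

```python
import pandas as pd

df_p = pd.DataFrame({'p1': [104, 100, 100],
                     'p2': [104, 103, 102],
                     'p3': [102, 101, 100],
                     'p4': [102, 100, 102],
                     'p5': [102, 104, 102]}, index=[2614, 3735, 1450])
df_f = pd.DataFrame({'c': [0.183946, 0.290311, 0.192049, 0.725704, 0.143359]},
                    index=list(range(100, 105)))

# Series.map looks each label up in the index of df_f['c'].
# A label missing from df_f would become NaN instead of being left unchanged.
df_pf_alt = df_p.apply(lambda col: col.map(df_f['c']))
print(df_pf_alt)
```

One difference worth knowing: replace leaves unmatched values as they are, while map turns them into NaN, which can make missing labels easier to spot.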
Since Pandas aligns the indices before multiplying two DataFrames, if we strip the p and c from the column labels, we can obtain the desired products using df_pf.mul(df_c):
df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)
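To see why renaming the columns matters, here is a small sketch with made-up frames (a and b are hypothetical, not the question's data). DataFrame.mul pairs cells by row and column label rather than by position, so the storage order of the second frame does not affect the result.

```python
import pandas as pd

a = pd.DataFrame({'1': [2.0, 3.0], '2': [10.0, 20.0]}, index=['x', 'y'])
# Same labels as `a`, but columns and rows are stored in a different order.
b = pd.DataFrame({'2': [0.5, 0.1], '1': [4.0, 1.0]}, index=['y', 'x'])

# Cells are matched by (row label, column label), not by position.
prod = a.mul(b)
print(prod)

# If the column labels differ (as with p1..p5 versus c1..c5), nothing matches
# and every cell of the result is NaN.
mismatch = a.rename(columns={'1': 'p1', '2': 'p2'}).mul(b)
print(mismatch)
```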
The correct order of the columns for each row can be obtained using np.argsort with axis=1 specified. The order array returned by np.argsort can then be used to reorder df_c and df_p:
order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:,None], order]
np_p = df_p.values[np.arange(nrows)[:,None], order]
The above uses NumPy's advanced integer indexing to reorder the values in each row separately.
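On NumPy 1.15 or later, np.take_along_axis can express this kind of row-wise reordering without building the row-index array by hand. The sketch below uses a small made-up array (scores and labels are illustrative names, not the question's data) and checks that both forms agree.

```python
import numpy as np

scores = np.array([[0.2, 0.9, 0.5],
                   [0.7, 0.1, 0.3]])
labels = np.array([[10, 11, 12],
                   [20, 21, 22]])

# Negating the scores makes argsort return a descending order per row.
order = np.argsort(-scores, axis=1)
nrows = scores.shape[0]

# Advanced integer indexing: row i is paired with the column indices order[i].
by_indexing = scores[np.arange(nrows)[:, None], order]
# The same selection, written with take_along_axis.
by_take = np.take_along_axis(scores, order, axis=1)
sorted_labels = np.take_along_axis(labels, order, axis=1)

print(by_take)
print(sorted_labels)
```

Either form applies one shared order array to several arrays, which is what keeps the labels in step with their scores.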
Putting it all together:
import numpy as np
import pandas as pd
df_p = pd.DataFrame({'p1': [104, 100, 100],
                     'p2': [104, 103, 102],
                     'p3': [102, 101, 100],
                     'p4': [102, 100, 102],
                     'p5': [102, 104, 102]}, index=[2614,3735,1450])
df_c = pd.DataFrame({'c1': [0.33829499999999996, 0.097799999999999998, 0.16092200000000001],
                     'c2': [0.190882, 0.124296, 0.40370300000000003],
                     'c3': [0.15723099999999998, 0.26847500000000002, 0.12239000000000001],
                     'c4': [0.13577600000000001, 0.26511099999999999, 0.13061199999999998],
                     'c5': [0.177816, 0.24431900000000001, 0.18237300000000001]}, index=[2614,3735,1450])
df_f = pd.DataFrame({'c': [0.183946,
                           0.29031099999999999,
                           0.192049,
                           0.72570400000000002,
                           0.14335899999999999]}, index=list(range(100,105)))
def using_pandas(df_p, df_c, df_f):
    # this works no matter the order of the columns and rows of `df_p` and `df_c`:
    # it aligns `df_p` and `df_c` based on the numeric part of their column names
    df_pf = df_p.replace(df_f['c'])
    # work on a copy so that renaming the columns does not alter the caller's `df_c`
    df_c = df_c.copy()
    # change the column names to match since Pandas will align the indices before multiplying
    df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
    df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
    df_c = df_pf.mul(df_c)
    # per-row column order that sorts the revised scores in descending order
    order = np.argsort(-df_c.values, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c.values[np.arange(nrows)[:,None], order]
    np_p = df_p.values[np.arange(nrows)[:,None], order]
    return np_c, np_p
np_c, np_p = using_pandas(df_p, df_c, df_f)
print(np_c)
print(np_p)
yields
[[ 0.04849763 0.03414938 0.03019606 0.02736465 0.02607565]
[ 0.0902021 0.07794125 0.04876611 0.03502533 0.01798992]
[ 0.07753076 0.03502455 0.02960096 0.0250839 0.02251315]]
[[104 102 102 104 102]
[103 101 100 104 100]
[102 102 100 102 100]]
Alternatively, if the columns and rows of df_p and df_c are already aligned, you can gain a little more speed by doing the multiplication in NumPy rather than Pandas:
def using_numpy(df_p, df_c, df_f):
    # faster than using_pandas, but assumes `df_p` and `df_c` are already aligned
    df_pf = df_p.replace(df_f['c'])
    df_pf = df_pf.values
    df_c = df_c.values
    df_p = df_p.values
    df_c = df_pf * df_c
    order = np.argsort(-df_c, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c[np.arange(nrows)[:,None], order]
    np_p = df_p[np.arange(nrows)[:,None], order]
    return np_c, np_p
For these small DataFrames, using_numpy is slightly faster than using_pandas.
The difference in speed would be more pronounced if the DataFrames were larger.
But again, note that using_numpy relies on the indices already being aligned.
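The question asked for DataFrames rather than bare arrays. One way to get them, sketched here under the assumption that the original row index should be kept and that rank-based column names are acceptable, is to wrap the reordered arrays. The names factors, scores, rank_cols, df_p_sorted and df_c_sorted are made up for this illustration, and the frames are assumed to be aligned by position as in using_numpy.

```python
import numpy as np
import pandas as pd

index = [2614, 3735, 1450]
df_p = pd.DataFrame([[104, 104, 102, 102, 102],
                     [100, 103, 101, 100, 104],
                     [100, 102, 100, 102, 102]],
                    index=index, columns=['p1', 'p2', 'p3', 'p4', 'p5'])
df_c = pd.DataFrame([[0.338295, 0.190882, 0.157231, 0.135776, 0.177816],
                     [0.097800, 0.124296, 0.268475, 0.265111, 0.244319],
                     [0.160922, 0.403703, 0.122390, 0.130612, 0.182373]],
                    index=index, columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df_f = pd.DataFrame({'c': [0.183946, 0.290311, 0.192049, 0.725704, 0.143359]},
                    index=list(range(100, 105)))

# Look up the factor for every label, then multiply position by position.
factors = df_p.apply(lambda col: col.map(df_f['c'])).values
scores = factors * df_c.values

# Reorder each row by descending revised score.
order = np.argsort(-scores, axis=1)
rows = np.arange(scores.shape[0])[:, None]
np_c = scores[rows, order]
np_p = df_p.values[rows, order]

# Column k now means "k-th best", so name the columns by rank.
rank_cols = ['rank%d' % k for k in range(1, np_c.shape[1] + 1)]
df_c_sorted = pd.DataFrame(np_c, index=df_p.index, columns=rank_cols)
df_p_sorted = pd.DataFrame(np_p, index=df_p.index, columns=rank_cols)
print(df_p_sorted)
print(df_c_sorted)
```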
Answer 1 (score: 0)
Try this: first create a dictionary from df_f:
di = df_f['c'].to_dict()
{100: 0.183946,
101: 0.29031099999999999,
102: 0.192049,
103: 0.72570400000000002,
104: 0.14335899999999999}
Then map it onto df_p:
df_p.replace(di)
# p1 p2 p3 p4 p5
# 2614 0.143359 0.143359 0.192049 0.192049 0.192049
# 3735 0.183946 0.725704 0.290311 0.183946 0.143359
# 1450 0.183946 0.192049 0.183946 0.192049 0.192049
Then carry out the multiplication with df_c.
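The multiplication step itself is not shown in this answer. A minimal sketch of it, assuming df_p and df_c have their rows and columns in the same order so that positions can be multiplied directly, might look like this; revised is a name introduced here, not taken from the answer.

```python
import pandas as pd

index = [2614, 3735, 1450]
df_p = pd.DataFrame([[104, 104, 102, 102, 102],
                     [100, 103, 101, 100, 104],
                     [100, 102, 100, 102, 102]],
                    index=index, columns=['p1', 'p2', 'p3', 'p4', 'p5'])
df_c = pd.DataFrame([[0.338295, 0.190882, 0.157231, 0.135776, 0.177816],
                     [0.097800, 0.124296, 0.268475, 0.265111, 0.244319],
                     [0.160922, 0.403703, 0.122390, 0.130612, 0.182373]],
                    index=index, columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df_f = pd.DataFrame({'c': [0.183946, 0.290311, 0.192049, 0.725704, 0.143359]},
                    index=list(range(100, 105)))

di = df_f['c'].to_dict()
# .values drops the p1..p5 / c1..c5 labels, so the product is taken by position
# (assumption: both frames list rows and columns in the same order).
revised = pd.DataFrame(df_p.replace(di).values * df_c.values,
                       index=df_c.index, columns=df_c.columns)
print(revised)
```

The per-row reordering would still need to be applied to revised afterwards, for example with np.argsort along axis=1.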