Question

I have two datasets:

    df1 = pd.DataFrame ({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
    df2 = pd.DataFrame ({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})

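I want to fill in the price from the sales price and the regular price under these conditions: if the skuid in df1 matches the skuid in df2 and df2's salesprice is non-zero, use salesprice as the price. If the skuid matches and df2's salesprice is zero, use regularprice. Otherwise, use zero as the price.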

def pric(df1,df2):
if (df1['skuid'] == df2['skuid'] and salesprice !=0): 
 price = salesprice 
elif (df1['skuid'] == df2['skuid'] and regularprice !=0):
 price = regularprice
else:
 price = 0

I wrote a function along these lines, but it does not work. The result should look like this in df1:

skuid  price
  A      10
  B      10
  C      0
  D      30

Thanks.

Answer 1

So, there are a number of problems with the function above. Here are some, in no particular order:

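Indentation matters in Python: https://docs.python.org/2.0/ref/indentation.html
Vectorized functions versus loops. The function you gave looks vaguely like it is meant to be applied in a vectorized way, but a plain Python `if` does not work like that. You need to loop over the rows you want to look at (https://wiki.python.org/moin/ForLoop). Column-wise transformations that work without a loop are supported, but they have to be called explicitly (here is the documentation for one such function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
Relatedly, accessing and indexing data frame elements: see "Indexing Pandas data frames: integer rows, named columns".
Return values: if you want a Python function to give you a result, have it return that result. Not every programming language requires this (Julia, for example), but a Python function without a return statement returns None.
Generality. This is not strictly necessary in a one-off application, but the function breaks easily if the column names in the data frame change. For that reason, and for plain flexibility, it is better to let the caller pass in the relevant column names.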
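
To make the vectorization point concrete, here is a minimal sketch (using the data frames from the question) of what the comparison in the original `if` actually produces. The variable names `matches` and `raised` are just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "price": (0, 0, 0, 0)})
df2 = pd.DataFrame({"skuid": ("A", "B", "C", "D"),
                    "salesprice": (10, 0, 0, 30),
                    "regularprice": (9, 10, 0, 2)})

# Comparing two columns yields a boolean Series with one value per row,
# not a single True/False.
matches = df1["skuid"] == df2["skuid"]
print(matches.tolist())

# Using that Series directly as an `if` condition is ambiguous,
# so pandas raises a ValueError instead of guessing.
try:
    if matches:
        pass
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

This is why the condition has to be evaluated either one element at a time inside a loop, or with operations that are explicitly element-wise.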

Here is a version of your function, modified more or less minimally to address the specific issues above:

import pandas as pd

df1 = pd.DataFrame({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})


def pric(df1, df2, id_colname, df1_price_colname, df2_salesprice_colname, df2_regularprice_colname):
    # Note: df1 is modified in place and is also returned.
    for i in range(df1.shape[0]):          # every row of df1 ...
        for j in range(df2.shape[0]):      # ... is compared against every row of df2
            # Matching id and a non-zero sales price: use the sales price.
            if (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_salesprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_salesprice_colname]
                break
            # Matching id and a non-zero regular price: fall back to the regular price.
            elif (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_regularprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_regularprice_colname]
                break
    return df1

When called like this:


df1_imputed=pric(df1,df2,'skuid','price','salesprice','regularprice')
print(df1_imputed['price'])

it gives

0    10
1    10
2     0
3    30
Name: price, dtype: int64

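Note how the function loops over the row indices first, and only then checks the equality condition on a specific element identified by a row-index/column pair.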
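
As a small illustration of addressing a single element by a row-index/column pair, here is a sketch on a hypothetical frame whose index labels are deliberately not 0, 1, 2. The frame `df` and the value 99 are made up for this example.

```python
import pandas as pd

# Hypothetical frame with non-default index labels.
df = pd.DataFrame({"skuid": ["A", "B", "C"], "price": [0, 0, 0]},
                  index=[10, 20, 30])

i = 1
label = df.index[i]            # position 1 maps to the index label 20
df.loc[label, "price"] = 99    # .loc addresses the cell by (row label, column name)

# The same cell, addressed purely by position with .iloc.
by_position = df.iloc[i, df.columns.get_loc("price")]
print(label, df.loc[label, "price"], by_position)
```

Going through `df.index[i]` lets the loop count rows by position while `.loc` still looks them up by label, so the code keeps working when the index is not a plain 0..n-1 range.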

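A few things to think about:

Why does the loop over df1 sit "above" the loop over df2? Relatedly, what does the break statement do?
Why is the else branch omitted?
What exactly does "df1.loc[df1.index[i], id_colname]" mean? (Hint: check one of the links above.)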
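
For comparison, the same result can be obtained without explicit loops. The following is only a sketch of one possible vectorized approach, not part of the function above: `fill_price` and its default column names are hypothetical, and it assumes each skuid appears at most once in the second frame.

```python
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A", "B", "C", "D"), "price": (0, 0, 0, 0)})
df2 = pd.DataFrame({"skuid": ("A", "B", "C", "D"),
                    "salesprice": (10, 0, 0, 30),
                    "regularprice": (9, 10, 0, 2)})


def fill_price(left, right, id_col="skuid", price_col="price",
               sales_col="salesprice", regular_col="regularprice"):
    # Assumption: ids in `right` are unique, so the left merge keeps
    # exactly one row per row of `left`, in the same order.
    merged = left[[id_col]].merge(
        right[[id_col, sales_col, regular_col]], on=id_col, how="left")
    # Ids with no match get NaN after the merge; treat those as 0.
    sales = merged[sales_col].fillna(0)
    regular = merged[regular_col].fillna(0)
    # Keep the sales price where it is non-zero, otherwise use the
    # regular price (which may itself be 0).
    price = sales.where(sales != 0, regular)
    result = left.copy()  # leave the caller's frame untouched
    result[price_col] = price.to_numpy()
    return result


print(fill_price(df1, df2))
```

Unlike the loop version, this sketch returns a new frame rather than modifying df1 in place.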
