Creating new variable using information from multiple lines in dataframe

Time: 2019-01-18 18:40:21

Tags: python pandas dataframe

I have a dataframe which looks like this:

import pandas as pd

df = pd.DataFrame({
    "HouseholdNumber": [1, 1, 1, 1, 1, 2, 2],
    "TypeOfPerson": ["Son", "Daughter", "Daughter", "Parent", "Parent", "Daughter", "Parent"],
    "Age": [17, 10, 20, 52, 45, 22, 50],
})
print(df)
   HouseholdNumber TypeOfPerson  Age
0                1          Son   17   
1                1     Daughter   10   
2                1     Daughter   20  
3                1       Parent   52     
4                1       Parent   45    
5                2     Daughter   22    
6                2       Parent   50      

and I want to create a new variable using information from multiple rows. I'm stuck because a simple df.loc (or np.where) condition doesn't get me there. Specifically, the new variable should be no if the person is not a parent or has no child in either age group, a if the parent has a child who is 18 years old or younger, and b if the parent has a child who is between 19 and 25 years old. If the parents have children in both age groups, the value should still be a. HouseholdNumber identifies the family, so all the conditions should be evaluated per household. The dataframe should end up looking like this:

   HouseholdNumber TypeOfPerson  Age Child
0                1          Son   17    no
1                1     Daughter   10    no
2                1     Daughter   20    no
3                1       Parent   52     a
4                1       Parent   45     a
5                2     Daughter   22    no
6                2       Parent   50     b 

The code I'm trying is

df["Child"]=""
for i in df["HouseholdNumber"].unique():
    if (df.loc[df.TypeOfPerson.isin(["Son", "Daughter"]) & (df.Age <= 18)]):
       if (df.loc[(df.TypeOfPerson == "Parent")]):
           df["Child"] = "a"
    elif (df.loc[df.TypeOfPerson.isin(["Son", "Daughter"]) & ((df.Age >= 19) & (df.Age <= 26))]):
       df["Child"] = "b"
    else:
        df["Child"] = "no"

which gives me the error "The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." I'm not sure where to go from here; I always get this error. Even without the error, I suspect my code would not give the desired result.

2 answers:

Answer 0: (score: 1)

The error here is that you are indexing df.loc with a list of indices (a boolean mask), for example:

 df.loc[df.TypeOfPerson.isin(["Son", "Daughter"]) & (df.Age <= 18)]

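which returns a dataframe containing several rows. When you put that after an if, pandas cannot tell how the dataframe should be evaluated as a boolean: should it count as True if any cell is True, only if all cells are True, and so on.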

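One way to resolve the error is to state that operation explicitly. In your case, since you want to know whether the household has a child, you can simply check the length of the sliced dataframe.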

 if len(df.loc[df.TypeOfPerson.isin(["Son", "Daughter"]) & (df.Age <= 18)]) > 0:

Of course, this is only one way to get past the error, not the best one.
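As a rough illustration of the length check described above, here is a minimal sketch using the dataframe from the question. The names is_offspring, is_parent, in_household, minors and error_message are made up for this example, and the per-household loop is just one possible way to apply the check, not necessarily what the answer's author had in mind.

```python
import pandas as pd

df = pd.DataFrame({
    "HouseholdNumber": [1, 1, 1, 1, 1, 2, 2],
    "TypeOfPerson": ["Son", "Daughter", "Daughter", "Parent", "Parent", "Daughter", "Parent"],
    "Age": [17, 10, 20, 52, 45, 22, 50],
})

# Boolean masks (hypothetical helper names, not from the original post)
is_offspring = df.TypeOfPerson.isin(["Son", "Daughter"])
is_parent = df.TypeOfPerson == "Parent"

# Putting a DataFrame directly into `if` raises ValueError
try:
    if df.loc[is_offspring & (df.Age <= 18)]:
        pass
except ValueError as err:
    error_message = str(err)

# Start with "no" everywhere, then overwrite the parents household by household
df["Child"] = "no"
for number in df["HouseholdNumber"].unique():
    in_household = df.HouseholdNumber == number
    minors = df.loc[in_household & is_offspring & (df.Age <= 18)]
    young_adults = df.loc[in_household & is_offspring & df.Age.between(19, 25)]
    # len(...) gives an unambiguous integer, so the `if` is well defined
    if len(minors) > 0:
        df.loc[in_household & is_parent, "Child"] = "a"
    elif len(young_adults) > 0:
        df.loc[in_household & is_parent, "Child"] = "b"

print(df)
```

Checking minors first means a household with children in both age groups still gets a, as the question asks. Using `not minors.empty` instead of `len(minors) > 0` should behave the same way.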

Answer 1: (score: 1)

I would use groupby like this, since it lets you deal with one household at a time.

Example (note that not all cases are handled)

import pandas as pd

# Create the dataframe
df = pd.DataFrame(data={
    "TypeOfPerson": ["Son", "Parent", "Daughter", "Son", "Parent", "Daughter", "Daughter", "Parent", "Son"],
    "HouseholdNumber": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Age": [17,50,20,13,40,19,5, 50, 25]
})

# Add new column (object dtype so it can hold strings; rows that are never set stay NaN)
df["Child"] = pd.Series(dtype="object")

# Group by household
households = df.groupby("HouseholdNumber")

# Iterate through groups
for household_number in households.groups:
    household = households.get_group(household_number)

    # Household offspring
    offspring = household.query("TypeOfPerson == 'Son' | TypeOfPerson == 'Daughter'")

    # Sons and daughters that are 18 or younger
    children = offspring.query("Age <= 18")

    # Sons and daughters that are young adults (19 <= Age <= 25)
    young_adults = offspring.query("Age >= 19 & Age <= 25")

    # Parents
    parents = household.query("TypeOfPerson == 'Parent'")

    # Change original data frame
    df.loc[offspring.index, "Child"] = "no"
    if children.shape[0]:
        df.loc[parents.index, "Child"] = "a"
    elif young_adults.shape[0]:
        df.loc[parents.index, "Child"] = "b"
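For comparison, the same per-household logic can probably be written without an explicit loop. The sketch below is an alternative to the loop above, not part of the original answer. It runs on the dataframe from the question, so the result can be checked against the desired output, and the names is_offspring, is_parent, has_minor and has_young_adult are hypothetical. It assumes groupby(...).transform("any") to broadcast a per-household flag to every row and np.select to pick the label.

```python
import numpy as np
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({
    "HouseholdNumber": [1, 1, 1, 1, 1, 2, 2],
    "TypeOfPerson": ["Son", "Daughter", "Daughter", "Parent", "Parent", "Daughter", "Parent"],
    "Age": [17, 10, 20, 52, 45, 22, 50],
})

is_offspring = df["TypeOfPerson"].isin(["Son", "Daughter"])
is_parent = df["TypeOfPerson"] == "Parent"

# Flag matching rows, then mark every row of a household that has at least one match
has_minor = (is_offspring & (df["Age"] <= 18)).groupby(df["HouseholdNumber"]).transform("any")
has_young_adult = (
    (is_offspring & df["Age"].between(19, 25))
    .groupby(df["HouseholdNumber"])
    .transform("any")
)

# np.select takes the first matching condition, so "a" wins over "b";
# anyone who is not a parent, or has no child in either group, gets "no"
df["Child"] = np.select(
    [is_parent & has_minor, is_parent & has_young_adult],
    ["a", "b"],
    default="no",
)

print(df)
```

On the question's data this gives the Child column shown in the desired output. Because the default is "no", parents without a child in either age group are covered as well.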