Pandas dataframe: is it possible to fill in missing values in a cycle within groups?

Time: 2019-08-15 17:29:02

Tags: python pandas dataframe

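I am trying to fill in missing values of a number column in a dataframe. Within each group of the variable var, the date runs from 1 to 100; once the date reaches 100, a second date cycle starts again from 1 for some variables. Within a variable, a date can repeat. I need to fill in the dates from 1 to 100. For example, A has the dates 1,2,3,3,4,5,6,10 and then 1,2,3,3,4 again. I need them to be 1,2,3,3,4,5,6,7,8,9,10,11,12,13,14 ... 100 and then again 1,2,3,3,4,5,6,7,8,9,10,11,12,13,14 ... 100. When I fill in a date, I want NaN in the remaining columns.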

import pandas as pd

df = pd.DataFrame({"date": [1,2,3,3,4,5,6,10,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
               "var": ["A","A","A", "A", "A", "A","A","A","A", "A", "A","A","A", "B", "B", "B","B","B","B" ,"C", "C", "C","C", "D","D","D","D","D","D"],
               "no": [ 1.5, 1.5,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,9,1.2, 1.3, 1.1, 2, 3,9],
               "value": [ -1.135632, 1.212112,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
                         0.119209, -1.044236, -0.861849, None,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
                         0.119209, -1.044236, -0.861849, None,0.87,1.2, 1.3, 1.1, 2, 3,9]})
 date  var  no      value
0   1   A   1.5    -1.135632
1   2   A   1.5     1.212112
2   3   A   1.0     0.469112
3   3   A   2.2    -0.282863
4   4   A   3.5    -1.509059
5   5   A   1.5    -1.135632
6   6   A   1.5     1.212112
7   10  A   1.2    -0.173215
8   1   A   1.3     0.119209
9   2   A   1.1    -1.044236
10  3   A   2.0    -0.861849
11  3   A   3.0    NaN
12  4   A   1.0    0.469112
13  1   B   2.2    -0.282863
14  1   B   3.5    -1.509059
15  1   B   1.5    -1.135632
16  4   B   1.5    1.212112
17  4   B   1.2    -0.173215
18  4   B   1.3    0.119209
19  1   C   1.1    -1.044236
20  1   C   2.0    -0.861849
21  1   C   3.0    NaN
22  2   C   9.0    0.870000
23  2   D   1.2    1.200000
24  3   D   1.3    1.300000
25  3   D   1.1    1.100000
26  3   D   2.0    2.000000
27  4   D   3.0    3.000000
28  4   D   9.0    9.000000

The desired output is:

date   var  no      value
1       A   1.5    -1.135632
2       A   1.5     1.212112
3       A   1.0     0.469112
3       A   2.2    -0.282863
4       A   3.5    -1.509059
5       A   1.5    -1.135632
6       A   1.5     1.212112
7       A       NaN        NaN
8       A       NaN        NaN 
9       A       NaN        NaN  
.       .       ....       ..........
.       .       ....       ..........
.       .       ....       ..........
100 A   1.2    -0.173215

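This is just an example for one group. I have at least 300 such groups in my dataframe, with 100,000 rows in total. Here the date 3 is repeated, and I need to keep it that way. Please help!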

2 answers:

Answer 0: (score: 1)

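It seems you just want a column that organizes the dates regardless of what the actual date column says. Here is a solution that creates a new column called "Date_New" to do this for you. Date_New numbers the rows of each group 1, 2, 3, ... up to 100 and then starts again from 1.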

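Also, the example you provided already shows its missing values as NaN. If your actual data is different, you can use the first line of the answer to replace any placeholder string with NaN [i.e. df.replace("Nothing", np.nan) or df.replace("Nada", np.nan)].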

import numpy as np

# Replace whatever placeholder strings you have with NaN
df = df.replace("None", np.nan)

# Split the dataframe by group, keeping groups in order of first appearance
df_groups = df.groupby('var', sort=False)

date_list = []

# Loop through every group, assigning each row its 1-based position
# within the group; positions past 100 wrap around and start again at 1
for group, df_group in df_groups:
    for i in range(len(df_group)):
        date_list.append(i % 100 + 1)

# Create a new column called Date_New
# (assumes the rows of each group are contiguous in df)
df['Date_New'] = date_list
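
For 100,000 rows the same numbering can probably be done without an explicit loop. Below is a minimal sketch, assuming the goal is only a per-group running position that wraps after 100 rows; the helper name add_cycle_position and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

def add_cycle_position(df, group_col="var", cycle_length=100):
    """Return a copy of df with a Date_New column counting 1..cycle_length per group."""
    out = df.copy()
    # cumcount() is each row's 0-based position within its group;
    # the modulo wraps the count back to 1 after cycle_length rows
    out["Date_New"] = out.groupby(group_col).cumcount() % cycle_length + 1
    return out

# Tiny illustrative frame, not the data from the question
example = pd.DataFrame({"var": ["A", "A", "A", "B", "B"],
                        "value": [0.1, 0.2, 0.3, 1.0, 2.0]})
numbered = add_cycle_position(example, cycle_length=2)
print(numbered)
```

Unlike the loop above, this version does not need the rows of a group to sit next to each other, because cumcount() keeps the original index.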

Answer 1: (score: 0)

Revised answer:

import pandas as pd
from numpy import nan

df = pd.DataFrame({"date": [1,2,3,3,4,5,6,10,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
               "var": ["A","A","A", "A", "A", "A","A","A","A", "A", "A","A","A", "B", "B", "B","B","B","B" ,"C", "C", "C","C", "D","D","D","D","D","D"],
               "no": [ 1.5, 1.5,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,9,1.2, 1.3, 1.1, 2, 3,9],
               "value": [ -1.135632, 1.212112,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
                         0.119209, -1.044236, -0.861849, None,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
                         0.119209, -1.044236, -0.861849, None,0.87,1.2, 1.3, 1.1, 2, 3,9]})

# Highest date seen so far in each group
counter = {group: 0 for group in df['var'].unique()}

# Start from a copy so every other column keeps its original values
df_default = df.copy()

for index in df.index:
    group = df.loc[index, 'var']
    date = df.loc[index, 'date']
    if date < counter[group]:
        # The date went backwards, so a new cycle has started:
        # offset it by the highest date seen so far in this group
        date_case = counter[group] + date
    else:
        date_case = date
        counter[group] = date
    # Use .loc rather than chained indexing so the assignment is reliable
    df_default.loc[index, 'date'] = date_case

print(df_default)
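
Neither snippet above actually inserts the missing dates, so here is a rough sketch of one way that might be done. It assumes a new cycle starts whenever the date decreases within a group, which fails if one cycle ends below the value at which the next one begins. The function name fill_missing_dates, its parameters, and the temporary _cycle column are hypothetical.

```python
import pandas as pd

def fill_missing_dates(df, group_col="var", date_col="date", last_date=100):
    """Add a NaN row for every date missing from each 1..last_date cycle."""
    out = df.copy()
    # Assumption: a drop in the date within a group marks the start of a new cycle
    restarted = (out.groupby(group_col)[date_col].diff() < 0).astype(int)
    out["_cycle"] = restarted.groupby(out[group_col]).cumsum()

    full_range = pd.DataFrame({date_col: range(1, last_date + 1)})
    pieces = []
    for (group, cycle), part in out.groupby([group_col, "_cycle"], sort=False):
        # A left merge keeps repeated dates and yields a NaN row per missing date
        merged = full_range.merge(part, on=date_col, how="left")
        merged[group_col] = group
        pieces.append(merged)
    return pd.concat(pieces, ignore_index=True).drop(columns="_cycle")

# Tiny illustrative frame with two cycles, using last_date=6 to keep it short
example = pd.DataFrame({"date": [1, 2, 3, 3, 6, 1, 2],
                        "var": ["A"] * 7,
                        "value": [10.0, 20.0, 30.0, 31.0, 60.0, 11.0, 21.0]})
filled = fill_missing_dates(example, last_date=6)
print(filled)
```

The loop runs once per (group, cycle) pair rather than once per row, so for about 300 groups it should stay manageable; the repeated date 3 is kept as two rows.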