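I am trying to fill in missing values of a number column in a DataFrame. Within each var group, date runs from 1 to 100; once it reaches 100, some groups start a second cycle at 1 again. Within a group, a date value
can repeat. I need to fill the dates in from 1 to 100. For example, A has the values 1,2,3,3,4,5,6,10 and then 1,2,3,3,4 again. I need each run to become 1,2,3,3,4,5,6,7,8,9,10,11,12,13,14 ......... 100 and then 1,2,3,3,4,5,6,7,8,9,10,11,12,13,14 ......... 100 again. For the dates that get filled in, I want NaN in the remaining columns.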
df = pd.DataFrame({"date": [1,2,3,3,4,5,6,10,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
"var": ["A","A","A", "A", "A", "A","A","A","A", "A", "A","A","A", "B", "B", "B","B","B","B" ,"C", "C", "C","C", "D","D","D","D","D","D"],
"no": [ 1.5, 1.5,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,9,1.2, 1.3, 1.1, 2, 3,9],
"value": [ -1.135632, 1.212112,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, None,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, None,0.87,1.2, 1.3, 1.1, 2, 3,9]})
date var no value
0 1 A 1.5 -1.135632
1 2 A 1.5 1.212112
2 3 A 1.0 0.469112
3 3 A 2.2 -0.282863
4 4 A 3.5 -1.509059
5 5 A 1.5 -1.135632
6 6 A 1.5 1.212112
7 10 A 1.2 -0.173215
8 1 A 1.3 0.119209
9 2 A 1.1 -1.044236
10 3 A 2.0 -0.861849
11 3 A 3.0 NaN
12 4 A 1.0 0.469112
13 1 B 2.2 -0.282863
14 1 B 3.5 -1.509059
15 1 B 1.5 -1.135632
16 4 B 1.5 1.212112
17 4 B 1.2 -0.173215
18 4 B 1.3 0.119209
19 1 C 1.1 -1.044236
20 1 C 2.0 -0.861849
21 1 C 3.0 NaN
22 2 C 9.0 0.870000
23 2 D 1.2 1.200000
24 3 D 1.3 1.300000
25 3 D 1.1 1.100000
26 3 D 2.0 2.000000
27 4 D 3.0 3.000000
28 4 D 9.0 9.000000
The desired output is:
date var no value
1 A 1.5 -1.135632
2 A 1.5 1.212112
3 A 1.0 0.469112
3 A 2.2 -0.282863
4 A 3.5 -1.509059
5 A 1.5 -1.135632
6 A 1.5 1.212112
7 A NaN NaN
8 A NaN NaN
9 A NaN NaN
. . .... ..........
. . .... ..........
. . .... ..........
100 A 1.2 -0.173215
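This is just an example for one group. My DataFrame has at least 300 such groups and 100,000 rows in total. Here the date 3 is repeated, and I need to keep it as is. Please help!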
Answer 0 (score: 1)
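It seems you just want a single column that lays the dates out in order, regardless of what the actual date column says. Here is a solution that creates a new column called "Date_New" to do that. Date_New numbers the rows of each group 1, 2, 3, ... up to 100 and then starts over at 1. Note that it counts rows, so a repeated date such as 3,3 is not preserved and no rows are inserted.
Also, the example you provided already shows its missing values as NaN. If your real data differs, you can use the first line of the answer to replace any placeholder string with NaN, e.g. df.replace("Nothing", np.nan) or df.replace("Nada", np.nan).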
import numpy as np

# Replace whatever placeholder strings you have (here "None") with NaN
df = df.replace("None", np.nan)
# Split the frame into one sub-frame per group
df_groups = df.groupby('var')
date_list = []
# Loop through every group, numbering its rows 1..100
# After row 100, start the count over at 1
for group, df_group in df_groups:
    for i in range(len(df_group)):
        date_list.append(i % 100 + 1)
# Create a new column called Date_New
# (assumes df is already sorted by 'var', since groupby iterates in group order)
df['Date_New'] = date_list
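The per-group row numbering done by the loop above can also be written without an explicit loop. This is a minimal sketch, not part of the original answer: it uses `groupby(...).cumcount()` to get each row's position within its group and wraps it with a modulo so the count restarts after 100. The small frame below is made up purely for illustration.

```python
import pandas as pd

# Illustrative data only: group A has 101 rows so the wrap-around is visible.
df = pd.DataFrame({"var": ["A"] * 101 + ["B"] * 2})

# cumcount() gives the 0-based position of each row inside its group;
# "% 100 + 1" maps positions 0..99 to 1..100 and position 100 back to 1.
df["Date_New"] = df.groupby("var").cumcount() % 100 + 1
```

Because `cumcount()` returns a Series aligned to the original index, this version does not require the frame to be sorted by `var`.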
Answer 1 (score: 0)
Revised answer:
import pandas as pd
from numpy import nan
df = pd.DataFrame({"date": [1,2,3,3,4,5,6,10,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
"var": ["A","A","A", "A", "A", "A","A","A","A", "A", "A","A","A", "B", "B", "B","B","B","B" ,"C", "C", "C","C", "D","D","D","D","D","D"],
"no": [ 1.5, 1.5,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3,9,1.2, 1.3, 1.1, 2, 3,9],
"value": [ -1.135632, 1.212112,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, None,0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
0.119209, -1.044236, -0.861849, None,0.87,1.2, 1.3, 1.1, 2, 3,9]})
# Collect the distinct group labels in order of first appearance
group_ident=[]
for df_group_index in df['var']:
    if df_group_index not in group_ident:
        group_ident.append(df_group_index)
# One counter per group, holding the last date seen in that group
counter=[0 for i in range(len(group_ident))]
# Placeholder frame with the same shape as df, filled in row by row below
df_default=pd.DataFrame({"date":[i for i in range(len(df))],
                         "var":["A" for i in range(len(df))],
                         "no":[nan for i in range(len(df))],
                         "value":[nan for i in range(len(df))]})
for index in range(len(df_default)):
    group_pos=group_ident.index(df['var'][index])
    # If the date dropped below the previous one, offset it by the previous date
    if df['date'][index]<counter[group_pos]:
        date_case=counter[group_pos]+df['date'][index]
    else:
        date_case=df['date'][index]
    counter[group_pos]=df['date'][index]
    print('date case = ' +str(date_case))
    print(counter[group_pos])
    for key in df:
        # .loc avoids chained assignment, which may write to a temporary copy
        if key == 'date':
            df_default.loc[index, key]=date_case
        else:
            df_default.loc[index, key]=df[key][index]
print(df_default)
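Neither answer above inserts the missing rows the question asks for. The following is a rough sketch of one way that might be done, under two assumptions that are not confirmed by the question: a new cycle starts whenever date decreases within a group, and pandas 1.2 or later is available (for `how="cross"`). The function name `fill_missing_dates` and the helper column `cycle` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": [1,2,3,3,4,5,6,10,1,2,3,3,4,1,1,1,4,4,4,1,1,1,2,2,3,3,3,4,4],
    "var": ["A"]*13 + ["B"]*6 + ["C"]*4 + ["D"]*6,
    "no": [1.5,1.5,1,2.2,3.5,1.5,1.5,1.2,1.3,1.1,2,3,1,2.2,3.5,1.5,1.5,1.2,
           1.3,1.1,2,3,9,1.2,1.3,1.1,2,3,9],
    "value": [-1.135632,1.212112,0.469112,-0.282863,-1.509059,-1.135632,
              1.212112,-0.173215,0.119209,-1.044236,-0.861849,None,0.469112,
              -0.282863,-1.509059,-1.135632,1.212112,-0.173215,0.119209,
              -1.044236,-0.861849,None,0.87,1.2,1.3,1.1,2,3,9],
})

def fill_missing_dates(df, lo=1, hi=100):
    """Add a NaN row for every date in lo..hi missing from a (var, cycle)."""
    out = df.copy()
    # Assumption: a drop in date within a group marks the start of a new cycle.
    dropped = out.groupby("var")["date"].diff() < 0
    out["cycle"] = dropped.astype(int).groupby(out["var"]).cumsum()
    # Every (var, cycle) pair crossed with the full date range lo..hi.
    keys = out[["var", "cycle"]].drop_duplicates()
    dates = pd.DataFrame({"date": np.arange(lo, hi + 1)})
    grid = keys.merge(dates, how="cross")
    # The left merge keeps duplicated dates and leaves NaN for absent ones.
    return grid.merge(out, on=["var", "cycle", "date"], how="left")

filled = fill_missing_dates(df)
```

The `cycle` column can be dropped afterwards with `filled.drop(columns="cycle")` if it is not needed. With roughly 300 groups this builds a grid of at most a few tens of thousands of rows per cycle count, so it should stay comfortably in memory for the sizes mentioned in the question.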