I have a dataframe like this:
year end id start
1949 1954.0 ABc 1949.0
1950 1954.0 ABc 1949.0
1951 1954.0 ABc 1949.0
1952 1954.0 ABc 1949.0
1953 1954.0 ABc 1949.0
1954 1954.0 ABc 1949.0
1950 1954.0 xyz 1949.0
1951 1954.0 xyz 1949.0
1952 1954.0 xyz 1949.0
1953 1954.0 xyz 1949.0
1954 1954.0 xyz 1949.0
1949 1954.0 cde 1949.0
1950 1954.0 cde 1949.0
1951 1954.0 cde 1949.0
1952 1954.0 cde 1949.0
1953 1954.0 cde 1949.0
I need to find the missing years for each id. The output should look like this:
year end id start
1949 1954 xyz 1949
1954 1954 cde 1949
For each id, we have to check whether the years from start through end are all present.
How can I achieve this?
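For anyone who wants to reproduce this without copying the table to the clipboard, the sample above can be rebuilt roughly as follows. This is only a sketch: the helper name `years_by_id` is made up for illustration, and `df` is chosen to match the variable name used in the answers.

```python
import pandas as pd

# Years present for each id, copied from the sample table above.
# `years_by_id` is a hypothetical helper name, not part of the original question.
years_by_id = {
    "ABc": [1949, 1950, 1951, 1952, 1953, 1954],
    "xyz": [1950, 1951, 1952, 1953, 1954],  # 1949 is absent
    "cde": [1949, 1950, 1951, 1952, 1953],  # 1954 is absent
}

# One row per (id, year); start and end are floats, as in the sample.
rows = [
    {"year": year, "end": 1954.0, "id": id_, "start": 1949.0}
    for id_, years in years_by_id.items()
    for year in years
]
df = pd.DataFrame(rows, columns=["year", "end", "id", "start"])
print(df)
```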
Answer 0 (score: 0)
This should work; see the comments in the code for what each step does:
import pandas as pd
from functools import reduce
# reading the dataframe from your sample
df = pd.read_clipboard()
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')
# create a function that builds the full list of years from the min start to the max end
def findRange(row):
    return list(range(row['startMin'], row['endMax']+1))
# create three grouped dataframes: the list of years, the min start and the max end for each id
year_list = pd.DataFrame(df.groupby('id')['year'].apply(list))
start_min = pd.DataFrame(df.groupby('id')['start'].apply(min)).rename(columns={'start':'startMin'})
end_max = pd.DataFrame(df.groupby('id')['end'].apply(max)).rename(columns={'end':'endMax'})
# apply the findRange function for each grouped ID to see the date range we are looking for
dfs = [year_list,start_min,end_max]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final['Range'] = df_final.apply(findRange, axis=1)
df_final.reset_index(inplace=True)
# create a noMatch function to find all the values in the range created above that are not in the year list
def noMatch(a, b):
    return [x for x in b if x not in a]
# use a for loop to iterate through all the rows and find the missing year
df1 = []
for i in range(0, len(df_final)):
    df1.append(noMatch(df_final['year'][i],df_final['Range'][i]))
# create a new dataframe and get your desired output: my column names are different and in a different order;
# however, the output is the same as your desired output
missing_year = pd.DataFrame(df1).rename(columns={0:'missingYear'})
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','startMin','endMax','missingYear']]
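# keep only ids that have a missing year; .copy() avoids a SettingWithCopyWarning on the next line
df_concat = df_concat[df_concat['missingYear'].notnull()].copy()
df_concat['missingYear'] = df_concat['missingYear'].astype('int')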
df_concat
id startMin endMax missingYear
1 cde 1949 1954 1954
2 xyz 1949 1954 1949
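The same idea (build the expected years per id, then keep the ones that are not observed) can also be written as an anti-join. The following is a minimal sketch rather than the answer's own method; it assumes pandas 0.25 or later for `explode` and named aggregation, and the names `bounds`, `expected`, `merged` and `missing` are illustrative only.

```python
import pandas as pd

# Sample data rebuilt inline so the sketch is self-contained.
years_by_id = {
    "ABc": [1949, 1950, 1951, 1952, 1953, 1954],
    "xyz": [1950, 1951, 1952, 1953, 1954],
    "cde": [1949, 1950, 1951, 1952, 1953],
}
df = pd.DataFrame(
    [(year, 1954, id_, 1949) for id_, years in years_by_id.items() for year in years],
    columns=["year", "end", "id", "start"],
)

# One row per id with its overall start and end.
bounds = df.groupby("id", as_index=False).agg(start=("start", "min"), end=("end", "max"))

# Expand each id into every year it is expected to have.
bounds["year"] = [list(range(s, e + 1)) for s, e in zip(bounds["start"], bounds["end"])]
expected = bounds.explode("year")
expected["year"] = expected["year"].astype("int64")

# Left-merge expected against observed; rows found only on the left are the missing years.
merged = expected.merge(df[["id", "year"]], on=["id", "year"], how="left", indicator=True)
missing = merged.loc[
    merged["_merge"] == "left_only", ["year", "end", "id", "start"]
].reset_index(drop=True)
print(missing)
```

Unlike the loop above, this keeps one row per missing year, so an id with several gaps is reported in full, and the columns come out in the order asked for in the question.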
Answer 1 (score: 0)
You can use groupby with a set difference:
# first convert to integer
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')
print (df.groupby(['id','start','end'])
         .apply(lambda x: set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year)))
id start end
ABc 1949 1954 {}
cde 1949 1954 {1954}
xyz 1949 1954 {1949}
dtype: object
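Now, if you need the desired output format, convert the set to a list, take its first element with str[0], and add dropna, astype and reset_index. Note that str[0] keeps only the first missing year for each id: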
df_missing = (df.groupby(['id','start','end'])
                .apply(lambda x: [*(set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year))])
                .str[0].dropna().astype(int).reset_index(name='missing_year'))
print (df_missing)
id start end missing_year
0 cde 1949 1954 1954
1 xyz 1949 1954 1949
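If an id can have more than one missing year, the set difference can be kept whole and written out as one row per missing year. This is a rough sketch under that assumption; the data below is hypothetical (id xyz lacks both 1949 and 1952) and is not taken from the question, and `records` is an illustrative name.

```python
import pandas as pd

# Hypothetical data: "xyz" is missing two years, "ABc" is complete.
df = pd.DataFrame(
    {
        "year": [1950, 1951, 1953, 1954, 1949, 1950, 1951, 1952, 1953, 1954],
        "end": 1954,
        "id": ["xyz"] * 4 + ["ABc"] * 6,
        "start": 1949,
    }
)

# For each (id, start, end) group, subtract the observed years from the expected range.
records = []
for (id_, start, end), group in df.groupby(["id", "start", "end"]):
    missing = sorted(set(range(start, end + 1)) - set(group["year"]))
    records.extend(
        {"id": id_, "start": start, "end": end, "missing_year": year} for year in missing
    )

df_missing = pd.DataFrame(records, columns=["id", "start", "end", "missing_year"])
print(df_missing)
```

Iterating over the groups instead of calling apply keeps every missing year and leaves complete ids out of the result without a dropna step.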