Find the start date and end date using Python pandas

Time: 2018-08-08 12:29:12

Tags: python pandas

I have a DataFrame like this:

year  end         id   start                          
 1949  1954.0      ABc  1949.0    
 1950  1954.0      ABc  1949.0   
 1951  1954.0      ABc  1949.0    
 1952  1954.0      ABc  1949.0    
 1953  1954.0      ABc  1949.0    
 1954  1954.0      ABc  1949.0

 1950  1954.0      xyz  1949.0   
 1951  1954.0      xyz  1949.0    
 1952  1954.0      xyz  1949.0    
 1953  1954.0      xyz  1949.0    
 1954  1954.0      xyz  1949.0

 1949  1954.0      cde  1949.0    
 1950  1954.0      cde  1949.0   
 1951  1954.0      cde  1949.0    
 1952  1954.0      cde  1949.0    
 1953  1954.0      cde  1949.0  

I need to find the missing years for each ID. The output should look like this:

 year end id start
 1949 1954 xyz 1949
 1954 1954 cde 1949

We need to check whether the start and end years are present for each ID.

How can I achieve this?

2 answers:

Answer 0: (score: 0)

This should work; see the comments in the code for what each step does:

import pandas as pd
from functools import reduce

# reading the dataframe from your sample
df = pd.read_clipboard()
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')


# create a function that builds the full list of years from startMin to endMax (inclusive) for one row
def findRange(row):
    return list(range(row['startMin'], row['endMax']+1))

# create three grouped dataframes: the list of years, the min start and the max end for each id
year_list = pd.DataFrame(df.groupby('id')['year'].apply(list))
start_min = pd.DataFrame(df.groupby('id')['start'].apply(min)).rename(columns={'start':'startMin'})
end_max = pd.DataFrame(df.groupby('id')['end'].apply(max)).rename(columns={'end':'endMax'})

# merge the three dataframes on id, then apply findRange to each row to get the full year range we expect
dfs = [year_list,start_min,end_max]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final['Range'] = df_final.apply(findRange, axis=1)
df_final.reset_index(inplace=True)

# create a noMatch function to find all the values in b (the expected range) that are not in a (the years present)
def noMatch(a, b):
    return [x for x in b if x not in a]

# use a for loop to iterate through all the rows and find the missing year
df1 = []
for i in range(0, len(df_final)):
    df1.append(noMatch(df_final['year'][i],df_final['Range'][i]))

# create a new dataframe and get your desired output: my column names are different and in a different order;
# however, the output is the same as your desired output
missing_year = pd.DataFrame(df1).rename(columns={0:'missingYear'})
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','startMin','endMax','missingYear']]
df_concat = df_concat[df_concat['missingYear'].notnull()].copy()  # copy so the assignment below does not act on a slice
df_concat['missingYear'] = df_concat['missingYear'].astype('int')
df_concat


    id   startMin   endMax  missingYear
1   cde   1949      1954    1954
2   xyz   1949      1954    1949
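
If the column order from the question (year, end, id, start) matters, one possible alternative is an anti-join: build every (id, year) pair that should exist, left-merge it against the observed rows, and keep the pairs with no match. This is only a sketch, not the method used in the answer above. The sample frame is rebuilt by hand instead of with read_clipboard so the snippet is self-contained, and the names bounds, expected, merged and missing are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (xyz lacks 1949, cde lacks 1954).
rows = (
    [(y, "ABc") for y in range(1949, 1955)]
    + [(y, "xyz") for y in range(1950, 1955)]
    + [(y, "cde") for y in range(1949, 1954)]
)
df = pd.DataFrame(rows, columns=["year", "id"])
df["start"] = 1949
df["end"] = 1954

# One row per id with its start and end bounds.
bounds = df.groupby("id", as_index=False).agg({"start": "min", "end": "max"})

# Every (id, year) pair that should exist between start and end inclusive.
expected = pd.DataFrame(
    [
        (year, b.end, b.id, b.start)
        for b in bounds.itertuples(index=False)
        for year in range(b.start, b.end + 1)
    ],
    columns=["year", "end", "id", "start"],
)

# Anti-join: keep the expected pairs that have no match in the observed data.
merged = expected.merge(
    df[["id", "year"]], on=["id", "year"], how="left", indicator=True
)
missing = merged.loc[
    merged["_merge"] == "left_only", ["year", "end", "id", "start"]
].reset_index(drop=True)
print(missing)
```

The rows come out sorted by id, so cde is listed before xyz, which differs from the order shown in the question.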

Answer 1: (score: 0)

You can use groupby and a set difference, as follows:

# first convert to integer
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')
print (df.groupby(['id','start','end'])
          .apply(lambda x: set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year)))
id   start  end 
ABc  1949   1954        {}
cde  1949   1954    {1954}
xyz  1949   1954    {1949}
dtype: object
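
To see what the lambda computes for a single group, here is the same set difference in plain Python, using the years that id xyz has in the sample data. The variable names are illustrative only.

```python
# Years the id is expected to cover, from start to end inclusive.
expected_years = set(range(1949, 1954 + 1))

# Years actually present for id 'xyz' in the sample data.
observed_years = {1950, 1951, 1952, 1953, 1954}

# Set difference: expected years that were never observed.
missing_years = expected_years - observed_years
print(missing_years)  # {1949}
```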

Now, if you need the desired output format, change the set to a list and add dropna, astype and reset_index:

df_missing = (df.groupby(['id','start','end'])
                .apply(lambda x: [*(set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year))])
                .str[0].dropna().astype(int).reset_index(name='missing_year'))
print (df_missing)
    id  start   end  missing_year
0  cde   1949  1954          1954
1  xyz   1949  1954          1949
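
Note that .str[0] keeps only the first element of each list, so an id that is missing more than one year would be reported with just one of them. A minimal sketch of one way to get one row per missing year follows. It uses a plain loop over the groups instead of apply, the data is hypothetical (xyz is made to lack both 1949 and 1952), and the helper missing_years is not part of the answer.

```python
import pandas as pd

# Hypothetical data: id 'xyz' lacks both 1949 and 1952.
df = pd.DataFrame({
    "year": [1950, 1951, 1953, 1954],
    "id": "xyz",
    "start": 1949,
    "end": 1954,
})

def missing_years(group):
    # Sorted list of years in [start, end] that the group does not contain.
    first = int(group["start"].iloc[0])
    last = int(group["end"].iloc[0])
    return sorted(set(range(first, last + 1)) - set(group["year"]))

# One record per (id, start, end, missing year).
records = [
    {"id": key[0], "start": key[1], "end": key[2], "missing_year": year}
    for key, group in df.groupby(["id", "start", "end"])
    for year in missing_years(group)
]
df_missing = pd.DataFrame(records, columns=["id", "start", "end", "missing_year"])
print(df_missing)
```

Ids with no missing years simply contribute no records, so no dropna step is needed here.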