Checking for start and end years in data with Python pandas

Time: 2018-08-23 04:40:41

Tags: python pandas

df1

id start end data
1  2001  2004 [[2004,1],[2003,2],[2002,6],[2001,0.9]]
2  2001  2004 [[2005,1],[2003,2],[2002,6],[2001,0.9]]
3  2001  2004 [[2004,1],[2003,2],[2002,6]]

Output

id missed_one
2  2004
3  2001

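That is the expected output.

I need to check whether data contains every year from start to end. If any year is missing, it should be printed in the output.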
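To try the answers below, the example frame can be rebuilt as follows. This is a minimal sketch that assumes the data column holds real nested lists rather than their string representation; the question does not say which it is.

```python
import pandas as pd

# Assumption: `data` contains nested Python lists of [year, value] pairs,
# not strings that merely look like lists.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": [2001, 2001, 2001],
    "end": [2004, 2004, 2004],
    "data": [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})
print(df)
```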

3 answers:

Answer 0 (score: 2)

You can use a set difference

df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))

1        {}
2    {2004}
3    {2001}
dtype: object
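Note that the one-liner above compares only the set {start, end} against the data, so a gap strictly between the two endpoints would go unnoticed. The plain-Python sketch below illustrates the difference on a single hypothetical row (not taken from the question) in which 2002 is missing. Unlike the one-liner, it collects only the years from data and leaves the values out.

```python
# Hypothetical row, for illustration only: the interior year 2002 is absent.
start, end = 2001, 2004
data = [[2004, 1], [2003, 2], [2001, 0.9]]

# Keep only the first element (the year) of every [year, value] pair.
years_present = {year for year, _ in data}

# Endpoint-only check: looks at start and end, nothing in between.
endpoints_missing = {start, end} - years_present

# Full-range check: looks at every year from start to end inclusive.
range_missing = set(range(start, end + 1)) - years_present

print(endpoints_missing)  # set()
print(range_missing)      # {2002}
```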

Answer 1 (score: 0)

Use a list comprehension and zip

out = df.assign(missing=[
    [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
    for datum, start, end in zip(df.data, df.start, df.end)
])

  id  start   end                                            data missing
0   1   2001  2004  [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]]     NaN
1   2   2001  2004  [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]]  [2004]
2   3   2001  2004               [[2004, 1], [2003, 2], [2002, 6]]  [2001]

So, if you only want the rows that have missing years:

out.loc[out.missing.notnull()]

   id  start   end                                            data missing
1   2   2001  2004  [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]]  [2004]
2   3   2001  2004               [[2004, 1], [2003, 2], [2002, 6]]  [2001]

If you only want to show a single missing value rather than a list of missing values, you can use next

df.assign(missing=[
    next((i for i in range(start, end+1) if i not in {d for d, _ in datum}), np.nan)
    for datum, start, end in zip(df.data, df.start, df.end)
])

   id  start   end                                            data  missing
0   1   2001  2004  [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]]      NaN
1   2   2001  2004  [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]]   2004.0
2   3   2001  2004               [[2004, 1], [2003, 2], [2002, 6]]   2001.0
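The two-argument form of next used above can be seen in isolation in this small sketch: it returns the first item the generator yields, or the default when the generator is empty. The sets of years are made up for illustration, and math.nan stands in for np.nan so that the snippet needs no third-party imports.

```python
import math

years = range(2001, 2005)  # 2001..2004 inclusive

# One year (2001) is absent: next() stops at the first missing year.
present = {2002, 2003, 2004}
first_missing = next((y for y in years if y not in present), math.nan)
print(first_missing)  # 2001

# Nothing is absent: the generator is empty, so the default is returned.
present_all = {2001, 2002, 2003, 2004}
none_missing = next((y for y in years if y not in present_all), math.nan)
print(math.isnan(none_missing))  # True
```

Because NaN is a float, a pandas column that mixes it with integer years is stored as float, which is why the output above shows 2004.0 and 2001.0.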

Some timings:

df = pd.concat([df]*10000)

In [145]: %%timeit
     ...: out = df.assign(missing=[
     ...:     [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
     ...:     for datum, start, end in zip(df.data, df.start, df.end)
     ...: ])
     ...:
72.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [146]: %%timeit
     ...: df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
     ...:
503 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 2 (score: 0)

You can subtract sets

#if necessary convert to nested lists
import ast
df['data'] = df['data'].apply(ast.literal_eval)

df = df.set_index('id')
ranges = df[['start', 'end']].apply(lambda x: set(range(x['start'], x['end'] + 1)), axis=1)
data = df['data'].apply(lambda k: set([z[0] for z in k]))

out = (ranges - data).to_dict()
print (out)
{1: set(), 2: {2004}, 3: {2001}}

df1 = pd.DataFrame([(k, v1) for k, v in out.items() for v1 in v], columns=['id','missed_one'])
print (df1)
   id  missed_one
0   2        2004
1   3        2001
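As a possible alternative to the dictionary round trip, the Series of sets can be reshaped into the same two-column frame with Series.explode. This is only a sketch: it assumes pandas 0.25 or later (where explode is available), and the Series is rebuilt by hand here from the values printed above.

```python
import pandas as pd

# Rebuilt by hand from the printed result of (ranges - data).
missing = pd.Series(
    [set(), {2004}, {2001}],
    index=pd.Index([1, 2, 3], name="id"),
    name="missed_one",
)

tidy = (
    missing.apply(sorted)  # turn each set into a sorted list
           .explode()      # one row per missing year; empty lists become NaN
           .dropna()       # drop ids with nothing missing
           .astype(int)
           .reset_index()
)
print(tidy)
```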

Details

print (ranges)
id
1    {2001, 2002, 2003, 2004}
2    {2001, 2002, 2003, 2004}
3    {2001, 2002, 2003, 2004}

print (data)
id
1    {2001, 2002, 2003, 2004}
2    {2001, 2002, 2003, 2005}
3          {2002, 2003, 2004}
Name: data, dtype: object