df1
id  start  end   data
1   2001   2004  [[2004,1],[2003,2],[2002,6],[2001,0.9]]
2   2001   2004  [[2005,1],[2003,2],[2002,6],[2001,0.9]]
3   2001   2004  [[2004,1],[2003,2],[2002,6]]
Output
id missed_one
2 2004
3 2001
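That is the expected output.
I have to check whether every year from start to end is present in data. If any year is missing, it should be printed in the output.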
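For reference, the sample frame can be rebuilt as below. This is a minimal sketch that assumes the `data` column holds nested Python lists of `[year, value]` pairs rather than strings; the name `df` matches what the answers use.

```python
import pandas as pd

# Sample frame from the question; each entry in `data` is a [year, value] pair.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": [2001, 2001, 2001],
    "end": [2004, 2004, 2004],
    "data": [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})

print(df)
```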
Answer 0 (score: 2)
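You can use a set difference. Note that this expression builds its left-hand set from start and end only, so it checks the two endpoints rather than every year in between: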
df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
1 {}
2 {2004}
3 {2001}
dtype: object
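Because the expression above only compares the `start` and `end` values themselves, a gap strictly inside the range would go unnoticed. The following is a minimal sketch of a set difference over the full range, assuming `data` holds nested `[year, value]` lists. The row with id 4 is a hypothetical addition, not part of the question, included only to show an interior gap.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "start": [2001, 2001, 2001, 2001],
    "end": [2004, 2004, 2004, 2004],
    "data": [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
        [[2004, 1], [2001, 3]],  # hypothetical row: 2002 and 2003 are missing
    ],
})

# Expected years (start..end inclusive) minus the years actually present.
missing = [
    set(range(start, end + 1)) - {year for year, _ in data}
    for start, end, data in zip(df["start"], df["end"], df["data"])
]
result = pd.Series(missing, index=df["id"])

print(result)
```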
Answer 1 (score: 0)
Using a list comprehension and zip:
out = df.assign(missing=[
[i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
So, if you only want the rows that have missing years:
out.loc[out.missing.notnull()]
id start end data missing
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
If you only want to show a single missing value rather than a list of missing values, you can use next:
df.assign(missing=[
next((i for i in range(start, end+1) if i not in {d for d, _ in datum}), np.nan)
for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] 2004.0
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] 2001.0
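The years print as `2004.0` and `2001.0` because `np.nan` is a float, which makes the whole column float. If integer years are preferred, one option is pandas' nullable `Int64` dtype with `pd.NA` as the default. This is a sketch under that assumption, not part of the original answer, and `first_missing_year` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": [2001, 2001, 2001],
    "end": [2004, 2004, 2004],
    "data": [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})


def first_missing_year(datum, start, end):
    """Return the first year in start..end absent from datum, or pd.NA."""
    present = {year for year, _ in datum}
    return next((y for y in range(start, end + 1) if y not in present), pd.NA)


first_missing = [
    first_missing_year(datum, start, end)
    for datum, start, end in zip(df["data"], df["start"], df["end"])
]

# A nullable integer array keeps the years as integers alongside missing values.
out = df.assign(missing=pd.array(first_missing, dtype="Int64"))

print(out[["id", "missing"]])
```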
Some timings:
df = pd.concat([df]*10000)
In [145]: %%timeit
...: out = df.assign(missing=[
...: [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
...: for datum, start, end in zip(df.data, df.start, df.end)
...: ])
...:
72.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %%timeit
...: df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
...:
503 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answer 2 (score: 0)
You can subtract sets:
#if necessary convert to nested lists
import ast
df['data'] = df['data'].apply(ast.literal_eval)
df = df.set_index('id')
ranges = df[['start', 'end']].apply(lambda x: set(range(x['start'], x['end'] + 1)), axis=1)
data = df['data'].apply(lambda k: set([z[0] for z in k]))
out = (ranges - data).to_dict()
print (out)
{1: set(), 2: {2004}, 3: {2001}}
df1 = pd.DataFrame([(k, v1) for k, v in out.items() for v1 in v], columns=['id','missed_one'])
print (df1)
id missed_one
0 2 2004
1 3 2001
Details:
print (ranges)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2004}
3 {2001, 2002, 2003, 2004}
print (data)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2005}
3 {2002, 2003, 2004}
Name: data, dtype: object
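The long-format result in this answer (one row per id and missing year) could also be produced with `DataFrame.explode`. This is an alternative sketch rather than the answer's own method, and it assumes `data` already holds nested lists. The name `long_df` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": [2001, 2001, 2001],
    "end": [2004, 2004, 2004],
    "data": [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})

# Sorted lists (not sets) so the exploded rows come out in a stable order.
missing = [
    sorted(set(range(start, end + 1)) - {year for year, _ in data})
    for start, end, data in zip(df["start"], df["end"], df["data"])
]

long_df = (
    pd.DataFrame({"id": df["id"], "missed_one": missing})
    .explode("missed_one")            # one row per missing year; empty lists become NaN
    .dropna(subset=["missed_one"])    # drop ids with nothing missing
    .astype({"missed_one": "int64"})
    .reset_index(drop=True)
)

print(long_df)
```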