I have a dataframe with many columns (for simplicity, only col1, col2 and col3 are shown here):
id col1 col2 col3 source_id
a1 765.3 234 cat a5
a2 3298.3 none dog a4
a3 8762.1 27 rat a8
a4 none none none none
a5 none none none a6
a6 none none none none
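I want to fill the rows that contain none with the values of the row whose source_id points to them.
For example, row a1 has source_id a5 and row a5 contains none, so row a5 must be replaced with the values of row a1; subsequently, row a5 has source_id a6 and row a6 contains none, so row a6 must be replaced with the values of the a5 row.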
Expected output:
id col1 col2 col3 source_id
a1 765.3 234 cat a5
a2 3298.3 none dog a4
a3 8762.1 27 rat a8
a4 3298.3 none dog none
a5 765.3 234 cat a6
a6 765.3 234 cat none
Answer 0 (score: 1)
First, none looks like a string here, so replace those values with missing values:
df = df.mask(df.eq('none'), None)
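To try this step in isolation, here is a minimal sketch that rebuilds the sample frame from the question and applies the same replacement. The literal values are copied from the table above; the name missing_per_column is only illustrative.

```python
import pandas as pd

# Sample data copied from the question; 'none' is stored as a plain string.
df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'col1': [765.3, 3298.3, 8762.1, 'none', 'none', 'none'],
    'col2': [234, 'none', 27, 'none', 'none', 'none'],
    'col3': ['cat', 'dog', 'rat', 'none', 'none', 'none'],
    'source_id': ['a5', 'a4', 'a8', 'none', 'a6', 'none'],
})

# Same replacement as in the answer: cells equal to the string 'none' become missing.
df = df.mask(df.eq('none'), None)

# Count the missing cells per column to confirm the replacement took effect.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

After this step, isna(), ffill() and bfill() treat the former 'none' cells as missing.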
Then build a dictionary that maps each node to its component, using connected_components from networkx:
import networkx as nx
# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df[['id','source_id']].dropna().itertuples(index=False))
connected_components = nx.connected_components(g)
# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid + 1
print(node2id)
{'a6': 1, 'a5': 1, 'a1': 1, 'a2': 2, 'a4': 2, 'a8': 3, 'a3': 3}
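If networkx is not available, the same grouping can be sketched with a small union-find in plain Python. This is a swapped-in technique, not the answer's method; the edge list is hard-coded from the question's id/source_id pairs, and find, union and node2id_uf are illustrative names.

```python
# Edges (id, source_id) from the question, skipping rows whose source_id is missing.
edges = [('a1', 'a5'), ('a2', 'a4'), ('a3', 'a8'), ('a5', 'a6')]

parent = {}

def find(node):
    """Return the representative of the set that contains node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving keeps chains short
        node = parent[node]
    return node

def union(a, b):
    """Merge the sets that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for a, b in edges:
    union(a, b)

# Number the components in order of first appearance, starting at 1.
root2id = {}
node2id_uf = {}
for node in list(parent):
    root = find(node)
    root2id.setdefault(root, len(root2id) + 1)
    node2id_uf[node] = root2id[root]

print(node2id_uf)
```

Only the grouping matters for the next step; the specific component numbers are arbitrary labels.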
Finally, group by the id column mapped through that dictionary and replace None by forward and backward filling within each group:
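# group_keys=False keeps the original row index, so assign() can realign source_id
df1 = (df.groupby(df['id'].map(node2id), group_keys=False)
         .apply(lambda x: x.ffill().bfill())
         .assign(source_id = df['source_id']))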
print (df1)
id col1 col2 col3 source_id
0 a1 765.3 234 cat a5
1 a2 3298.3 None dog a4
2 a3 8762.1 27 rat a8
3 a4 3298.3 None dog None
4 a5 765.3 234 cat a6
5 a6 765.3 234 cat None
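A possible variant of the fill step, sketched under assumptions: it fills only the value columns with groupby(...).transform, so source_id is never touched and does not need to be restored afterwards. The component labels are hard-coded from the printed dictionary so the sketch needs only pandas; value_cols and filled are illustrative names.

```python
import pandas as pd

# Sample data from the question, with 'none' already converted to missing values.
df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'col1': [765.3, 3298.3, 8762.1, None, None, None],
    'col2': [234, None, 27, None, None, None],
    'col3': ['cat', 'dog', 'rat', None, None, None],
    'source_id': ['a5', 'a4', 'a8', None, 'a6', None],
})

# Component labels as printed by the answer (hard-coded here).
node2id = {'a6': 1, 'a5': 1, 'a1': 1, 'a2': 2, 'a4': 2, 'a8': 3, 'a3': 3}

value_cols = ['col1', 'col2', 'col3']  # assumption: the columns that should be filled
groups = df['id'].map(node2id)

# Fill each value column within its component; source_id is left as it was.
filled = df.copy()
filled[value_cols] = (df.groupby(groups)[value_cols]
                        .transform(lambda s: s.ffill().bfill()))
print(filled)
```

Note that ffill/bfill copies from any row in the same component, so a value that is missing in the source row itself (col2 of a2 here) stays missing only when no other row in the component has it.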
Answer 1 (score: 0)
The first thing to do is set the id column as the index, so that you can look up the row whose cells need to be filled:
df = df.set_index('id')
Then you can iterate over the columns and fill them:
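for col in df.columns:
    if col == 'source_id':
        continue
    for idx in df.index:
        dst_idx = df.at[idx, 'source_id']
        # Copy only when the source cell has a value and the target cell is still 'none'
        if (df.at[idx, col] != 'none'
                and dst_idx != 'none'
                and dst_idx in df.index
                and df.at[dst_idx, col] == 'none'):
            # .at avoids chained assignment, which may not write back to df
            df.at[dst_idx, col] = df.at[idx, col]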
col1 col2 col3 source_id
id
a1 765.3 234 cat a5
a2 3298.3 none dog a4
a3 8762.1 27 rat a8
a4 3298.3 none dog none
a5 765.3 234 cat a6
a6 765.3 234 cat none
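The single pass above works on the sample because row a1 comes before row a5, so a5 is already filled when it is copied on to a6. As a hedged sketch for data where that ordering is not guaranteed, the loop can be repeated until nothing changes. The function name propagate and its missing parameter are hypothetical, and the sample rows are deliberately reversed.

```python
import pandas as pd

def propagate(df, missing='none'):
    """Copy values along id -> source_id links until no cell changes."""
    out = df.copy()
    value_cols = [c for c in out.columns if c != 'source_id']
    changed = True
    while changed:
        changed = False
        for idx in out.index:
            dst = out.at[idx, 'source_id']
            # Skip rows without a target, or whose target is not a known id.
            if dst == missing or dst not in out.index:
                continue
            for col in value_cols:
                src_val = out.at[idx, col]
                if src_val != missing and out.at[dst, col] == missing:
                    out.at[dst, col] = src_val
                    changed = True
    return out

# Rows listed in reverse order, so a5 is visited before a1 has filled it.
df = pd.DataFrame({
    'id': ['a6', 'a5', 'a4', 'a3', 'a2', 'a1'],
    'col1': ['none', 'none', 'none', 8762.1, 3298.3, 765.3],
    'col2': ['none', 'none', 'none', 27, 'none', 234],
    'col3': ['none', 'none', 'none', 'rat', 'dog', 'cat'],
    'source_id': ['none', 'a6', 'none', 'a8', 'a4', 'a5'],
}).set_index('id')

result = propagate(df)
print(result)
```

Each pass can only turn a 'none' cell into a real value, so the loop stops once no cell is left to fill along any link.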