Question

Consider this simple example:

pd.DataFrame({'id' : [1,1,2,3,4],
              'place' : ['bar','pool','bar','kitchen','bar']})

Out[4]: 
   id    place
0   1      bar
1   1     pool
2   2      bar
3   3  kitchen
4   4      bar

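The network structure here is that a given id is connected to another id if both went to the same place.

For example, here 1 is connected to 2 and 4 because they all went to bar.

1 and 3 are not connected, because 1 went to bar and pool, neither of which is kitchen (the only place 3 went).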
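To make the rule concrete, here is a small brute-force check in plain Python. It is only a restatement of the rule above, not an efficient solution, and `connected` is a hypothetical helper name used for illustration.

```python
from collections import defaultdict

# Same (id, place) rows as the example DataFrame.
rows = [(1, 'bar'), (1, 'pool'), (2, 'bar'), (3, 'kitchen'), (4, 'bar')]

# Collect the set of places each id went to.
places = defaultdict(set)
for id_, place in rows:
    places[id_].add(place)

def connected(a, b):
    """Two distinct ids are connected if they share at least one place."""
    return a != b and bool(places[a] & places[b])

print(connected(1, 2), connected(1, 4), connected(1, 3))
```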

My real data is very large, around 500k. What is the most efficient way to get the adjacency list? It is simply a string in adjacency list format, as in https://networkx.github.io/documentation/networkx-1.10/reference/readwrite.adjlist.html

source target target

Can we avoid loops and use pandas tricks?

Thanks!

Answer 1

Use unique, then concat the result with a copy in which columns 0 and 1 are swapped:

adj=pd.DataFrame(df.groupby('place').id.unique().loc[lambda x : x.str.len()>1].tolist())
pd.concat([adj,adj.rename(columns={0:1,1:0})])
Out[810]: 
   0  1
0  1  2
0  2  1
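The snippet above works when each shared place has exactly two ids. For places with more ids, such as bar in the question's data, one possible alternative is a self-merge on place followed by a groupby. This is a sketch under stated assumptions rather than a tuned solution: the names `pairs`, `targets` and `adj_lines` are illustrative, ids with no neighbours (3 here) are left out, and the self-merge can get large when many ids share one place.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 4],
                   'place': ['bar', 'pool', 'bar', 'kitchen', 'bar']})

# Pair every id with every id that shares a place, then drop self-pairs
# and pairs that repeat because two ids share more than one place.
pairs = df.merge(df, on='place', suffixes=('_src', '_dst'))
pairs = pairs.loc[pairs['id_src'] != pairs['id_dst'], ['id_src', 'id_dst']]
pairs = pairs.drop_duplicates().sort_values(['id_src', 'id_dst'])

# Join the targets of each source into one space-separated string.
targets = (pairs.groupby('id_src')['id_dst']
                .agg(lambda s: ' '.join(map(str, s)))
                .reset_index())

# Build the "source target target" lines.
adj_lines = (targets['id_src'].astype(str) + ' ' + targets['id_dst']).tolist()
print(adj_lines)  # ['1 2 4', '2 1 4', '4 1 2']
```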

Update:

import itertools
import networkx as nx

newdf=df.merge(df,on='place') # self-merge on place links every pair of ids that share a place
G=nx.from_pandas_edgelist(newdf,'id_x','id_y') # named from_pandas_dataframe in networkx 1.x
[list(itertools.permutations(c, 2)) for c in nx.connected_components(G)] # all ordered pairs within each connected component
Out[821]: [[(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]]

Data input

df
Out[822]: 
   id place
0   1   bar
1   1  pool
2   2   bar
3   3   bar
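Note that a connected component also groups ids that are only linked through a chain of other ids, so pairs taken from a component are not necessarily direct neighbours. If only direct neighbours are wanted, a possible sketch is to let networkx write the adjacency list itself with `nx.generate_adjlist`. This assumes networkx 2.x or later, where `from_pandas_edgelist` is available; the self-loop filter is an addition for illustration.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'place': ['bar', 'pool', 'bar', 'bar']})

# Self-merge on place, then drop rows that pair an id with itself.
newdf = df.merge(df, on='place')
newdf = newdf[newdf['id_x'] != newdf['id_y']]

# Build an undirected graph from the id pairs.
G = nx.from_pandas_edgelist(newdf, 'id_x', 'id_y')

# One "source target target" line per node; for an undirected graph
# each edge is written only once.
lines = list(nx.generate_adjlist(G))
print('\n'.join(lines))
```

`nx.write_adjlist(G, path)` writes the same lines to a file.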

Answer 2

How about:

>>> df
   id    place
0   1      bar
1   1     pool
2   2      bar
3   3  kitchen
>>> df.groupby('place').id.nunique().value_counts()
1    2
2    1
Name: id, dtype: int64

Efficient way to get a network adjacency list?
