Question

I have a dataframe, stored with the pandas library, that holds a list of domains (or vertices/nodes in my case):

                 domain
0            airbnb.com
1          facebook.com
2                st.org
3              index.co
4        crunchbase.com
5               avc.com
6        techcrunch.com
7            google.com

I have another dataframe that holds the connections between these domains (also known as edges):

           source_domain    destination_domain
0             airbnb.com            google.com
1           facebook.com            google.com
2                 st.org          facebook.com
3                 st.org            airbnb.com
4                 st.org        crunchbase.com
5               index.co        techcrunch.com
6         crunchbase.com        techcrunch.com
7         crunchbase.com            airbnb.com
8                avc.com        techcrunch.com
9         techcrunch.com                st.org
10        techcrunch.com            google.com
11        techcrunch.com          facebook.com

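Because this dataset will grow much larger, I read that I can get better performance if I represent the "edges" dataframe with integers only, instead of strings.

So I am wondering whether there is a fast way to replace each cell in the edges dataframe with the corresponding id from the domains (i.e. vertices) dataframe? Row 1 of the edges dataframe would then look like this: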

###### Before: ##################### 
1           facebook.com google.com   
###### After:  #####################   
1           1            7

How can I do this? Thanks in advance.

Answer 1

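This is a good use case for categorical data: http://pandas.pydata.org/pandas-docs/stable/categorical.html

In short, a categorical Series represents each item internally as a number but displays it as a string. This is useful when you have many repeated strings.

Using a categorical Series is easier and less error-prone than manually converting everything to integers.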
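A minimal sketch of what this looks like in practice, assuming a reasonably recent pandas. The sample values and variable names are illustrative and not taken from the question's real data.

```python
import pandas as pd

# Illustrative data: a column with many repeated strings.
s = pd.Series(["google.com", "st.org", "google.com", "airbnb.com"] * 1000)

# Convert to a categorical Series: the values still display as strings...
c = s.astype("category")

# ...but each row is stored internally as a small integer code.
codes = c.cat.codes            # integer Series, one code per row
categories = c.cat.categories  # Index holding the distinct strings

# Compare memory use (deep=True also counts the string payloads).
object_bytes = s.memory_usage(deep=True)
category_bytes = c.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and string storage, so they are printed rather than quoted here.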

Answer 2

I tried to implement the other answer: convert to Categorical and get ints with cat.codes:

import pandas as pd

# If the domains in df1 are guaranteed to be unique, unique() can be omitted
#cats = df1['domain'].unique()
cats = df1['domain']
# Older pandas accepted astype('category', categories=cats);
# newer versions expect a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type)
df2['destination_domain'] = df2['destination_domain'].astype(cat_type)
df2['source_code'] = df2['source_domain'].cat.codes
df2['dest_code'] = df2['destination_domain'].cat.codes
print (df2)
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1

# Or overwrite the original columns with the codes directly
cat_type = pd.CategoricalDtype(categories=cats)
df2['source_domain'] = df2['source_domain'].astype(cat_type).cat.codes
df2['destination_domain'] = df2['destination_domain'].astype(cat_type).cat.codes
print (df2)
    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

If you would rather replace via a dict, use map:

d = dict(zip(df1.domain.values, df1.index.values))
df2['source_code'] = df2['source_domain'].map(d)
df2['dest_code'] = df2['destination_domain'].map(d)
print (df2)
     source_domain destination_domain  source_code  dest_code
0       airbnb.com         google.com            0          7
1     facebook.com         google.com            1          7
2           st.org       facebook.com            2          1
3           st.org         airbnb.com            2          0
4           st.org     crunchbase.com            2          4
5         index.co     techcrunch.com            3          6
6   crunchbase.com     techcrunch.com            4          6
7   crunchbase.com         airbnb.com            4          0
8          avc.com     techcrunch.com            5          6
9   techcrunch.com             st.org            6          2
10  techcrunch.com         google.com            6          7
11  techcrunch.com       facebook.com            6          1
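One thing worth checking with the map approach: a value that is missing from the dictionary becomes NaN instead of raising an error. A small sketch of that check, using made-up frames (`vertices_demo`, `edges_demo` and the domain `unknown.example` are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical vertex and edge tables; one edge refers to an unlisted domain.
vertices_demo = pd.DataFrame({"domain": ["airbnb.com", "facebook.com", "google.com"]})
edges_demo = pd.DataFrame({
    "source_domain": ["airbnb.com", "facebook.com", "unknown.example"],
    "destination_domain": ["google.com", "google.com", "airbnb.com"],
})

# Same dictionary construction as above: domain string -> integer index.
lookup = dict(zip(vertices_demo["domain"], vertices_demo.index))

source_code = edges_demo["source_domain"].map(lookup)
dest_code = edges_demo["destination_domain"].map(lookup)

# Domains that have no id show up as NaN after map; list them explicitly.
unmatched = edges_demo.loc[source_code.isna(), "source_domain"].tolist()
print(unmatched)

# NaN forces a float column; the nullable Int64 dtype keeps integer ids.
source_code = source_code.astype("Int64")
```

Whether unmatched domains should be dropped, added to the vertex table, or treated as an error depends on the dataset.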

Answer 3

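The easiest way is to build a dictionary from the vertices dataframe and use it with replace, IF we can be sure that the vertices dataframe represents the definitive set of vertices that will appear in the edges.

Since the index of the vertices dataframe already holds the factor information: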

m = dict(zip(vertices.domain, vertices.index))
edges.replace(m)

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1

You can also use stack / map / unstack:

m = dict(zip(vertices.domain, vertices.index))
edges.stack().map(m).unstack()

    source_domain  destination_domain
0               0                   7
1               1                   7
2               2                   1
3               2                   0
4               2                   4
5               3                   6
6               4                   6
7               4                   0
8               5                   6
9               6                   2
10              6                   7
11              6                   1
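To make the stack / map / unstack round trip easier to follow, here is a self-contained sketch on a tiny made-up example (the frames and the `*.example` domains are hypothetical). It also compares the result with mapping each column separately, which should give the same ids.

```python
import pandas as pd

# Hypothetical miniature versions of the vertices and edges frames.
vertices = pd.DataFrame({"domain": ["a.example", "b.example", "c.example"]})
edges = pd.DataFrame({
    "source_domain": ["a.example", "b.example"],
    "destination_domain": ["b.example", "c.example"],
})

m = dict(zip(vertices["domain"], vertices.index))

# stack() turns the 2x2 frame into one long Series indexed by (row, column).
stacked = edges.stack()

# map() translates every string in that single Series to its integer id.
mapped = stacked.map(m)

# unstack() pivots the column labels back out into a frame.
result = mapped.unstack()

# Equivalent formulation: map each column on its own.
per_column = edges.apply(lambda col: col.map(m))
print(result)
```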

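Editorial

Besides providing my own information, I also wanted to comment on @JohnZwinck's answer.

First, categorical can give faster performance. However, I am not clear on a way to ensure that two columns have coordinated categories. By coordinated I mean the following: behind the scenes, each column gets its own set of integers assigned to its categories, and as far as I know we can neither know nor enforce that those integers are the same across columns. If we made it one big column and then converted that column to categorical, that would work. However, I believe it would turn back into object once we split it into two columns again.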
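On the coordination concern: in pandas versions that provide `pd.CategoricalDtype`, one way that appears to address it is to apply a single shared dtype to both columns, so that both draw their codes from the same ordered list of categories. A sketch with made-up data (the names `domains`, `edges_demo` and `shared` are illustrative):

```python
import pandas as pd

# Hypothetical vertex list; the position in this list becomes the code.
domains = ["airbnb.com", "facebook.com", "st.org", "google.com"]
edges_demo = pd.DataFrame({
    "source_domain": ["airbnb.com", "st.org", "st.org"],
    "destination_domain": ["google.com", "facebook.com", "airbnb.com"],
})

# One dtype object, applied to every column of the frame.
shared = pd.CategoricalDtype(categories=domains)
coded = edges_demo.astype(shared)

# Both columns now use the same category -> code assignment.
src = coded["source_domain"].cat.codes.tolist()
dst = coded["destination_domain"].cat.codes.tolist()
print(src, dst)
```

Here "airbnb.com" gets the same code in both columns because the code is simply its position in `domains`, and the columns stay categorical instead of reverting to object.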

Replacing cell values in Pandas with matching IDs from another dataframe