Question

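I have two data frames that contain the same string column (Hostname), and I want to compute the Levenshtein distance between every possible pair of hostnames across the two data frames. The results should go into a third data frame that holds, for each combination, the distance together with the two indexes of that pair.

For example, suppose I have the following two data frames: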

Index      Hostname
85608             dlt-rly-tracker-3.datto.com
9378      lnv7bc4241e2.1528.ozvision.ozsn.net
22791             dlt-rly-tracker-1.datto.com
88922                                 pw-file
94560     lnv7bc4241e2.1528.ozvision.ozsn.net
13245                                       -
63604                                 pw-file
435839                                pw-file
95473                                       -
13856                                 pw-file
210705                                pw-file
30046                                       -
106917            dlt-rly-tracker-2.datto.com
415925                                pw-file
170471                                pw-file
73971                                       -
86885             dlt-rly-tracker-3.datto.com
162764                                pw-file
74791                                 pw-file

And the second data frame:

Index     Hostname
93358                  device.dattobackup.com
34067             dlt-rly-tracker-5.datto.com
18083               46.104.89.54.in-addr.arpa
96798                                 pw-file
130940                                pw-file
31476     lnv7bc4241e2.1528.ozvision.ozsn.net
149723                                pw-file
52901                                       -
308834    lnv7bc4241e2.1528.ozvision.ozsn.net
24196                                 pw-file
69038                                       -
244454    lnv7bc4241e2.1528.ozvision.ozsn.net
2867                                        -
45549                        daisy.ubuntu.com
334378                                pw-file
86006               46.104.89.54.in-addr.arpa
430257                                pw-file
86150               46.104.89.54.in-addr.arpa
65189                                 pw-file

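What I want to do is take the first Hostname value (dlt-rly-tracker-3.datto.com) and compute its Levenshtein distance to every Hostname value in the second data frame, one at a time. At the end of this process, the results should be stored in a new data frame that looks something like this: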

Indexes         Distance    Hostnames
85608-93358     23          dlt-rly-tracker-3.datto.com,device.dattobackup.com
85608-34067     60          dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com

I would really appreciate any help with this problem. Thank you.
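
For reference, here is a minimal sketch of what the Levenshtein distance measures: the smallest number of single-character insertions, deletions, and substitutions that turn one string into another. The function name `levenshtein` is hypothetical and is not part of the original post; a library implementation would normally be used instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Row for the empty prefix of a: reaching b[:j] takes j insertions.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("dlt-rly-tracker-3.datto.com", "dlt-rly-tracker-5.datto.com"))  # 1
print(levenshtein("pw-file", "pw-file"))  # 0
print(levenshtein("pw-file", "-"))  # 6
```

Under this definition the two tracker hostnames differ by a single substitution, so the Distance values in the sample output above are presumably placeholders that only illustrate the desired layout.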

Answer 1

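The following solution iterates over both data frames and builds a new dictionary with the required data. You can then convert this dictionary into a data frame. Let me know if this helps!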

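    # Assumption: NLTK's edit_distance stands in for the undefined Levenstein() call
    from nltk import edit_distance

    # df and df1 are the two data frames, each with a Hostname column
    dist = {}
    for rowname, row in df.iterrows():
        for rowname1, row1 in df1.iterrows():
            L = edit_distance(row.Hostname, row1.Hostname)
            # key: "index-index", value: (distance, "hostname,hostname")
            dist[f"{rowname}-{rowname1}"] = (L, f"{row.Hostname},{row1.Hostname}")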
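
The answer leaves the final conversion step open. One possible way to do it, sketched under the assumption that the dictionary maps an "index-index" key to a (distance, hostnames) tuple as in the loop above, is `pd.DataFrame.from_dict` with `orient="index"`. The small `dist` dictionary below is a hand-made stand-in, not output from the original data.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary built by the loop above:
# key = "index-index", value = (distance, "hostname,hostname").
dist = {
    "85608-34067": (1, "dlt-rly-tracker-3.datto.com,dlt-rly-tracker-5.datto.com"),
    "88922-96798": (0, "pw-file,pw-file"),
    "88922-52901": (6, "pw-file,-"),
}

# Each dictionary key becomes a row label and each tuple becomes a row;
# the row labels are then moved into an "Indexes" column.
result = (
    pd.DataFrame.from_dict(dist, orient="index", columns=["Distance", "Hostnames"])
    .rename_axis("Indexes")
    .reset_index()
)
print(result)
```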

Answer 2

Here is my solution.


First you need to create the two DataFrames. I assume they are called:

df1

df2

import pandas as pd
from nltk import edit_distance
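
The rest of this answer is not present in the text, so the following is only a sketch of how these imports could be used, not the answerer's actual code. It assumes `df1` and `df2` each have a `Hostname` column, uses tiny hand-picked samples from the question, and pairs rows with `itertools.product`.

```python
from itertools import product

import pandas as pd
from nltk import edit_distance

# Small hypothetical samples taken from the question's data.
df1 = pd.DataFrame(
    {"Hostname": ["dlt-rly-tracker-3.datto.com", "pw-file"]},
    index=[85608, 88922],
)
df2 = pd.DataFrame(
    {"Hostname": ["dlt-rly-tracker-5.datto.com", "pw-file", "-"]},
    index=[34067, 96798, 52901],
)

# Series.items() yields (index label, hostname) pairs; product() pairs
# every row of df1 with every row of df2.
pairs = product(df1["Hostname"].items(), df2["Hostname"].items())
rows = [
    {
        "Indexes": f"{i1}-{i2}",
        "Distance": edit_distance(h1, h2),
        "Hostnames": f"{h1},{h2}",
    }
    for (i1, h1), (i2, h2) in pairs
]
result = pd.DataFrame(rows, columns=["Indexes", "Distance", "Hostnames"])
print(result)
```

The number of pairs is len(df1) * len(df2), so this grows quickly. Because many hostnames repeat in the sample data, computing each distinct pair of hostnames once and reusing the result may be worth considering for larger inputs.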

Computing the Levenshtein distance between two string columns from two different data frames
