Replace elements in a pandas column whose distance metric exceeds a threshold

Time: 2019-11-20 09:20:45

Tags: pandas dataframe for-loop

Consider a DataFrame:

company  |  label 

  comp1       fashion
  comp2       fashionitem
  comp3       fashionable
  comp4       auto
  comp5       autoindustry
  comp6       automobile
  comp6       food
  comp7       delivery

I would like to clean up the labels a bit, and for that I am using a string distance:

from difflib import SequenceMatcher

def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()

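The question is: how do I write a function that applies the distance function to every pair of elements in the label column, and then replaces all similar elements (those whose distance value exceeds some threshold) with the shortest string among them?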

The result should look something like this:

company  |  label 

  comp1       fashion
  comp2       fashion
  comp3       fashion
  comp4       auto
  comp5       auto
  comp6       auto
  comp6       food
  comp7       delivery

I was thinking of using two nested for loops, but my dataset may be large. Is there an efficient way to do this?

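Edit: While reading the answers below, I realized I had made a mistake. The total number of entries (companies) is large, but the total number of unique labels is small, fewer than 1000. I guess I could take df.label.unique() and work with that.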

1 Answer:

Answer 0: (score: 1)

Idea

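The idea of this approach is to build an adjacency matrix by thresholding the matrix of pairwise ratios, build a graph from that adjacency matrix, and take its connected components as the clusters. Getting exactly the desired output can be tricky, but it can be achieved with an (absolute) threshold of 0.49.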
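As a small illustration of why the threshold might sit just below 0.5 (this is an inference from the sample data, not something stated in the answer), note that `SequenceMatcher.ratio()` is a similarity score: it returns 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The variable names below are only for illustration.

```python
from difflib import SequenceMatcher

# ratio() returns 2 * M / T: M is the number of matched characters,
# T is the combined length of both strings. Higher means more similar.
short_label, long_label = 'auto', 'autoindustry'
ratio = SequenceMatcher(None, short_label, long_label).ratio()

# Here M = 4 ("auto") and T = 4 + 12 = 16, so the ratio is 0.5.
print(ratio)          # 0.5
print(ratio > 0.49)   # True: the pair is linked at a 0.49 threshold
print(ratio > 0.5)    # False: a strict 0.5 cutoff would separate the pair
```

So with a strict `>` comparison, any cutoff of 0.5 or higher would leave 'autoindustry' outside the 'auto' cluster for this sample.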

Setup

from difflib import SequenceMatcher

import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['comp1', 'fashion'],
                        ['comp2', 'fashionitem'],
                        ['comp3', 'fashionable'],
                        ['comp4', 'auto'],
                        ['comp5', 'autoindustry'],
                        ['comp6', 'automobile'],
                        ['comp6', 'food'],
                        ['comp7', 'delivery']], columns=['company', 'label'])


def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()

Code

# get unique labels
labels = df['label'].unique()

# compute ratios
result = np.array([[distance(li, lj) for lj in labels] for li in labels])

# set diagonal to zero (no self-loops); works for any number of unique labels
np.fill_diagonal(result, 0)

# build adjacency matrix
adjacency_matrix = (result > 0.49).astype(int)

# create graph
dg = nx.from_numpy_array(adjacency_matrix, create_using=nx.Graph)

# create mapping dictionary from connected components
mapping = {}
for component in nx.connected_components(dg):
    group = labels[np.array(list(component))]
    value = min(group, key=len)
    mapping.update({label: value for label in group})

result = df.assign(label=df.label.map(mapping))

print(result)

Output

  company     label
0   comp1   fashion
1   comp2   fashion
2   comp3   fashion
3   comp4      auto
4   comp5      auto
5   comp6      auto
6   comp6      food
7   comp7  delivery
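For comparison, the same clustering idea can be sketched with the standard library only, using a small union-find in place of the graph. This is a minimal sketch, not the answer's method: `similarity` and `build_mapping` are hypothetical names, and both argument orders are checked on the assumption that `ratio()` is not guaranteed to be symmetric.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Same measure as distance() above: 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()


def build_mapping(labels, threshold=0.49):
    """Map each label to the shortest label in its similarity cluster."""
    unique = list(dict.fromkeys(labels))  # unique labels, original order kept
    parent = {label: label for label in unique}

    def find(label):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    # Link every pair of unique labels whose ratio exceeds the threshold.
    for a, b in combinations(unique, 2):
        if similarity(a, b) > threshold or similarity(b, a) > threshold:
            parent[find(a)] = find(b)

    # Collect clusters, then pick the shortest member as the replacement.
    clusters = {}
    for label in unique:
        clusters.setdefault(find(label), []).append(label)
    return {label: min(group, key=len)
            for group in clusters.values() for label in group}


labels = ['fashion', 'fashionitem', 'fashionable', 'auto',
          'autoindustry', 'automobile', 'food', 'delivery']
mapping = build_mapping(labels)
print(mapping)
```

The resulting dictionary can be passed to `df.label.map(mapping)` in the same way as above. As with connected components, the linking is transitive: 'autoindustry' and 'automobile' end up in one cluster because each is similar to 'auto', even though their ratio to each other is below the threshold.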