Consider a data frame:
company | label
comp1 fashion
comp2 fashionitem
comp3 fashionable
comp4 auto
comp5 autoindustry
comp6 automobile
comp6 food
comp7 delivery
I would like to clean up the labels a bit, and for that I am using a string distance:
from difflib import SequenceMatcher
def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()
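The question is: how do I write a function that applies the distance function to every pair of elements in the label column, and then replaces all similar elements (those whose distance score is above some threshold) with the shortest string among them?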
The result should look something like:
company | label
comp1 fashion
comp2 fashion
comp3 fashion
comp4 auto
comp5 auto
comp6 auto
comp6 food
comp7 delivery
I was thinking of writing two nested for loops, but my dataset may be large. Is there an efficient way to do this?
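Edit: While reading the replies below, I realized I made a mistake. The total number of entries (companies) is large, but the total number of unique labels is small, fewer than 1000. I suppose one could call df.label.unique() and work with that.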
Answer 0: (score: 1)
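Idea
The idea is to build an adjacency matrix from the thresholded ratio matrix, build a graph from that adjacency matrix, and take its connected components as the clusters. Getting exactly the desired output can be tricky, but it works with an (absolute) threshold of 0.49.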
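To see why the threshold sits just under 0.5, it helps to recall that SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings. A short check using only the question's distance function (the rounded values are just for display):

```python
from difflib import SequenceMatcher


def distance(a, b):
    # Despite the name, this is a similarity score: 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()


# 'auto' is matched in full (M = 4), but the long partner inflates T.
print(distance('auto', 'autoindustry'))               # 2 * 4 / 16 = 0.5
print(round(distance('auto', 'automobile'), 3))       # 2 * 4 / 14 -> 0.571
print(round(distance('fashion', 'fashionitem'), 3))   # 2 * 7 / 18 -> 0.778
```

Since 'auto' against 'autoindustry' scores exactly 0.5, a strict cut at 0.5 would leave that pair unlinked, while 0.49 keeps it.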
Setup
from difflib import SequenceMatcher
import networkx as nx
import numpy as np
import pandas as pd
df = pd.DataFrame(data=[['comp1', 'fashion'],
                        ['comp2', 'fashionitem'],
                        ['comp3', 'fashionable'],
                        ['comp4', 'auto'],
                        ['comp5', 'autoindustry'],
                        ['comp6', 'automobile'],
                        ['comp6', 'food'],
                        ['comp7', 'delivery']], columns=['company', 'label'])

def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()
Code
# get unique labels
labels = df['label'].unique()
# compute ratios
result = np.array([[distance(li, lj) for lj in labels] for li in labels])
# set diagonal to zero (works for any number of unique labels, not just 8)
np.fill_diagonal(result, 0)
# build adjacency matrix
adjacency_matrix = (result > 0.49).astype(int)
# create graph
dg = nx.from_numpy_array(adjacency_matrix, create_using=nx.Graph)
# create mapping dictionary from connected components
mapping = {}
for component in nx.connected_components(dg):
    group = labels[np.array(list(component))]
    value = min(group, key=len)
    mapping.update({label: value for label in group})
result = df.assign(label=df.label.map(mapping))
print(result)
Output
company label
0 comp1 fashion
1 comp2 fashion
2 comp3 fashion
3 comp4 auto
4 comp5 auto
5 comp6 auto
6 comp6 food
7 comp7 delivery
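For a larger set of unique labels, the same idea can be sketched without the dense ratio matrix by adding graph edges directly. This is only an illustration under a few assumptions: merge_similar_labels is a hypothetical helper name, each pair is scored once in a single argument order (ratio() can depend on the order of its arguments, whereas the matrix above scores both orders), and ties in length are broken alphabetically.

```python
from difflib import SequenceMatcher
from itertools import combinations

import networkx as nx
import pandas as pd


def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()


def merge_similar_labels(df, column='label', threshold=0.49):
    """Hypothetical helper: map each label to the shortest label in its cluster."""
    labels = df[column].unique()

    # One node per unique label; an edge wherever the ratio exceeds the threshold.
    graph = nx.Graph()
    graph.add_nodes_from(labels)
    for a, b in combinations(labels, 2):
        # Assumption: scoring one direction per pair is good enough here.
        if distance(a, b) > threshold:
            graph.add_edge(a, b)

    # Every label in a connected component maps to that component's shortest label.
    mapping = {}
    for component in nx.connected_components(graph):
        shortest = min(component, key=lambda s: (len(s), s))
        mapping.update({label: shortest for label in component})
    return df.assign(**{column: df[column].map(mapping)})


df = pd.DataFrame(data=[['comp1', 'fashion'],
                        ['comp2', 'fashionitem'],
                        ['comp3', 'fashionable'],
                        ['comp4', 'auto'],
                        ['comp5', 'autoindustry'],
                        ['comp6', 'automobile'],
                        ['comp6', 'food'],
                        ['comp7', 'delivery']], columns=['company', 'label'])

cleaned = merge_similar_labels(df)
print(cleaned)
```

With fewer than 1000 unique labels this is at most about half a million ratio() calls, regardless of how many company rows the frame has, because the final map step is a plain dictionary lookup.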