Question

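I receive data from a data center and have to clean it up and make it useful. My biggest problem is one column, call it "service_description". Say, for example, that the data center belongs to a hair salon. This column is filled in by hand (a free-text box) and holds a huge amount of data (billions). Here is a small sample: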

service description

washed the haair 
hair washed and dried
used shampoo on har
nails manicure
nail paint
nail pant
paint the nails

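What I need is a script that analyzes each row and assigns it a specific category, so that rows of the same category end up grouped together. "hair" can be the category for the first three rows, since it recurs in all of them, and "nail" is the category for the rest, keeping in mind that the category word itself may be misspelled.

Result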

service description          possible categories

washed the haair                       hair
hair washed and dried                  hair
used shampoo on har                    hair
nails manicure                         nail
nail paint                             nail
nail pant                              nail
paint the nails                        nail

Answer 1

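I assume your categories are a fixed lookup. I would split the string on whitespace; for each part, I would go through all the items in your category lookup and pick the one with the smallest Levenshtein distance.

Some references: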
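A minimal sketch of that idea in Python, in case it helps. The category list, the `categorize` helper and the `max_distance` cutoff are my own assumptions for illustration, not something from your data; the distance function is the usual two-row dynamic-programming version.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Assumed fixed lookup; replace with your real category list.
CATEGORIES = ["hair", "nail"]


def categorize(description, categories=CATEGORIES, max_distance=1):
    """Return the category closest to any word in the description.

    max_distance is a hypothetical cutoff: if no word is within that
    many edits of a category, the row is left uncategorized (None).
    """
    best = None  # (distance, category)
    for token in description.lower().split():
        for category in categories:
            distance = levenshtein(token, category)
            if best is None or distance < best[0]:
                best = (distance, category)
    if best is not None and best[0] <= max_distance:
        return best[1]
    return None


if __name__ == "__main__":
    sample = [
        "washed the haair",
        "hair washed and dried",
        "used shampoo on har",
        "nails manicure",
        "nail paint",
        "nail pant",
        "paint the nails",
    ]
    for row in sample:
        print(f"{row:<30} {categorize(row)}")
```

With short category words like these, a cutoff of 1 already catches "haair", "har" and "nails"; a looser cutoff would start matching unrelated short words, so it is worth tuning against a labeled sample.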

http://en.wikipedia.org/wiki/Levenshtein_distance

http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm

