I have a dataset that looks like this:
name      col1            col2                     col13
company1  Banking         Finance                  B&F
company2  Utilities       Utilities                NaN
company3  Transportation  Pipeline Transportation  Utilities
company4  Consulting      Tech                     Insurance
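etc.
What I need to do is compare the columns with each other and flag the rows whose values are not at all similar (or synonymous). For example, company4 has nothing similar, so I want to flag it. company3 looks somewhat similar, so I want to flag it as almost similar (a yellow flag), and rows that match get a green flag.
The output needs to look something like this: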
name      col1            col2                     col13      flag
company1  Banking         Finance                  B&F        green
company2  Utilities       Utilities                NaN        green
company3  Transportation  Pipeline Transportation  Utilities  yellow
company4  Consulting      Tech                     Insurance  red
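I know this seems like a very big question, but could someone give me a starting point for how to approach it? Which string matching algorithms could I use here?
Thanks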
Answer 0 (score: 0)
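To start with, you can use ratio or partial_ratio from fuzzywuzzy to get the string similarity between the cells in the same row. Next, you can also use a lexical database such as WordNet (via nltk) to check whether the cells in a row are synonyms of one another. One caveat: the synonyms suggested for each word are not exhaustive and may not be comprehensive. You can see this in the output below, where Banking, Finance and B&F are still flagged red even though WordNet is used. Still, these two approaches may help you get started.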
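To get a feel for what the two scores measure before installing anything, here is a rough sketch that uses the standard library's difflib as a stand-in. The helper names simple_ratio and simple_partial_ratio are made up for this illustration, and fuzzywuzzy's own implementation may differ in details, so treat the numbers as indicative only.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two whole strings on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a, b):
    """Best score of the shorter string against every window of the same
    length in the longer string (hypothetical helper)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    windows = (
        longer[i:i + len(shorter)]
        for i in range(len(longer) - len(shorter) + 1)
    )
    return max(simple_ratio(shorter, window) for window in windows)


# The whole-string score is pulled down by the extra word "Pipeline"
print(simple_ratio("Transportation", "Pipeline Transportation"))          # 76
# The partial score finds "Transportation" inside the longer string
print(simple_partial_ratio("Transportation", "Pipeline Transportation"))  # 100
```

This is why a partial score can be the better fit when one cell is a longer description that contains the other.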
First install the dependencies:
pip install nltk fuzzywuzzy
Download WordNet:
python
>>> import nltk
>>> nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\wordnet.zip.
True
Run the script:
import pandas as pd
from nltk.corpus import wordnet as wn
from fuzzywuzzy import fuzz

df = pd.DataFrame({
    'name': ['company 1', 'company 2', 'company 3', 'company 4'],
    'col1': ['Banking', 'Utilities', 'Transportation', 'Consulting'],
    'col2': ['Finance', 'Utilities', 'Pipeline Transportation', 'Utilities'],
    'col3': ['B&F', 'NaN', 'Utilities', 'Insurance'],
})


def get_synonyms(word):
    """Return the lower-cased WordNet lemma names for `word`, without duplicates."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' ').lower())
    return synonyms


def check_flag(row):
    cells = [row['col1'], row['col2'], row['col3']]
    green_flag_threshold = 80

    # Green: any two cells are equal or their fuzzy ratio (0-100) exceeds the threshold
    for i, j in [(0, 1), (1, 2), (2, 0)]:
        if cells[i] == cells[j] or fuzz.ratio(cells[i], cells[j]) > green_flag_threshold:
            return 'green'

    # Yellow: a word in one cell appears among the WordNet (nltk) synonyms
    # of a different cell. A cell is never compared with its own synonyms.
    synonyms = [get_synonyms(cell) for cell in cells]
    for i, cell in enumerate(cells):
        other_synonyms = set().union(*(s for j, s in enumerate(synonyms) if j != i))
        if any(word.lower() in other_synonyms for word in cell.split()):
            return 'yellow'
    return 'red'


df['flag'] = df.apply(check_flag, axis=1)
print(df)
Result:
             col1                     col2       col3       name    flag
0         Banking                  Finance        B&F  company 1     red
1       Utilities                Utilities        NaN  company 2   green
2  Transportation  Pipeline Transportation  Utilities  company 3  yellow
3      Consulting                Utilities  Insurance  company 4     red
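If you have more than three columns, the pairwise comparison can be generalized with itertools.combinations. The following is only a sketch under a few assumptions: it uses difflib from the standard library instead of fuzzywuzzy, it replaces the WordNet check with a second score band for yellow, the function names are made up, and the yellow threshold of 50 is an arbitrary value you would need to tune on your data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def flag_row(cells, green_threshold=80, yellow_threshold=50):
    """Flag a row by the best similarity score among all pairs of its cells.

    Both thresholds are assumptions, not values taken from any library.
    """
    # Skip missing cells; the sample data stores them as the string 'NaN'
    values = [
        c for c in cells
        if isinstance(c, str) and c.strip() and c != "NaN"
    ]
    # default=0 covers rows with fewer than two cells left to compare
    best = max(
        (similarity(a, b) for a, b in combinations(values, 2)),
        default=0,
    )
    if best > green_threshold:
        return "green"
    if best > yellow_threshold:
        return "yellow"
    return "red"


rows = {
    "company2": ["Utilities", "Utilities", "NaN"],
    "company3": ["Transportation", "Pipeline Transportation", "Utilities"],
    "company4": ["Consulting", "Tech", "Insurance"],
}
for name, cells in rows.items():
    print(name, flag_row(cells))
```

With a DataFrame you could apply it row by row, for example df[['col1', 'col2', 'col3']].apply(lambda r: flag_row(list(r)), axis=1). Note that a purely string-based score still cannot tell that Banking and Finance are related, so a synonym source is needed for cases like that.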