String matching across different columns

Time: 2018-07-23 18:30:03

Tags: python string pandas scikit-learn nltk

I have a dataset that looks like this:

  name      col1            col2                      col13 
  company1  Banking         Finance                   B&F
  company2  Utilities       Utilities                 NaN
  company3  Transportation  Pipeline Transportation   Utilities
  company4  Consulting      Tech                      Insurance

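etc.

What I need to do is compare the columns of each row with one another and flag the rows whose values are not at all similar (or synonymous). For example, company4 has nothing similar across its columns, so I want to flag it. Company3 looks somewhat similar, so I would like to mark it as almost similar (yellow flag), and rows whose values match get a green flag.

The output needs to look something like this: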
  name      col1            col2                      col13       flag 
  company1  Banking         Finance                   B&F          green
  company2  Utilities       Utilities                 NaN          green
  company3  Transportation  Pipeline Transportation   Utilities   yellow
  company4  Consulting      Tech                      Insurance    red

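I know this seems like a very big problem, but could someone give me a starting point for how to approach it? Which string matching algorithms could I use here?

Thanks

1 answer:

Answer 0: (score: 0)

First, you can use ratio or partial_ratio from fuzzywuzzy to get the string similarity between cells in the same row. Next, you can use a lexical database such as WordNet (available through nltk) to check whether the cells in a row are synonyms of one another. Note that the synonyms suggested for each word are not exhaustive. We can see this with WordNet below, where Banking, Finance and B&F end up flagged red. Still, these two approaches may help you get started.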
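To make the difference between a full ratio and a partial ratio concrete, here is a minimal sketch built on the standard library's difflib rather than on fuzzywuzzy itself. The function names simple_ratio and simple_partial_ratio are hypothetical stand-ins, and the sliding-window partial match is a simplification, so scores from the real library may differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two whole strings on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_partial_ratio(a, b):
    """Best score of the shorter string against any same-length
    window of the longer string (simplified partial matching)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


# company3 from the question: similar, but not identical
print(simple_ratio('Transportation', 'Pipeline Transportation'))          # 76
print(simple_partial_ratio('Transportation', 'Pipeline Transportation'))  # 100
```

The full ratio stays below a threshold of 80 because of the extra word "Pipeline", while the partial score reaches 100 because one value is contained in the other. That gap is one possible signal for an "almost similar" (yellow) row.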

  

First, install the dependencies:

pip install nltk fuzzywuzzy
  

Download WordNet:

python
>>> import nltk
>>> nltk.download('wordnet')
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
True
  

Run the script:

import pandas as pd

from nltk.corpus import wordnet as wn
from fuzzywuzzy import fuzz

df = pd.DataFrame({
    'name': ['company 1', 'company 2', 'company 3', 'company 4'],
    'col1': ['Banking', 'Utilities', 'Transportation', 'Consulting'],
    'col2': ['Finance', 'Utilities', 'Pipeline Transportation', 'Utilities'],
    'col3': ['B&F', 'NaN', 'Utilities', 'Insurance'],
})

def get_synonyms(word):
    """Return the de-duplicated, lowercased WordNet lemma names for `word`."""
    synonym_list = []
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ').lower()
            if name not in synonym_list:
                synonym_list.append(name)
    return synonym_list

def check_flag(row):
    cells = [row['col1'], row['col2'], row['col3']]
    pairs = [(0, 1), (1, 2), (2, 0)]

    green_flag_threshold = 80

    # Fuzzy ratio for every pair of cells in the row
    ratios = [fuzz.ratio(cells[i], cells[j]) for i, j in pairs]

    # Green: any two cells are identical or nearly identical
    if any(cells[i] == cells[j] for i, j in pairs) or max(ratios) > green_flag_threshold:
        return 'green'

    # Yellow: a word in one cell is a WordNet synonym of a different cell
    syn_lists = [get_synonyms(cell) for cell in cells]
    for i, cell in enumerate(cells):
        other_syns = [s for j, syns in enumerate(syn_lists) if j != i for s in syns]
        for word in cell.split():
            if word.lower() in other_syns:
                return 'yellow'

    return 'red'

df['flag'] = df.apply(check_flag, axis=1)

print(df)
  

Result:

             col1                     col2       col3       name    flag
0         Banking                  Finance        B&F  company 1     red
1       Utilities                Utilities        NaN  company 2   green
2  Transportation  Pipeline Transportation  Utilities  company 3  yellow
3      Consulting                Utilities  Insurance  company 4     red
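The sample frame stores the missing value as the string 'NaN', whereas a frame loaded from a file would normally hold a real float NaN, on which string methods such as .split() are not available. As a rough sketch, assuming you want missing cells left out of the comparison, the values of a row could be filtered first. The helper names is_missing and present_cells are hypothetical.

```python
import math


def is_missing(value):
    """True for None, a float NaN, or a placeholder string such as 'NaN'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in {'', 'nan'}


def present_cells(values):
    """Keep only the cells that hold an actual label."""
    return [v for v in values if not is_missing(v)]


# company2 from the question: two labels and one missing value
print(present_cells(['Utilities', 'Utilities', float('nan')]))
```

With this filtering, a row like company2 is compared on its two real labels only, instead of depending on how the placeholder happens to be stored.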
