Question

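I have two dataframes, df1 and df2. df1 has 50 columns and df2 has more than 50. df1 has 13000 rows and a column named subjects that holds the names of all the subjects. df2 has 250 rows and, along with its 50+ columns, two columns named subject_code and subject_name.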

Here is an example of my datasets:

    df1 =
    index     subjects
    0         Biology
    1         Physicss
    2         Chemistry
    3         Biology
    4         Physics
    5         Physics
    6         Biolgy

    df2 =
    index     subject_name    subject_code
    0         Biology         BIO
    1         Physics         PHY
    2         Chemistry       CHE
    3         Medical         MED
    4         Programming     PRO
    5         Maths           MAT
    6         Literature      LIT

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:

    index     subjects        subject_code
    0         Biology         BIO
    1         Physics         PHY
    2         Chemistry       CHE
    3         Biology         BIO
    4         Physics         PHY
    5         Physics         PHY
    6         Biology         BIO

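What I ultimately want is to merge the subjects values in df1 with the subject_name values in df2. About 500 rows in df1 end up with NaN after I merge on the two columns, because the spelling of the subject differs slightly in those 500 rows. I have tried the solutions given in the following links, but they did not work for me: replace df index values with values from a list but ignore empty strings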

Python pandas: replace values multiple columns matching multiple columns from another dataframe

Here is my code:

    df_merged = pd.merge(df1_subject, df2_subjectname, left_on='subjects', right_on='subject_name')
    df_merged.head()

Can anyone tell me how to solve this? I have already spent 8 hours on this problem and still cannot solve it.

Cheers

Answer 1

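One problem you are running into is the misspellings. You can try to reconcile the spelling of the subjects between the dataframes using the difflib module and its get_close_matches method.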

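This code finds, for each subject in df1, the closest matching subject name in df2. I would update the dataframe column to reflect this, so that even a misspelled subject name ends up with the same spelling in both dataframes.

import difflib
import pandas as pd

# Replace each subject in df1 with its closest match among df2's subject names.
# Note: [0] raises IndexError when no name is close enough.
df1['subjects'] = df1.subjects.map(lambda x: difflib.get_close_matches(x, df2.subject_name)[0])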

After this, you can try the merge. It may solve your problem, but it would be easier to fix if you provided a reproducible example.
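The correction and the merge can be put together roughly as follows. This is only a minimal sketch on the sample data from the question: closest_subject is a hypothetical helper name, and leaving a value unchanged when nothing is close enough is an assumption made here, not something the question specifies.

```python
import difflib
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"subjects": ["Biology", "Physicss", "Chemistry", "Biology",
                                 "Physics", "Physics", "Biolgy"]})
df2 = pd.DataFrame({"subject_name": ["Biology", "Physics", "Chemistry", "Medical",
                                     "Programming", "Maths", "Literature"],
                    "subject_code": ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"]})

valid_names = df2["subject_name"].tolist()

def closest_subject(name):
    """Hypothetical helper: closest valid name, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(name, valid_names, n=1)
    return matches[0] if matches else name

# Fix the spelling first, then do a left merge so every df1 row is kept.
df1["subjects"] = df1["subjects"].map(closest_subject)
result = df1.merge(df2, left_on="subjects", right_on="subject_name", how="left")
result = result.drop(columns="subject_name")
```

With how="left", rows whose subject still has no counterpart in df2 are kept and simply get NaN in subject_code, which makes them easy to find afterwards.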

Answer 2

Correct the spelling, then merge...

import pandas as pd
import operator, collections

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order
df1 = pd.DataFrame({"subjects":
                    ["Biology","Physicss","Phsicss","Chemistry",
                     "Biology","Physics","Physics","Biolgy","navelgazing"]})
df2 = pd.DataFrame({"subject_name":
                    ["Biology","Physics","Chemistry","Medical",
                     "Programming","Maths","Literature"],
                    "subject_code":
                    ["BIO","PHY","CHE","MED","PRO","MAT","LIT"]})

Find the misspellings:

misspelled = set(df1.subjects) - set(df2.subject_name)

Find the subject that best matches each misspelling and build a dictionary -> {mis_sp: subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))
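To see what the letter-count measure in foo1 does and does not capture, here is a small standalone sketch. The names letter_count_distance and best_match are hypothetical and only used for illustration; the measure itself is the same Counter difference used above.

```python
import collections

def letter_count_distance(a, b):
    """Number of letters that differ between a and b, ignoring their order."""
    count_a = collections.Counter(a)
    count_b = collections.Counter(b)
    return sum((count_a - count_b).values()) + sum((count_b - count_a).values())

subjects = ["Biology", "Physics", "Chemistry", "Medical",
            "Programming", "Maths", "Literature"]

def best_match(word):
    """Return (candidate, distance) for the candidate with the smallest distance."""
    return min(((s, letter_count_distance(word, s)) for s in subjects),
               key=lambda pair: pair[1])

one_extra_letter = best_match("Physicss")   # one surplus "s"
one_missing_letter = best_match("Biolgy")   # one missing "o"
scrambled = letter_count_distance("Physics", "scisyhP")  # same letters, different order
unrelated = best_match("navelgazing")       # still returns some candidate
```

Two caveats follow from this: the measure cannot tell a word from its anagrams, and because min always returns something, even an unrelated word such as navelgazing is mapped to one of the subjects.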

Correct all the misspellings in df1

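def foo3(thing):
    return d.get(thing, thing)

# DataFrame.applymap is deprecated in recent pandas; mapping the one column does the same here
df3 = df1.copy()
df3["subjects"] = df3["subjects"].map(foo3)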

Merge

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')
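After a left merge like this one, it can help to check which rows still found no code. A small sketch of one way to do that, using the indicator option of merge on made-up frames (the names subjects_df and codes_df are hypothetical):

```python
import pandas as pd

subjects_df = pd.DataFrame({"subjects": ["Biology", "Physics", "navelgazing"]})
codes_df = pd.DataFrame({"subject_name": ["Biology", "Physics"],
                         "subject_code": ["BIO", "PHY"]})

# indicator=True adds a "_merge" column saying where each row came from.
checked = subjects_df.merge(codes_df, left_on="subjects", right_on="subject_name",
                            how="left", indicator=True)

# Rows marked "left_only" had no matching subject_name.
unmatched = checked.loc[checked["_merge"] == "left_only", "subjects"].tolist()
```

Anything listed in unmatched is a subject that would show up as NaN in subject_code and may need a manual correction.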

foo1 is probably adequate for this purpose, but there are better, more sophisticated algorithms for correcting spelling. Perhaps http://norvig.com/spell-correct.html

Just read @conner's solution. I did not know about difflib; using it would make a better foo1:

import difflib

def foo1(word, candidates):
    try:
        # ask for at most one match and take it
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError:
        # there isn't a close match
        return None
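How strict the difflib version is depends on the cutoff argument of get_close_matches, which defaults to 0.6. A short sketch of its effect, assuming the subject list from the question; the function name closest is hypothetical:

```python
import difflib

subjects = ["Biology", "Physics", "Chemistry", "Medical",
            "Programming", "Maths", "Literature"]

def closest(word, cutoff=0.6):
    """Return the best match scoring at least cutoff, or None if there is none."""
    matches = difflib.get_close_matches(word, subjects, n=1, cutoff=cutoff)
    return matches[0] if matches else None

typo = closest("Biolgy")                     # close enough at the default cutoff
unrelated = closest("navelgazing")           # nothing scores 0.6 or higher
strict = closest("Phsicss", cutoff=0.9)      # a stricter cutoff rejects this typo
lenient = closest("Phsicss")                 # the default cutoff accepts it
```

Unlike the letter-count version, this one can decline to guess, so an unrelated word stays unmatched instead of being forced onto a subject.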

Python: replace df1.col values with df2.col2 values
