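I have two DataFrames, df1 and df2. df1 has 50 columns and df2 has more than 50. In df1 I have 13,000 rows and a column named subjects that holds the names of all the subjects. In df2 I have 250 rows and, along with the 50+ other columns, two columns named subject_code and subject_name.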
Here is an example of my datasets:
df1 =
index subjects
0 Biology
1 Physicss
2 Chemistry
3 Biology
4 Physics
5 Physics
6 Biolgy
df2 =
index subject_name subject_code
0 Biology BIO
1 Physics PHY
2 Chemistry CHE
3 Medical MED
4 Programming PRO
5 Maths MAT
6 Literature LIT
My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:
index subjects subject_code
0 Biology BIO
1 Physics PHY
2 Chemistry CHE
3 Biology BIO
4 Physics PHY
5 Physics PHY
6 Biology BIO
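My end goal is to merge the subjects values in df1 with the subject_name values in df2. In about 500 rows of df1 the subject is spelled slightly differently, so after merging on the two columns I get NaN for those rows. I tried the solutions given in the following links, but they did not work for me:
replace df index values with values from a list but ignore empty strings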
Python pandas: replace values multiple columns matching multiple columns from another dataframe
Here is my code:
df_merged = pd.merge(df1_subject,df2_subjectname, left_on='subjects', right_on='subject_name')
df_merged.head()
Can anyone tell me how to solve this? I have already spent 8 hours on it and still cannot get it to work.
Cheers
Answer 0 (score: 0)
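One problem you are running into is the spelling errors. You can try using the difflib module and its get_close_matches method to reconcile the spelling of the subjects between the dataframes.
The code below returns, for each subject in df2, the closest matching subject in df1, and I update df2's column to reflect this. That way, even if a subject name is misspelled, it will have the same spelling in both dataframes. Note that the [0] raises an IndexError for any subject that has no close match.
import pandas as pd
import difflib
df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subjects)[0])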
After this, you can try the merge. It may solve your problem, but it would be easier to fix if you provided a reproducible example.
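A minimal sketch of the same idea, going the other way round (correcting df1 against the clean names in df2) and keeping the original value when nothing is close enough. The helper name closest_subject and the 0.6 cutoff (difflib's default) are assumptions for illustration, and the frames are rebuilt from the question's sample data:

```python
import difflib

import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"subjects": ["Biology", "Physicss", "Chemistry", "Biology",
                                 "Physics", "Physics", "Biolgy"]})
df2 = pd.DataFrame({
    "subject_name": ["Biology", "Physics", "Chemistry", "Medical",
                     "Programming", "Maths", "Literature"],
    "subject_code": ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"],
})


def closest_subject(name, candidates, cutoff=0.6):
    """Hypothetical helper: best match in candidates, or name itself if none is close."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name


candidates = df2["subject_name"].tolist()
# Fix the spelling in df1 first, then do an ordinary left merge
df1["subjects"] = df1["subjects"].map(lambda s: closest_subject(s, candidates))
result = (
    df1.merge(df2, left_on="subjects", right_on="subject_name", how="left")
       .drop(columns="subject_name")
)
print(result)
```

Using how="left" keeps every row of df1, so anything that could not be corrected shows up with a missing subject_code instead of silently disappearing.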
Answer 1 (score: 0)
Correct the spelling, then merge...
import pandas as pd
import operator, collections
# DataFrame.from_items was removed in pandas 1.0, so build the frames from dicts
df1 = pd.DataFrame({"subjects":
    ["Biology", "Physicss", "Phsicss", "Chemistry",
     "Biology", "Physics", "Physics", "Biolgy", "navelgazing"]})
df2 = pd.DataFrame({"subject_name":
    ["Biology", "Physics", "Chemistry", "Medical",
     "Programming", "Maths", "Literature"],
    "subject_code":
    ["BIO", "PHY", "CHE", "MED", "PRO", "MAT", "LIT"]})
Find the misspellings:
misspelled = set(df1.subjects) - set(df2.subject_name)
Find the subject that best matches each misspelling and build a dictionary -> {misspelled: subject_name}
difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        # compare character counts of the two words, ignoring order
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    # pick the candidate with the smallest difference
    return subject(min(temp, key = difference))
def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)
d = dict(foo2(misspelled))
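To get a feel for what the character-count comparison in foo1 measures, here is a small standalone sketch. The helper name char_count_distance is my own and is not part of the code above:

```python
import collections


def char_count_distance(a, b):
    """Count the characters present in one word but not the other, ignoring order."""
    count_a = collections.Counter(a)
    count_b = collections.Counter(b)
    # Counter subtraction keeps only positive counts, so both directions are needed
    return sum((count_a - count_b).values()) + sum((count_b - count_a).values())


print(char_count_distance("Biolgy", "Biology"))    # one missing "o"
print(char_count_distance("Physicss", "Physics"))  # one extra "s"
print(char_count_distance("listen", "silent"))     # same letters, different order
```

Because letter order is ignored, anagrams count as identical, and because foo1 takes the minimum over all candidates, it returns some subject even for a word like "navelgazing" that matches nothing well.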
Correct all the misspellings in df1
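def foo3(thing):
    return d.get(thing, thing)
# map only the subjects column (DataFrame.applymap is deprecated in recent pandas)
df3 = df1.assign(subjects=df1["subjects"].map(foo3))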
Merge
df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')
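foo1 is probably good enough for this purpose, but there are better, more sophisticated algorithms for correcting spelling, for example http://norvig.com/spell-correct.html
I just read @conner's solution. I didn't know about difflib; with it, this would be a better foo1: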
import difflib
def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError:
        # there isn't a close match
        return None
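With a matcher that returns None when nothing is close enough, names that cannot be corrected stay as they are and end up with a missing subject_code after a left merge. One way to list them, sketched here on a tiny made-up frame rather than the real data, is the indicator=True option of merge:

```python
import pandas as pd

# Tiny illustrative frames; "navelgazing" stands in for an uncorrectable value
df3 = pd.DataFrame({"subjects": ["Biology", "Physics", "navelgazing"]})
df2 = pd.DataFrame({
    "subject_name": ["Biology", "Physics", "Chemistry"],
    "subject_code": ["BIO", "PHY", "CHE"],
})

# indicator=True adds a _merge column saying where each row came from
merged = df3.merge(df2, left_on="subjects", right_on="subject_name",
                   how="left", indicator=True)

# rows found only in the left frame had no matching subject_name
unmatched = merged.loc[merged["_merge"] == "left_only", "subjects"]
print(unmatched.tolist())
```

The remaining names can then be fixed by hand or fed to a looser matcher.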