I have two pandas DataFrames. One contains a list of correctly spelled words:
[In]: df1
[Out]:
words
0 apple
1 phone
2 clock
3 table
4 clean
and the other contains misspelled words:
[In]: df2
[Out]:
misspelled
0 aple
1 phn
2 alok
3 garbage
4 appl
5 pho
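The goal is to replace the column of misspelled words in the second DataFrame using the list of correctly spelled words from the first DataFrame. The second DataFrame can contain repeated entries, can be a different size from the first, and can contain words that are not in the first DataFrame (or that are not similar enough to match).
I have been trying to use difflib.get_close_matches with some success, but it is not working perfectly.
This is what I have so far: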
x = list(map(lambda x: get_close_matches(x, df1.col1), df2.col1))
good_words = list(map(''.join, x))
l = np.array(good_words, dtype='object')
df2.col1 = pd.Series(l)
df2 = df2[df2.col1 != '']
After applying the transformation, the second DataFrame should look like this:
[In]: df2
[Out]:
0
0 apple
1 phone
2 clock
3 NaN
4 apple
5 phone
If no match is found, the row is replaced with NaN. My problem is that the result I get looks like this:
[In]: df2
[Out]:
misspelled
0 apple
1 phone
2 clockclean
3 NaN
4 apple
5 phone
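At the time of writing, I have not figured out why some of the words get merged. I suspect it has to do with difflib.get_close_matches matching different words that are similar in length and/or letters. So far, about 10%-15% of the words in the whole column end up combined like this.
Thanks in advance.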
Answer 0: (score: 3)
If you want to take the first value returned by get_close_matches, use next with iter, with np.nan as the default value for when there is no match:
x = [next(iter(x), np.nan)
     for x in map(lambda x: difflib.get_close_matches(x, df1.words), df2.misspelled)]
df2['col1'] = x
print (df2)
misspelled col1
0 aple apple
1 phn phone
2 alok clock
3 garbage NaN
4 appl apple
5 pho phone
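The merged words in the question appear to come from applying ''.join to the list that get_close_matches returns: by default that list can hold up to n=3 candidates, so two close candidates get concatenated into one string. A minimal sketch of the same idea as above, assuming a hypothetical helper named best_match and a new column named corrected (neither name is from the original posts), that asks for only the single best candidate with n=1:

```python
import difflib

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'words': ['apple', 'phone', 'clock', 'table', 'clean']})
df2 = pd.DataFrame({'misspelled': ['aple', 'phn', 'alok', 'garbage', 'appl', 'pho']})


def best_match(word, candidates, cutoff=0.6):
    """Return the single closest candidate, or NaN if nothing reaches the cutoff.

    Hypothetical helper for illustration; cutoff=0.6 is difflib's default.
    """
    # n=1 limits the result to at most one word, so nothing can be concatenated.
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else np.nan


candidates = df1['words'].tolist()

# With the default n=3 the result is a list that may hold several words;
# ''.join on such a list is what glues candidates together.
print(difflib.get_close_matches('aple', candidates))

df2['corrected'] = df2['misspelled'].map(lambda w: best_match(w, candidates))
print(df2)
```

Raising cutoff makes the matching stricter, at the cost of more NaN rows; the right value depends on how noisy the misspellings are.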
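Answer 1: (score: 1)
Another approach is to use pandas-dedupe.
Since you have a messy dataset and a canonical dataset (i.e. a gazetteer), you can perform gazetteer deduplication.
pandas-dedupe can be particularly powerful because it combines active learning with logistic regression and clustering. It gives the user a lot of control over how the deduplication is performed while keeping the manual effort to a minimum.
Example code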
import pandas as pd
import pandas_dedupe
clean_data = pd.DataFrame({'name': ['apple', 'phone', 'clock', 'table', 'clean']})
messy_data = pd.DataFrame({'name':['aple', 'phn', 'alok', 'garbage', 'appl', 'apple', 'clock', 'phone', 'phone']})
dd = pandas_dedupe.gazetteer_dataframe(
clean_data,
messy_data,
field_properties = 'name',
canonicalize=True,
)
# At this point, pandas-dedupe will ask you to label a few examples as duplicates or distinct.
# Once done, you hit finish and the output will look like this:
# name cluster id confidence canonical_name
# 0 aple 0.0 0.636356 apple
# 1 phn 1.0 0.712090 phone
# 2 alok 2.0 0.492138 clock
# 3 garbage NaN NaN NaN
# 4 appl 0.0 0.906788 apple
# 5 apple 0.0 0.921466 apple
# 6 clock 2.0 0.921466 clock
# 7 phone 1.0 0.921466 phone
# 8 phone 1.0 0.921466 phone
I know this question is old, but I hope this example will be useful to someone in the future :)