How to replace misspelled words in a pandas DataFrame

Time: 2019-06-07 05:24:06

Tags: python python-3.x pandas numpy dataframe

I have 2 pandas DataFrames. One contains a list of correctly spelled words:

[In]: df1
[Out]:
   words
0  apple
1  phone
2  clock
3  table
4  clean

And one contains misspelled words:

[In]: df2
[Out]:
   misspelled
0        aple
1         phn
2        alok
3     garbage
4        appl
5         pho

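The goal is to replace the column of misspelled words in the second DataFrame using the list of correctly spelled words from the first DataFrame. The second DataFrame can contain repeated entries, can differ in size from the first, and can contain words that are not in the first DataFrame (or that are not similar enough to match).

I have been trying to use difflib.get_close_matches with some success, but the results are not quite right.

Here is what I have so far: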

x = list(map(lambda x: get_close_matches(x, df1.col1), df2.col1))
good_words = list(map(''.join, x))
l = np.array(good_words, dtype='object')
df2.col1 = pd.Series(l)
df2 = df2[df2.col1 != '']

After applying the transformation, the second DataFrame should look like this:

[In]: df2
[Out]:
          0
0     apple
1     phone
2     clock
3       NaN
4     apple
5     phone

If no match is found, the row is replaced with NaN. My problem is that the result I get looks like this:

[In]: df2
[Out]:
    misspelled
0        apple
1        phone
2   clockclean
3          NaN
4        apple
5        phone

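At the time of writing, I have not figured out why some of the words get combined. I suspect it has to do with difflib.get_close_matches matching different words that are similar in length and/or letters. So far, about 10%-15% of the words in the whole column come out combined. Thanks in advance.

2 Answers:

Answer 0: (score: 3)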

If you only want the first value returned by get_close_matches, use next with iter, and pass np.nan as the default value for when nothing matches:

x = [next(iter(x), np.nan)
     for x in map(lambda x: difflib.get_close_matches(x, df1.words), df2.misspelled)]
df2['col1'] = x

print (df2)
  misspelled   col1
0       aple  apple
1        phn  phone
2       alok  clock
3    garbage    NaN
4       appl  apple
5        pho  phone
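As a minimal sketch of why the words get merged in the first place: get_close_matches returns a list of candidates (up to n=3 by default, best match first), so calling ''.join on that list concatenates every candidate into one string. The snippet below uses plain lists with the sample words from the question; the helper name first_close_match and its default parameter are illustrative assumptions, not part of difflib.

```python
from difflib import get_close_matches

# Sample data from the question, as plain lists.
words = ["apple", "phone", "clock", "table", "clean"]
misspelled = ["aple", "phn", "alok", "garbage", "appl", "pho"]

# get_close_matches returns a list of candidates, best match first.
# Joining that list glues all candidates together into one string.
candidates = get_close_matches("aple", words)
merged = "".join(candidates)


def first_close_match(word, vocabulary, cutoff=0.6, default=None):
    """Return only the best match for word, or default if nothing clears cutoff.

    Hypothetical helper for illustration; it is not part of difflib.
    """
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else default


corrected = [first_close_match(w, words) for w in misspelled]

print(candidates)
print(merged)
print(corrected)
```

With the DataFrames from the question, the same helper could be applied with something like df2['misspelled'].map(lambda w: first_close_match(w, df1['words'].tolist(), default=np.nan)). Raising cutoff above its default of 0.6 makes the matching stricter, at the cost of more unmatched rows.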


Answer 1: (score: 1)

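Another approach is to use pandas-dedupe.
Since you have a messy dataset and a canonical dataset (i.e. a gazetteer), you can perform gazetteer deduplication.

pandas-dedupe can be particularly powerful because it combines active learning with logistic regression and clustering. It gives the user a lot of control over how the deduplication is performed while keeping the manual work to a minimum.

Example code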

import pandas as pd
import pandas_dedupe

clean_data = pd.DataFrame({'name': ['apple', 'phone', 'clock', 'table', 'clean']})
messy_data = pd.DataFrame({'name':['aple', 'phn', 'alok', 'garbage', 'appl', 'apple', 'clock', 'phone', 'phone']})


dd = pandas_dedupe.gazetteer_dataframe(
    clean_data, 
    messy_data, 
    field_properties = 'name', 
    canonicalize=True,
    )

# At this point, pandas-dedupe will ask you to label a few examples as duplicates or distinct.   
# Once done, you hit finish and the output will look like this:   

# name      cluster id  confidence  canonical_name
# 0 aple    0.0          0.636356   apple
# 1 phn     1.0          0.712090   phone
# 2 alok    2.0          0.492138   clock
# 3 garbage NaN          NaN        NaN
# 4 appl    0.0          0.906788   apple
# 5 apple   0.0          0.921466   apple
# 6 clock   2.0          0.921466   clock
# 7 phone   1.0          0.921466   phone
# 8 phone   1.0          0.921466   phone

I know this question is old, but I hope this example will be useful to someone in the future :)