SparkSQL filtering (selecting with a WHERE clause) with multiple conditions

Asked: 2015-11-17 01:51:21

Tags: python sql apache-spark apache-spark-sql pyspark

Hi, I have the following issue:

numeric.registerTempTable("numeric"). 

All the values that I want to filter on are literal null strings, not N/A or Null values.

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

Unfortunately, numeric_filtered is always empty. I checked, and numeric does contain data that should pass these filters.

Here are some sample values:

Low   High   Normal
3.5   5.0    null
2.0   14.0   null
null  38.0   null
null  null   null
1.0   null   4.0

1 Answer:

Answer 0 (score: 21)

You are using a logical conjunction (AND). It means that every column has to be different from 'null' for a row to be included. Let's illustrate that, using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you tried follow exactly the same pattern. What you need here is a logical disjunction (OR):

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
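
If there are more columns to check, the same disjunction can be built programmatically instead of being written out by hand. A minimal sketch, assuming the numeric DataFrame and the literal-'null' convention from above:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

cols = ['LOW', 'HIGH', 'NORMAL']
# Builds (LOW != 'null') | (HIGH != 'null') | (NORMAL != 'null') from the list of names.
condition = reduce(or_, [col(c) != 'null' for c in cols])
numeric.where(condition).show()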

or, using raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
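
Note that these comparisons only work because the data stores the literal string 'null'. If the columns held real NULL values instead, a comparison against NULL evaluates to NULL rather than true, so such rows would never match the string check; the NULL-safe test is isNotNull(). A hedged sketch of that variant:

from pyspark.sql.functions import col

# Variant for real NULL values (not the literal string 'null'):
numeric_filtered = numeric.where(
    col('LOW').isNotNull() |
    col('HIGH').isNotNull() |
    col('NORMAL').isNotNull())
numeric_filtered.show()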


See also: Pyspark: multiple conditions in when clause
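
For completeness, a minimal sketch of combining several conditions inside when(), which is what the linked question covers; the all_null column name is only illustrative:

from pyspark.sql.functions import col, when

numeric.withColumn(
    'all_null',
    when((col('LOW') == 'null') &
         (col('HIGH') == 'null') &
         (col('NORMAL') == 'null'), True).otherwise(False)
).show()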
