Question

I'm new to PySpark, and I want to write a query like the ones we write in SQL or Hive:

select * from table1 where column like '%word1%'

I'm running the following command:

data = sqlCtx.sql('select * from table1 where column like '%word1%')

but I get an error such as:

NameError: name 'word1' is not defined

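Ideally, I'd like to have a condition like

select word_name from table2;

which returns a list of words. Whenever any of those words appears in any column of table1, I want to filter out those rows and put the remaining rows in a DataFrame.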

Can someone help me do this?

Thanks

Answer 1

Well, the like function works just as well in PySpark as it does in SQL, with both the DataFrame API and the SQL API. Example:

import statsmodels.api as sm
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
df = sqlContext.createDataFrame(duncan_prestige.data.reset_index())

    index   type    income  education   prestige
0   accountant  prof    62  86  82
1   pilot   prof    72  76  83
2   architect   prof    75  92  90
3   author  prof    55  90  76

DataFrame API:

df.filter(df['index'].like('%ilo%')).toPandas()

    index   type    income  education   prestige
0   pilot   prof    72  76  83

Or using SQL:

df.registerTempTable('df')
sqlContext.sql("select * from df d where d.index like '%ilo%' ").toPandas()

The same like condition also works as a join condition (a silly example, but it proves the point).
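
The NameError in the question most likely comes from quoting rather than from like itself: the inner single quotes end the Python string early, so word1 is read as a Python name. A minimal sketch of the fix, using only string handling so it runs without Spark (table1 and column are the question's placeholder names):

```python
# Double quotes outside, single quotes inside: the whole query is one Python string.
query = "select * from table1 where column like '%word1%'"

# Alternative: keep single quotes outside and escape the inner ones.
query_escaped = 'select * from table1 where column like \'%word1%\''

# Both spellings produce the same SQL text.
print(query == query_escaped)
```

Either string can then be passed to sqlCtx.sql(...) as in the question.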

Answer 2

This might be simpler:

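from pyspark.sql.functions import col
# Collect the distinct words as plain values (x[0]), not Row objects
input_list = tbl1.select('col1').distinct().rdd.map(lambda x: x[0]).collect()
tbl2.where(~col('col2').isin(input_list))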

Filter out rows in PySpark when any column contains a word from another table
