Question

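I can't use a filter on a DataFrame. I keep getting the error TypeError("condition should be string or Column").

I tried changing the filter to use a col object, but it still doesn't work.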

path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
schema = StructType([
  StructField("fromLocation", StringType(), True),
  StructField("toLocation", StringType(), True),
  StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10)  # works fine
display(answerthree)

I added a filter to the variable "answerthree" as follows:

answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)

It throws errors like the following: "cannot resolve 'productType' given input columns" and "condition should be string or Column".

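In short, I am trying to solve problem 3 from the link below using PySpark instead of Scala. The dataset is also provided at that URL. https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c

I should get the desired result only for rows where productType is 1.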

Answer 1

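Since there is no variable referencing the intermediate DataFrame, the simplest approach is a string condition. Apply it before select and groupBy, because productType is no longer a column once the data has been aggregated:

answerthree = df.filter("productType = 1")\
                .select("toLocation").groupBy("toLocation").count()\
                .sort(...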

Alternatively, you can use a DataFrame variable and a column-based filter, again applied while productType still exists:

filtered_df = df.filter(df['productType'] == 1)
answerthree = filtered_df.select("toLocation").groupBy("toLocation").count()\
                         .sort("count", ascending=False).take(10)
