AnalysisException: u"cannot resolve 'name' given input columns: [list]" with sqlContext in Spark

Time: 2016-08-18 10:57:02

Tags: python apache-spark linear-regression

I tried a simple example:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

This works fine, but when I try something similar:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)

it raises the error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [data_channel_is_tech, ...

Of course I imported LabeledPoint and LinearRegression.

What could be going wrong?

Even a much simpler case,

df_cleaned = df_cleaned.select("shares")

raises the same AnalysisException.

*Note: df_cleaned.printSchema() works fine.
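Since printSchema() works but select() fails, the mismatch is usually invisible characters in the column names. A quick check (a plain-Python sketch; the column list here is hypothetical, standing in for df.columns) is to print each name's repr(), which makes leading or trailing whitespace visible:

```python
# Hypothetical column names standing in for df.columns
cols = ["data_channel_is_tech", " timedelta", " shares"]

# repr() exposes whitespace that printSchema()'s output can hide
suspicious = [c for c in cols if c != c.strip()]
for c in suspicious:
    print(repr(c))  # e.g. ' timedelta'
```

In PySpark you would run the same comprehension over `df.columns`.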

5 Answers:

Answer 0 (score: 4):

I found the issue: some column names contain whitespace before the name itself. So

data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

worked. I could catch the whitespace with

assert " " not in ''.join(df.columns)

Now I'm looking for a way to remove the whitespace. Any ideas are much appreciated!
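One way to remove the whitespace (a sketch, not part of the original answer): strip every column name and rebuild the DataFrame with toDF(), which accepts replacement column names. The stripping itself is plain Python; the column list below is hypothetical:

```python
# Hypothetical column names standing in for df.columns
cols = [" timedelta", " shares", "data_channel_is_tech"]

# Strip leading/trailing whitespace from every name
cleaned = [c.strip() for c in cols]
print(cleaned)  # ['timedelta', 'shares', 'data_channel_is_tech']

# In PySpark this becomes:
# df = df.toDF(*[c.strip() for c in df.columns])
```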

Answer 1 (score: 3):

The header contains spaces or tabs; remove them and try again.

1) My sample script:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv(r'test.csv', header=True, sep='^')
print("#################################################################")
df.printSchema()  # printSchema() prints directly and returns None
df.createOrReplaceTempView("test")
re = spark.sql("select max_seq from test")
re.show()  # show() also prints directly
print("#################################################################")

2) The input file; here 'max_seq' has a trailing space, so we get the following exception:

Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date

Traceback (most recent call last):
  File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
    re=spark.sql("select max_seq from test")
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r

3) After removing the space after the 'max_seq' column name, it works fine:

Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date


17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
+-------+
|max_seq|
+-------+
|     10|
|     23|
|     22|
|     22|
+-------+
only showing top 20 rows

##############################################################
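Alternatively (my sketch, not part of the answer above), Spark SQL can reference a column whose name contains whitespace by quoting it with backticks; the query string just has to embed the exact padded name:

```python
# The padded column name from the failing example above
col = "max_seq "

# Backtick-quote it so Spark SQL resolves the exact (padded) name
query = "select `%s` from test" % col
print(query)  # select `max_seq ` from test

# In PySpark: spark.sql(query).show()
```

Stripping the header remains the cleaner fix; backticks only work around the padded name.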

Answer 2 (score: 0):

There were tabs in my input file, so removing the tabs and spaces from the header let the answer display correctly.

My example:

saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema= True, sep='\t')


saledf.printSchema()

root
|-- ProductID: string (nullable = true)
|-- Name: string (nullable = true)
|-- ProductNumber: string (nullable = true)

saledf.describe('ProductNumber').show()

 +-------+-------------+
 |summary|ProductNumber|
 +-------+-------------+
 |  count|          295|
 |   mean|         null|
 | stddev|         null|
 |    min|      BB-7421|
 |    max|      WB-H098|
 +-------+-------------+

Answer 3 (score: 0):

Even with no whitespace in the header, you also get this error when you don't specify a header for the CSV at all:

df = sqlContext.read.csv('data.csv')

So you need to change it to this:

df = sqlContext.read.csv('data.csv', header=True)
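The reason: when header=True is omitted, Spark treats the header row as data and auto-names the columns _c0, _c1, ..., so a name like 'name' cannot be resolved. A small sketch of what those default names look like:

```python
# Without header=True, Spark names CSV columns _c0, _c1, ...
n_columns = 3  # hypothetical column count
default_names = ["_c%d" % i for i in range(n_columns)]
print(default_names)  # ['_c0', '_c1', '_c2']
```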

Answer 4 (score: 0):

I recently ran into this while working with Azure Synapse Analytics; my error was the same:

AnalysisException: cannot resolve '`xxxxxx`' given input columns: [];;
'Filter ('passenger_count > 0)
+- Relation[] csv

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1364, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)

The error was caused by something wrong in our code or in the CSV file itself. Use this code to read the CSV file:

df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=";", inferSchema="true", header="true")

If you get stuck somewhere in Synapse or PySpark again, visit this site for error information: https://docs.actian.com/avalanche/index.html#page/User/Common_Data_Loading_Error_Messages.htm

For more information, see the documentation: https://spark.apache.org/docs/latest/api/python/