I am trying to use show() to display the result of groupBy() on the DataFrame created from my RDD. It gives the following error:
Py4JJavaError: An error occurred while calling o14287.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2669.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2669.0 (TID 3896, localhost, executor driver): java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 26 fields are required while 1 values are provided.
at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$makeFromJava$15$$anonfun$apply$15.applyOrElse(EvaluatePython.scala:184)
at org.apache.spark.sql.execution.python.EvaluatePython$.org$apache$spark$sql$execution$python$EvaluatePython$$nullSafeConvert(EvaluatePython.scala:208)
at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$makeFromJava$15.apply(EvaluatePython.scala:180)
My PySpark script:
import pyspark
from pyspark.sql import SparkSession
spark=SparkSession.builder.getOrCreate()
s3RDD=spark.sparkContext.textFile("file:///Users/mydir/Documents/Projects/Pyspark/MiscScripts/logfile.gz")
firstLine = s3RDD.first()
# sparkContext.parallelize converts the string into an RDD
parallelize = spark.sparkContext.parallelize([firstLine])
s3RDD=s3RDD.subtract(parallelize)
s3RDD=s3RDD.map(lambda x: x.split('\t'))
urlsDf=s3RDD.toDF()
#import pyspark.sql.functions as f
urlsDf.groupBy("_8").count().show()
Answer 0 (score: 0)
Here is my text file:
2018-04-12 23:55:43 MAA50-C1 89352 39.44.14.521 GET mycdn.com /mydir/new-my-dir-url-dangerous 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 ID=FOxPJf3rutG1qhi - Miss CZeom7P2yw7bYn5veotj8gS2GpDTWkxZdUDiJHFwBFPSusCXKC4j9A== mydomain.com HTTP 370 3.169 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:51 MAA50-C1 81103 39.44.14.521 GET mycdn.com /MYDIR/MYDIR-NEW-TEST1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=QOP645KHxGQcgXW - Miss 1wKt5erjuDVQNa7X-D - vKQeli3X1ZvE5g32D0H7vgLnq_aiVuNqDA== mydomain.com http 349 1.245 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:59 MAA50-C1 0 39.44.14.521 GET mycdn.com /MYDIR/MYDIR-NEW-TEST1 000 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 ID=OCjtSXeh7QwqLtE - Error 8c9OnlJYo_2jI6mBCMFNbtxv7NSV00NjjANS2r7ODqhAlkV3Ew-4AA== mydomain.com HTTP 371 19.992 10.130.24.151,%2010.140.65.140 - - Error HTTP/1.1 - -
2018-04-12 23:55:45 BOM52 64704 103.18.142.29 GET mycdn.com /mydir/mydir-new-test1 200 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_9_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/42.0.2311.90%2520Safari/537.36 - - RefreshHit UcaCxr82_Wgm-VZETVv0pxhCvoMAjO46JATyF8mBAZ0VPnGmFKGn-A== mydomain.com HTTP 312 0.022 - - - RefreshHit HTTP/1.1 - -
2018-04-12 23:56:38 SIN2 71625 13.228.207.150 GET mycdn.com / 200 - Mozilla/5.0%2520(WINDOWS;%2520U;%2520Windows%2520NT%25206.0;%2520en-US;%2520rv:1.9.1.6)%2520Gecko/20091201%2520Firefox/3.5.6%2520GTB5 - - Miss 5fGTvqY4zU-2DWBMPEvOOtaskdX-yPiwEu8RlR4fKfDRwLRetKYIlA== mydomain.com http 233 2.959 - - - Miss HTTP/1.1 - -
2018-04-12 23:55:41 MAA50-C1 67805 39.44.14.521 GET mycdn.com /MYDIR/MYDIR-NEW-TEST1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=EyTPvato7qq0qiW - Miss ZPtOvMKzHCvdS-HbAMsSTU5FfYzSmP8xnxM7KAHseJaZFMd6CykwwQ== mydomain.com http 338 1.828 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:52 MAA50-C1 62402 39.44.14.521 GET mycdn.com /MYDIR/MYDIR-NEW-TEST1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 ID=uGcBwdJhQC2V5sx - Miss 4DDdtWO63B8OBw5JQ29IDv5mdcTJVVLQ0R5PvBbv6YPQSNitxwSuaw== mydomain.com HTTP 356 2.675 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
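The error message says the schema expects 26 fields but some row provides only 1, which suggests that at least one line yields a single value when split on '\t'. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is a line without tab separators, such as a second header or comment line, or a blank line. The following is a minimal sketch of a width check; the names `EXPECTED_FIELDS`, `split_fields` and `has_expected_width`, as well as the sample lines, are hypothetical and do not come from the original script.

```python
EXPECTED_FIELDS = 26  # taken from the error message; adjust for your log format


def split_fields(line, sep="\t"):
    """Split one raw log line into its fields."""
    return line.rstrip("\n").split(sep)


def has_expected_width(fields, expected=EXPECTED_FIELDS):
    """Return True only when the row has as many fields as the schema expects."""
    return len(fields) == expected


# Local demonstration with plain Python lists (no Spark required).
sample_lines = [
    "#Fields: date time x-edge-location",    # hypothetical header line, no tabs
    "\t".join(str(i) for i in range(26)),    # well-formed row with 26 fields
    "",                                      # blank line
]
rows = [split_fields(line) for line in sample_lines]
good_rows = [r for r in rows if has_expected_width(r)]
bad_widths = sorted({len(r) for r in rows if not has_expected_width(r)})
```

In the script above, the same two functions could be applied as s3RDD.map(split_fields).filter(has_expected_width) before calling toDF(), so that rows with an unexpected number of fields are dropped instead of aborting the job. Counting the rejected rows first is worthwhile, since silently discarding lines can hide a wrong separator.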