groupBy().count().show() in PySpark

Time: 2018-05-25 16:57:51

Tags: python apache-spark pyspark

I am trying to call show() on the result of groupBy() on a DataFrame built from my RDD. It gives the following error:

Py4JJavaError: An error occurred while calling o14287.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2669.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2669.0 (TID 3896, localhost, executor driver): java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 26 fields are required while 1 values are provided.
    at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$makeFromJava$15$$anonfun$apply$15.applyOrElse(EvaluatePython.scala:184)
    at org.apache.spark.sql.execution.python.EvaluatePython$.org$apache$spark$sql$execution$python$EvaluatePython$$nullSafeConvert(EvaluatePython.scala:208)
    at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$makeFromJava$15.apply(EvaluatePython.scala:180)

My PySpark script:

import pyspark
from pyspark.sql import SparkSession

spark=SparkSession.builder.getOrCreate()

s3RDD=spark.sparkContext.textFile("file:///Users/mydir/Documents/Projects/Pyspark/MiscScripts/logfile.gz")

firstLine = s3RDD.first()

# sparkContext.parallelize converts the string into an RDD
parallelize = spark.sparkContext.parallelize([firstLine])

s3RDD=s3RDD.subtract(parallelize)
s3RDD=s3RDD.map(lambda x: x.split('\t'))

urlsDf=s3RDD.toDF()

#import pyspark.sql.functions as f

urlsDf.groupBy("_8").count().show() 
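The error says a row with 1 value reached a schema that expects 26, so a useful first check is how many fields each line yields when split on tabs. The following is only a minimal sketch of that check: `field_count_histogram` and the sample lines are hypothetical and not part of the script above.

```python
from collections import Counter

def field_count_histogram(lines, sep="\t"):
    """Count how many lines split into each number of fields."""
    return Counter(len(line.split(sep)) for line in lines)

# Made-up sample: one header-like line without tabs, two tab-separated rows.
sample = [
    "Fields: date time x-edge-location",
    "2018-04-12\t23:55:43\tMAA50-C1",
    "2018-04-12\t23:55:51\tMAA50-C1",
]
histogram = field_count_histogram(sample)
print(dict(histogram))

# With Spark, roughly the same check on the RDD of raw lines
# (before the split('\t') step) would be:
#   s3RDD.map(lambda x: len(x.split('\t'))).countByValue()
```

If the histogram shows more than one field count, some lines (for example header or comment lines) do not match the schema that toDF() inferred, which would be consistent with the exception shown.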

1 answer:

Answer 0 (score: 0)

This is my text file:

Version: 1.0

Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields

2018-04-12 23:55:43 MAA50-C1 89352 39.44.14.521 GET mycdn.com /mydir/new-my-dir-url-dangerous 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=FOxPJf3rutG1qhi - Miss CZeom7P2yw7bYn5veotj8gS2GpDTWkxZdUDiJHFwBFPSusCXKC4j9A== mydomain.com http 370 3.169 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:51 MAA50-C1 81103 39.44.14.521 GET mycdn.com /mydir/mydir-new-test1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=QOP645KHxGQcgXW - Miss 1wKt5erjuDVQNa7X-D-vKQeli3X1ZvE5g32D0H7vgLnq_aiVuNqDA== mydomain.com http 349 1.245 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:59 MAA50-C1 0 39.44.14.521 GET mycdn.com /mydir/mydir-new-test1 000 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=OCjtSXeh7QwqLtE - Error 8c9OnlJYo_2jI6mBCMFNbtxv7NSV00NjjANS2r7ODqhAlkV3Ew-4AA== mydomain.com http 371 19.992 10.130.24.151,%2010.140.65.140 - - Error HTTP/1.1 - -
2018-04-12 23:55:45 BOM52 64704 103.18.142.29 GET mycdn.com /mydir/mydir-new-test1 200 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_9_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/42.0.2311.90%2520Safari/537.36 - - RefreshHit UcaCxr82_Wgm-VZETVv0pxhCvoMAjO46JATyF8mBAZ0VPnGmFKGn-A== mydomain.com http 312 0.022 - - - RefreshHit HTTP/1.1 - -
2018-04-12 23:56:38 SIN2 71625 13.228.207.150 GET mycdn.com / 200 - Mozilla/5.0%2520(Windows;%2520U;%2520Windows%2520NT%25206.0;%2520en-US;%2520rv:1.9.1.6)%2520Gecko/20091201%2520Firefox/3.5.6%2520GTB5 - - Miss 5fGTvqY4zU-2DWBMPEvOOtaskdX-yPiwEu8RlR4fKfDRwLRetKYIlA== mydomain.com http 233 2.959 - - - Miss HTTP/1.1 - -
2018-04-12 23:55:41 MAA50-C1 67805 39.44.14.521 GET mycdn.com /mydir/mydir-new-test1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=EyTPvato7qq0qiW - Miss ZPtOvMKzHCvdS-HbAMsSTU5FfYzSmP8xnxM7KAHseJaZFMd6CykwwQ== mydomain.com http 338 1.828 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
2018-04-12 23:55:52 MAA50-C1 62402 39.44.14.521 GET mycdn.com /mydir/mydir-new-test1 200 - Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520WOW64;%2520rv:40.0)%2520Gecko/20100101%2520Firefox/40.0 id=uGcBwdJhQC2V5sx - Miss 4DDdtWO63B8OBw5JQ29IDv5mdcTJVVLQ0R5PvBbv6YPQSNitxwSuaw== mydomain.com http 356 2.675 10.130.24.151,%2010.140.65.140 - - Miss HTTP/1.1 - -
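The Fields line lists 26 names, which matches the "26 fields are required" in the error, while the Version and Fields lines themselves would not split into 26 tab-separated values. A possible fix is to drop every line that does not have the expected width before calling toDF(). The following is a hedged sketch, not a confirmed solution: `split_fields`, `is_data_row`, and `load_urls_df` are hypothetical names, the tab separator is taken from the question's split('\t'), and the check for a leading '#' assumes the header lines are comment lines in the real file, which the paste above does not show.

```python
EXPECTED_FIELDS = 26  # names on the Fields line; also the count in the error message

def split_fields(line, sep="\t"):
    """Split one raw log line into its fields."""
    return line.split(sep)

def is_data_row(fields, expected=EXPECTED_FIELDS):
    """True for rows of the expected width that are not '#' comment lines (assumed)."""
    return len(fields) == expected and not fields[0].startswith("#")

def load_urls_df(spark, path):
    """Hypothetical helper: build a DataFrame, skipping header and malformed lines."""
    rows = (
        spark.sparkContext.textFile(path)
        .map(split_fields)
        .filter(is_data_row)
    )
    return rows.toDF()

# Quick check of the filter without Spark, on made-up lines:
good = "\t".join(["x"] * EXPECTED_FIELDS)
header = "#Fields: date time x-edge-location"
kept = [f for f in map(split_fields, [header, good]) if is_data_row(f)]
```

With an existing SparkSession, `load_urls_df(spark, path).groupBy("_8").count().show()` would then run the same aggregation as the question, with `path` being the log file location. Filtering by width replaces the first()/subtract() step, which only removes lines identical to the first line and leaves any second header line in the data.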