How do I print a snippet of an RDD in spark-shell / pyspark?

Asked: 2015-06-29 12:35:50

Tags: apache-spark pyspark

When working in spark-shell, I often want to inspect an RDD (similar to using head in Unix).

For example:

scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> // how to inspect the readmeFile?

and...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> // how to inspect linesContainingSpark?

1 Answer:

Answer 0 (Score: 16)

I found out how to do this (here) and thought it would be useful for other users, so I'm sharing it here. take(x) selects the first x items, and foreach prints them:

scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
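
As a related aside (not part of the original answer, but standard RDD actions): first() is shorthand for take(1), and takeSample draws random elements instead of the leading ones. A quick sketch against the same file, where the first README line is "# Apache Spark" as shown in the output above:

scala> readmeFile.first()   // same as readmeFile.take(1)(0)
res0: String = # Apache Spark
scala> readmeFile.takeSample(false, 5).foreach(println)   // 5 random lines, sampled without replacement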

and...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
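
If you only want to know how many lines matched rather than see their contents, the standard count() action does that without printing anything (the actual number depends on your copy of the README):

scala> linesContainingSpark.count()   // triggers the computation and returns the number of matching lines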

And here is the equivalent using pyspark:

>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print(line)
... 
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a

>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print(line)
... 
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
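
A closing caution: collect().foreach(println) would also print an RDD, but collect() materializes the entire dataset on the driver, so take(n) is the safer choice for inspecting anything large. In the Scala shell:

scala> readmeFile.take(5).foreach(println)     // fetches only 5 elements to the driver
scala> readmeFile.collect().foreach(println)   // fetches everything; fine for a small README, dangerous on big data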