I have a Python script that depends on another file, which is also essential for other scripts, so I zipped that dependency and shipped it along with my spark-submit job. Unfortunately it does not seem to work. Here are my code snippets and the error I keep getting.
from pyspark.sql.session import SparkSession

def main(spark):
    # Load the two JSON datasets
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    employee.printSchema()
    employee.show()

    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()

    # Register temp views so the data can also be queried with SQL
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")

    # Inner join on the shared name column
    newDataFrame = employee.join(people, employee.name == people.name, how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This file is also an external dependency called by other modules. Here is the second script, which tries to execute it:
from pyspark.sql.session import SparkSession

def sampleTest(output):
    print(output)

if __name__ == "__main__":
    # Build the session against the standalone master
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()

    # --py-files is expected to have placed SparkPythonJob.zip on sys.path
    import SparkFileMerge

    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now I am executing the job with this command:
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
It gives me the following error.
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
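Given that ImportError, a quick sanity check is to list what the archive actually contains. A minimal sketch using only the standard library (the zip path is taken from the command above; the expected entry name SparkFileMerge.py is an assumption based on the import in the second script):

import zipfile

# For `import SparkFileMerge` to succeed, SparkFileMerge.py must sit at the
# top level of the archive, not inside a nested directory.
with zipfile.ZipFile("/home/varun/SparkPythonJob.zip") as zf:
    print(zf.namelist())  # expect something like ['SparkFileMerge.py']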
Any help would be highly appreciated.
Answer 0 (score: 0)
What are the contents of the zip? First, you should check that the first code snippet actually sits inside SparkPythonJob.zip as a file named SparkFileMerge.py, since that is the module name the import expects.
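If the archive checks out but the import still fails, an alternative (a minimal sketch, assuming the same paths as in the question) is to ship the dependency programmatically with SparkContext.addPyFile, which must be called before the module is imported:

from pyspark.sql.session import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark Application") \
    .getOrCreate()

# Distribute the zip to the executors and add it to sys.path on the driver;
# this has to happen before `import SparkFileMerge` runs.
spark.sparkContext.addPyFile("/home/varun/SparkPythonJob.zip")

import SparkFileMerge
print(SparkFileMerge.main(spark))

With this approach the job can be submitted without the --py-files flag.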