Here is the code I'm using:
from pyspark.sql.functions import lit

df = None
for category in file_list_filtered:
    data_files = os.listdir('HMP_Dataset/' + category)
    for data_file in data_files:
        print(data_file)
        temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)
        temp_df = temp_df.withColumn('class', lit(category))
        temp_df = temp_df.withColumn('source', lit(data_file))
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)
I get this error:
NameError                                 Traceback (most recent call last)
<ipython-input-4-4296b4e97942> in <module>
      9     for data_file in data_files:
     10         print(data_file)
---> 11         temp_df = spark.read.option('header', 'false').option('delimiter', ' ').csv('HMP_Dataset/'+category+'/'+data_file, schema = schema)
     12         temp_df = temp_df.withColumn('class', lit(category))
     13         temp_df = temp_df.withColumn('source', lit(data_file))

NameError: name 'spark' is not defined
How can I fix this?
Answer 0 (score: 1)
Try defining the spark variable before the loop runs:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

# Create a local SparkContext and wrap it in a SparkSession bound to `spark`
sc = SparkContext('local')
spark = SparkSession(sc)
Answer 1 (score: 1)
Initialize the SparkSession first, then use spark inside the loop.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# getOrCreate() reuses an existing session if one is already running
spark = SparkSession.builder.appName('app_name').getOrCreate()

df = None
for category in file_list_filtered:
    ...
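As a side note, the "initialize to None, then union in a loop" pattern in the question can be written more compactly with functools.reduce. A minimal sketch, using plain Python lists as stand-ins for DataFrames (list concatenation plays the role of DataFrame.union; the `frames` data is invented for illustration):

    from functools import reduce

    # Stand-ins for the per-file DataFrames built inside the loop;
    # in PySpark each element would be a DataFrame, not a list
    frames = [[1, 2], [3, 4], [5]]

    # Equivalent of:
    #   df = None
    #   for f in frames:
    #       df = f if df is None else df.union(f)
    combined = reduce(lambda a, b: a + b, frames)
    print(combined)  # [1, 2, 3, 4, 5]

With real DataFrames the lambda would be `lambda a, b: a.union(b)`; note that union matches columns by position, so all inputs should share the same schema, which holds here because every file is read with the same explicit schema.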