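I'm trying to save a PDF file to HDFS using PySpark, but I'm having trouble finding examples. The material out there mostly shows how to save CSV files, which I can do fine. Here is the code I've tried: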
from datetime import date,timedelta
import requests
import urllib
from pyspark.sql import SparkSession
from pyspark.sql import *
from pyspark import *
from pyspark import SparkFiles
#spark connection
spark = SparkSession.builder.master('yarn').getOrCreate()
sc = spark.sparkContext #I don't think this part works
today = (date.today() - timedelta(1)).strftime('%m.%d.%Y')
pdfLocation = "hdfs://nameservice1/Report/Report." + str(today) + ".pdf"
sc.addFile(pdfLocation)
link1 = "https://file.pdf"
response1 = requests.get(link1)
with open(SparkFiles.get(pdfLocation), 'wb') as f:
    f.write(response1.content)
I get a "file does not exist" error, but I thought I was creating the file on the sc.addFile line.
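
For context on the error: as far as I understand it, `sc.addFile` distributes a file that already exists to every node in the job, and `SparkFiles.get` expects the bare file name of such a file, so neither call creates anything in HDFS. Below is a minimal sketch of one possible workaround, not a definitive fix. It assumes the `hdfs` command-line client is installed and on the PATH of the machine running the script; the helper names (`report_hdfs_path`, `save_bytes_locally`, `hdfs_put_command`, `upload_pdf`) and the local file name `report.pdf` are hypothetical and only for illustration. The idea is to download the PDF to a local temporary file in binary mode and then copy it into HDFS with `hdfs dfs -put`.

```python
import subprocess
import tempfile
from datetime import date
from pathlib import Path


def report_hdfs_path(day: date) -> str:
    """Build the HDFS target path from the question for a given day."""
    return "hdfs://nameservice1/Report/Report." + day.strftime("%m.%d.%Y") + ".pdf"


def save_bytes_locally(content: bytes, directory: str) -> Path:
    """Write raw bytes to a local file (hypothetical name report.pdf)."""
    local_path = Path(directory) / "report.pdf"
    local_path.write_bytes(content)  # binary write, PDFs are not text
    return local_path


def hdfs_put_command(local_path: Path, hdfs_path: str) -> list:
    """Command that copies a local file into HDFS; -f overwrites the target."""
    return ["hdfs", "dfs", "-put", "-f", str(local_path), hdfs_path]


def upload_pdf(url: str, hdfs_path: str) -> None:
    """Download a PDF and copy it into HDFS (assumes the hdfs CLI is available)."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early on HTTP errors
    with tempfile.TemporaryDirectory() as tmp:
        local_path = save_bytes_locally(response.content, tmp)
        subprocess.run(hdfs_put_command(local_path, hdfs_path), check=True)


# Example call, left commented out because it needs network and HDFS access:
# from datetime import timedelta
# upload_pdf("https://file.pdf", report_hdfs_path(date.today() - timedelta(days=1)))
```

This sketch does not need a SparkSession at all, since a single PDF is just a blob of bytes rather than distributed data. Whether that fits depends on how the rest of the job is set up.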