I have a question about using PySpark with BeautifulSoup to parse HTML files (from HDFS) into CSV files, and then save the CSV files back to HDFS.
Here is the code and the error:
from bs4 import BeautifulSoup
html_path = "hdfs://.../user/root/input/.../index.html"
soup = BeautifulSoup(open(html_path))
Then I get the following error:
IOError Traceback (most recent call last)
<ipython-input-3-875680018b76> in <module>()
----> 1 soup = BeautifulSoup(open(html_path))
IOError: [Errno 2] No such file or directory: 'hdfs://.../user/root/input/.../index.html'
How can I fix this?
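A note on the likely cause: Python's built-in open() only reads the local filesystem, so it treats the hdfs:// URI as a literal local path, which would explain the Errno 2. Below is a minimal sketch of one possible workaround, not a definitive fix. It assumes Python 3, an existing SparkContext passed in as `sc`, and that bs4 is installed on the executors. The names `extract_title`, `to_csv_line`, and `html_to_csv`, and the choice of the page title as the extracted field, are hypothetical and only for illustration.

```python
import csv
import io


def extract_title(html):
    """Return the text of the <title> tag, or "" if the page has none.

    The title is only a placeholder for whatever fields are really needed.
    """
    # Imported here so that bs4 is only required where parsing actually runs.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


def to_csv_line(fields):
    """Format one row with the csv module so commas and quotes are escaped."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()


def html_to_csv(sc, input_path, output_path):
    """Read HTML from HDFS through Spark, parse it, write CSV lines to HDFS.

    input_path and output_path are hdfs:// URIs supplied by the caller.
    """
    # wholeTextFiles yields (file_path, file_content) pairs, one per file,
    # so each HTML document reaches BeautifulSoup as a single string.
    pages = sc.wholeTextFiles(input_path)
    lines = pages.map(lambda kv: to_csv_line([kv[0], extract_title(kv[1])]))
    lines.saveAsTextFile(output_path)
```

Calling html_to_csv(sc, input_path, output_path) with the real HDFS locations would run the whole pipeline. Note that saveAsTextFile writes a directory of part files rather than a single CSV file, and that wholeTextFiles loads each file fully into memory, which is usually fine for individual HTML pages.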