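I downloaded a .py file from a Jupyter notebook, and my goal is to set up Task Scheduler to run a daily scrape. The file (scrape.py) is meant to scrape data from a website and save the result as HTML (output_scraped.html).
The code is as follows: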
from bs4 import BeautifulSoup
import requests

# assign destination
url = 'a url'  # placeholder: the real URL is omitted

# Grab content of that url
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

# collect the link text inside each matching <article>
titles = []
levels = soup.find_all('article', {'class' : '1234'})
for level in levels:
    divs = level.find_all('a', {'class' : '5678'})
    for div in divs:
        titles.append(div.text)

# collect the hirer name inside each matching <article>
hirer = []
for level in levels:
    hirer_divs = level.find_all('span', {'class' : '9873'})
    for hirer_div in hirer_divs:
        hirer.append(hirer_div.text)

# collect the data-id attribute values
mylist = []
ids_final = soup.find_all(attrs={"data-id": '5tw287'})
for ifn in ids_final:
    mylist.append(ifn["data-id"])

# # Putting it all together
for one, two, three in zip(titles, hirer, mylist):
    # print() returns None, so there is nothing useful to assign
    print(one, two, three)
# In[16]:
# converting to html file
from nbconvert import HTMLExporter
import codecs
import nbformat
notebook_name = 'scrape.py'
output_file_name = 'output_scraped.html'
exporter = HTMLExporter()
output_notebook = nbformat.read(notebook_name, as_version=4)
output, resources = exporter.from_notebook_node(output_notebook)
codecs.open(output_file_name, 'w', encoding='utf-8').write(output)
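The above seems to run without any problems in the Jupyter notebook. However, when I run it as a .py file, it produces output up to the #Putting it all together section and then gives me this rather intimidating error: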
Traceback (most recent call last):
File "C:\Python38\lib\site-packages\nbformat\reader.py", line 14, in parse_json
nb_dict = json.loads(s, **kwargs)
File "C:\Python38\lib\json\__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "C:\Python38\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python38\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Python38\a-SEEK_scrape.py", line 166, in <module>
output_notebook = nbformat.read(notebook_name, as_version=4)
File "C:\Python38\lib\site-packages\nbformat\__init__.py", line 141, in read
return reads(f.read(), as_version, **kwargs)
File "C:\Python38\lib\site-packages\nbformat\__init__.py", line 73, in reads
nb = reader.reads(s, **kwargs)
File "C:\Python38\lib\site-packages\nbformat\reader.py", line 58, in reads
nb_dict = parse_json(s, **kwargs)
File "C:\Python38\lib\site-packages\nbformat\reader.py", line 17, in parse_json
raise NotJSONError(("Notebook does not appear to be JSON: %r" % s)[:77] + "...") from e
nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '#!/usr/bin/env python\n# coding: utf-8\...
>>>
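Is this because it isn't a JSON file? Why does this happen? If I find a way to convert it to JSON, would that achieve what I originally set out to do? Any help/pointers would be greatly appreciated. Thanks!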
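For context on the error: nbformat.read parses a notebook file, which is JSON (.ipynb), so passing it scrape.py raises the NotJSONError shown in the traceback. One possible way around it, sketched below, is to skip nbconvert and have the script write the scraped rows to the HTML file itself. This is only a minimal sketch using the standard library; the helper names rows_to_html and save_html and the column headers are hypothetical and not part of the original script.

```python
from html import escape


def rows_to_html(rows, headers=("Title", "Hirer", "ID")):
    """Render an iterable of row tuples as a minimal standalone HTML page.

    The header labels are assumed for illustration only.
    """
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head><body>\n"
        f"<table>\n<tr>{head}</tr>\n{body}\n</table>\n</body></html>\n"
    )


def save_html(rows, path="output_scraped.html"):
    """Write the rendered page to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(rows_to_html(rows))
```

In scrape.py this could replace the nbconvert section with a single call such as save_html(zip(titles, hirer, mylist)). Cell text is passed through html.escape so characters like & or < in the scraped strings do not break the page.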