After parsing, I save the data to a JSON file (that part works fine). Here is my code:
import json
import os
from newspaper import Article
import newspaper

# Start a local HTTP server on port 8000 in a separate window
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
http_server = 'http://localhost:8000/'
links = ''
path = "<path>"

for f in os.listdir(path):
    if f.endswith('.html'):
        links = http_server + path + f
    # Build a source from the link and list the articles it finds
    blog_post = newspaper.build(links)
    for article in blog_post.articles:
        print(article.url)
    # Download and parse the page itself
    article = Article(links)
    article.download('')
    article.parse()
    data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}
    json_data = json.dumps(data)
    # Save the extracted fields to a JSON file
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile)
Error message:
...\newspaper\Scripts\python.exe ".../parsing_newspaper/test1.py"
[Source parse ERR] http://localhost:8000/.../cnnpolitics-russian.html
Traceback (most recent call last):
  File "...\newspaper\lib\site-packages\newspaper\parsers.py", line 68, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 646, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105947)
  File "", line 0
lxml.etree.XMLSyntaxError:
You must `download()` an article before calling `parse()` on it!
Traceback (most recent call last):
  File ".../test1.py", line 26, in <module>
    article.parse()
  File "...\newspaper\lib\site-packages\newspaper\article.py", line 168, in parse
    raise ArticleException()
newspaper.article.ArticleException
Answer 0 (score: 1)
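Not sure whether this helps, but try moving everything that builds, downloads, and parses the article inside the if f.endswith('.html'): block, so that it runs only when links has just been set to an .html file.
Otherwise, if the first entry is not a file with an .html extension, you try to build from an empty string.
Or, if the first entry is a file with an .html extension but the second is not, you build the same file (at least) twice.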
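A minimal sketch of that restructuring, assuming the goal is one parsed record per .html file. The helper names html_links, parse_articles, and save_results are hypothetical and not part of newspaper. The sketch also assumes that download() should be called without the empty-string argument so the page is fetched from the URL, and that the server serves the files directly under the base URL.

```python
import json


def html_links(filenames, base_url="http://localhost:8000/"):
    """Return one URL per .html file name, skipping every other entry."""
    return [base_url + name for name in filenames if name.endswith(".html")]


def parse_articles(links):
    """Download and parse each link; needs the third-party newspaper package."""
    from newspaper import Article  # imported here so html_links works without it

    results = []
    for link in links:
        article = Article(link)
        # Assumption: no argument, so the page is fetched from the URL itself.
        article.download()
        article.parse()
        results.append({
            "HTML": article.html,
            "author": article.authors,
            "title": article.title,
            "text": article.text,
            "date": str(article.publish_date),
        })
    return results


def save_results(results, out_path="data.json"):
    """Write all parsed articles to a single JSON file instead of overwriting it per file."""
    with open(out_path, "w") as outfile:
        json.dump(results, outfile)
```

With the server running, this could be used as save_results(parse_articles(html_links(os.listdir(path)))). Because only .html names ever become links, nothing is built from an empty string and no file is processed twice.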
Answer 1 (score: 0)
A checklist to follow before digging deeper into debugging: