What is the Unicode encoding error in the following code?

Time: 2019-02-08 20:14:34

Tags: python xhtml2pdf

When I run the program shown below, I get an error related to Unicode encoding.

The code and the error are given below.

import bs4
import requests
from xhtml2pdf import pisa  # import python module
from xhtml2pdf.config.httpconfig import httpConfig

res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

sourceHtml =str(pf)
outputFilename = "test.pdf"

def convertHtmlToPdf(sourceHtml, outputFilename):
    # disable SSL certificate checks for resources that xhtml2pdf fetches
    httpConfig.save_keys('nosslcheck', True)

    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")

    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding="utf-8")

    # close output file
    resultFile.close()

    # pisaStatus.err is falsy on success and truthy on errors
    return pisaStatus.err

# Main program
if __name__ == "__main__":
    pisa.showLogging()
    convertHtmlToPdf(sourceHtml, outputFilename)

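I am trying to download part of a website with xhtml2pdf. To do that, I scrape the page with bs4, keep the part I need, and then save it as a PDF with xhtml2pdf. Most of the time it works like a charm, but for this page it gives me an error. A link to the full code on GitHub is given below.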

The full code is available here

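xhtml2pdf encodes with ASCII, and because my HTML contains non-ASCII characters, it raises an error. I don't know how to change the encoder in xhtml2pdf. I can't simply drop the non-ASCII characters: if I do, the links to the images break and the images do not appear in the PDF.

Full traceback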

Traceback (most recent call last):
  File "test3.py", line 80, in <module>
    convertHtmlToPdf(sourceHtml, outputFilename)
  File "test3.py", line 68, in convertHtmlToPdf
    pisaStatus = pisa.CreatePDF(sourceHtml, dest=resultFile, encoding= 'utf-8')
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 97, in pisaDocument
    encoding, context=context, xml_output=xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\document.py", line 59, in pisaStory
    pisaParser(src, context, default_css, xhtml, encoding, xml_output)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 759, in pisaParser
    pisaLoop(document, context)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 700, in pisaLoop
    pisaLoop(node, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 644, in pisaLoop
    pisaLoop(nnode, context, path, **kw)
  [Previous line repeated 2 more times]
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 514, in pisaLoop
    attr = pisaGetAttributes(context, node.tagName, node.attributes)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\parser.py", line 124, in pisaGetAttributes
    nv = c.getFile(nv)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\context.py", line 818, in getFile
    return getFile(name, relative or self.pathDirectory)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 738, in getFile
    file = pisaFileObject(*a, **kw)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\site-packages\xhtml2pdf\util.py", line 644, in __init__
    conn.request("GET", path)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\Ananthu\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 37: ordinal not in range(128)

1 answer:

Answer 0: (score: 0)

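The problem is that the retrieved HTML contains img tags, and some of their src attributes are URLs containing the character '\u2019' (RIGHT SINGLE QUOTATION MARK).

xhtml2pdf passes these URLs to Python's http.client module without escaping them first. http.client tries to encode the URL as ASCII before retrieving it, and that is where the error occurs.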
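The failing step can be reproduced without xhtml2pdf or any network access. This is a minimal sketch under one assumption: the image path is made up for illustration, since the real URL is not shown in the question. It builds a request line as a str and encodes it as ASCII, which is the call at the bottom of the traceback.

```python
# Hypothetical path containing U+2019; not the actual URL from the page.
path = "/wp-content/uploads/india\u2019s-map.png"
request_line = f"GET {path} HTTP/1.1"


def try_ascii(text):
    """Return the ASCII-encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc


result = try_ascii(request_line)
print(type(result).__name__, result)
```

Any character outside the ASCII range in the path triggers the same exception, so the quotation mark is only one possible culprit.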

This can be fixed by escaping the URLs in the retrieved HTML before generating the PDF.

urllib.parse provides the tools to do this.

from urllib import parse
...
res = requests.get("https://www.insightsonindia.com/2018/06/04/insights-daily-current-affairs-04-june-2018/")
soup = bs4.BeautifulSoup(res.text, 'lxml')
pf = soup.find("div", class_="pf-content")

# percent-encode the path of every image URL so it is pure ASCII
imgs = pf.find_all('img')
for img in imgs:
    url = img['src']
    scheme, netloc, path, params, query, fragment = parse.urlparse(url)
    new_path = parse.quote(path)
    new_url = parse.urlunparse((scheme, netloc, new_path, params, query, fragment))
    img['src'] = new_url

sourceHtml = str(pf)
outputFilename = "test.pdf"
...
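The escaping step in the loop above could also be factored into a small helper. The sketch below is one possible variant, not part of the original answer: the name quote_url_path and the example URL are hypothetical, and passing safe="/%" is an assumption made so that URLs that are already percent-encoded are not encoded a second time.

```python
from urllib import parse


def quote_url_path(url):
    """Percent-encode the path component of url, leaving the other parts untouched."""
    parts = parse.urlsplit(url)
    # safe="/%" keeps path separators and existing %XX escapes intact,
    # so an already-escaped URL passes through unchanged.
    new_path = parse.quote(parts.path, safe="/%")
    return parse.urlunsplit(parts._replace(path=new_path))


# Hypothetical URL, used only for illustration.
url = "https://example.com/uploads/india\u2019s-map.png"
escaped = quote_url_path(url)
print(escaped)
```

The trade-off of treating "%" as safe is that a literal percent sign in a path is left as-is. Like the loop in the answer, the helper only touches the path, so non-ASCII characters in the query string would still need separate handling.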

The answers to this question provide some background on Unicode and URLs.