Question

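I am trying to read all PDF files in a folder and use a regular expression to find a number in each of them. On inspection, the charset of the PDFs is reported as "UTF-8".

This error is raised:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I tried reading in binary mode and tried Latin-1 encoding, but that shows all the special characters, so the search finds nothing.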

import os
import re
import pandas as pd
download_file_path = "C:\\Users\\...\\..\\"
for file_name in os.listdir(download_file_path):
    try:
        with open(download_file_path + file_name, 'r',encoding="UTF-8") as f:
          s = f.read()
          re_api = re.compile("API No\.\:\n(.*)")
          api = re_api.search(s).group(1).split('"')[0].strip()
          print(api)
    except Exception as e:
        print(e)

Expected result: find the API number in each PDF file.

Answer 1

When you open a file with open(..., 'r', encoding='utf-8'), you are basically guaranteeing that it is a text file containing no bytes that are not valid UTF-8. That guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings in UTF-8. But that is not how you read it.

If you have access to a library which reads PDF and extracts text strings, you could do

# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()

More realistically, but in a more pedestrian fashion, you could read the PDF file as binary and look for the encoded text.

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()
if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()

A crude workaround is to lie to Python about the encoding and claim that it is actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (for example, "hëlló" will render as "hÃ«llÃ³"). You can extract the actual UTF-8 text by converting the text back to bytes and then decoding those with the correct encoding (that is, encode as Latin-1, then decode as UTF-8).
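The Latin-1 round trip described in this answer can be checked in a few lines. This is a minimal sketch using only built-in string and bytes methods; the variable names are illustrative, and the sample string is the one quoted in the answer.

```python
# Text that really is UTF-8 on disk ...
original = "hëlló"
raw = original.encode("utf-8")

# ... but decoded as Latin-1: this never fails, yet every non-ASCII
# character turns into two mojibake characters.
mojibake = raw.decode("latin-1")
print(mojibake)   # hÃ«llÃ³

# Undo the lie: go back to the same bytes, then decode with the real encoding.
recovered = mojibake.encode("latin-1").decode("utf-8")
print(recovered)  # hëlló
```

The round trip is lossless only because Latin-1 maps each of the 256 byte values to exactly one code point; it would not work with an encoding that rejects or merges some bytes.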

Answer 2

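The problem may be caused by your computer name. I ran into this error in the Python Django framework.

The solution is: "your computer name must not contain special characters". Check your computer name and change it if necessary.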

Answer 3

PDF files are stored as bytes. So to read or write a PDF file, you need to use rb or wb mode.

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())

The error 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte is probably caused either by your editor or by the PDF not being UTF-8 encoded (which is generally the case).
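One plausible source of the 0xe2 at position 10, assuming a typical PDF: many PDF files begin with a header line such as %PDF-1.4, followed by a comment line holding four bytes above 127 (often E2 E3 CF D3) that marks the file as binary. The sketch below builds such a header by hand. The bytes are an assumed example, not taken from the asker's file, and they reproduce the same error.

```python
# Assumed example header: "%PDF-1.4\n" occupies bytes 0-8, "%" is byte 9,
# and the binary marker starts with 0xe2 at position 10.
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

try:
    header.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xe2 opens a 3-byte UTF-8 sequence, but 0xe3 is not a continuation byte.
    print(err)
    position, reason = err.start, err.reason

# Latin-1 maps every byte to a code point, so this decode never raises.
text = header.decode("latin-1")
```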

So decode with a different encoding instead:

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any other encoding that is suitable here

If your editor's console is not compatible with the output, you will not see any output either.

Note: you cannot pass the encoding parameter when opening a file in rb mode, so you have to decode after reading the file.
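Because the data stays as bytes until it is decoded, the search itself can also run on bytes, decoding only the matched fragment. This is a minimal sketch, assuming the label and its value appear uncompressed in the file, as the question's regex expects. Many PDFs compress their text streams, in which case nothing will match and a PDF text-extraction library is needed. The function name find_api_number and the sample bytes are made up for illustration.

```python
import re

# Bytes pattern mirroring the question's regex: label, newline, rest of the line.
API_PATTERN = re.compile(rb"API No\.:\n(.*)")

def find_api_number(data: bytes):
    """Return the text after 'API No.:' or None if the label is not found."""
    match = API_PATTERN.search(data)
    if match is None:
        return None
    # Decode only the captured line; replace undecodable bytes instead of raising.
    value = match.group(1).decode("utf-8", errors="replace")
    return value.split('"')[0].strip()

# Made-up sample: non-UTF-8 bytes before the label, as in a real PDF header.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\nAPI No.:\n 42-123-45678 \"x\"\nmore"
print(find_api_number(sample))
```

For a real file, the bytes would come from open(path, 'rb').read(). Returning None when nothing matches avoids the AttributeError that re_api.search(s).group(1) raises in the question's code.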

'utf-8' codec can't decode byte 0xe2: invalid continuation byte error
