Question

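What the code should do

I am trying to read every file in a given folder and extract some lines from each one using the bs4 (BeautifulSoup) package in Python.

I get an error when reading some files that contain bytes the default codec cannot decode.

Error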

    Traceback (most recent call last):
      File "C:-----\check.py", line 25, in <module>
        soup = BeautifulSoup(text.read(), 'html.parser')
      File "C:\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3565: character maps to <undefined>

    from bs4 import BeautifulSoup
    from termcolor import colored
    import re, os

    import requests

    path = "./brian-work/"
    freddys_library = os.listdir(path)

    def getfiles():
        # Collect every .html file under path, including subfolders.
        files = []
        for r, d, f in os.walk(path):
            for file in f:
                if '.html' in file:
                    files.append(os.path.join(r, file))
        return files

    count = 0  # number of files whose first h1 word differs from the file name
    for book in getfiles():
        print("file is printed")
        print(book)
        text = open(book, "r")  # text mode, platform default encoding
        soup = BeautifulSoup(text.read(), 'html.parser')  # UnicodeDecodeError is raised here
        if soup.find('h1'):
            h1 = soup.select('h1')[0].text.strip()
            print(h1)
        else:
            print("no h1")
            continue

        # File name without directory or extension.
        filename1 = os.path.basename(book)
        filename1 = filename1.split(".")[0]
        print(h1.split(' ', 1)[0])
        print(filename1)
        # Compare the first word of the h1 with the part of the name before the first '-'.
        if h1.split(' ', 1)[0].lower() == filename1.split('-', 1)[0]:
            print('+++++++++++++++++++++++++++++++++++++++++++++')
            print('same\n')
        else:
            print('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
            print('not')
            count = count + 1

Please help me correct this.

Thanks

Answer 1

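The problem is that the file is opened without knowing its encoding. Per the open documentation, the default encoding for text = open(book, "r") is the value returned by locale.getpreferredencoding(False), which is cp1252 on your system. The file is in some other encoding, so decoding fails.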
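To see why byte 0x9d trips cp1252, here is a minimal sketch. It assumes the file is actually UTF-8, which the question does not confirm; in UTF-8, a right double quotation mark (U+201D) happens to end in byte 0x9d, and cp1252 has no character assigned to that byte.

```python
import locale

# Encoding that open() falls back to when no encoding argument is given
# (platform dependent; the traceback above shows cp1252).
print(locale.getpreferredencoding(False))

# Assumption for illustration: the file is UTF-8 and contains a curly quote.
data = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

try:
    data.decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as exc:
    # Same kind of failure as in the traceback: 0x9d is undefined in cp1252.
    decoded_ok = False
    print(exc)

# Decoding with the encoding the bytes were written in works.
roundtrip = data.decode("utf-8")
```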

Use text = open(book, "rb") (binary mode) and let BeautifulSoup figure it out. HTML files usually indicate their encoding.
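A rough sketch of the binary-mode approach follows. The sample file, its contents, and the name title are made up for illustration; the only points taken from the answer are opening with "rb" and passing the raw bytes to BeautifulSoup, which then tries to detect the encoding itself (for example from a meta charset declaration).

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page that declares its own encoding.
html = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><h1> Caf\u00e9 \u201cmenu\u201d </h1></body></html>"
)

with tempfile.TemporaryDirectory() as tmp:
    book = os.path.join(tmp, "sample.html")
    with open(book, "wb") as fh:
        fh.write(html.encode("utf-8"))

    # Binary mode: no decoding happens in open(), so no UnicodeDecodeError here.
    with open(book, "rb") as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")

h1 = soup.find("h1")
title = h1.get_text(strip=True) if h1 else None
print(title)
```

Using a with block also closes each file, which the loop in the question never does.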

You can also use text = open(book, encoding='utf8'), or whatever the correct encoding is, if you already know it.
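If the encoding is known, passing it explicitly might look like the sketch below. The file names and contents are hypothetical. The second half goes slightly beyond the answer: errors="replace" is a standard open() parameter that substitutes U+FFFD for undecodable bytes instead of raising, which may be acceptable when a few damaged characters do not matter.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A well-formed UTF-8 file, read with the encoding stated explicitly.
    good = os.path.join(tmp, "good.html")
    with open(good, "wb") as fh:
        fh.write("<h1>\u201cQuoted\u201d</h1>".encode("utf-8"))
    with open(good, encoding="utf-8") as fh:
        text = fh.read()

    # A file with a stray byte that is not valid UTF-8.
    bad = os.path.join(tmp, "bad.html")
    with open(bad, "wb") as fh:
        fh.write(b"<h1>ok \x9d</h1>")
    # errors="replace" keeps reading and marks the bad byte with U+FFFD.
    with open(bad, encoding="utf-8", errors="replace") as fh:
        lossy = fh.read()

print(text)
print(lossy)
```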
