Python: getting a file by URL

Time: 2017-12-17 13:37:43

Tags: python encoding utf-8

I need to fetch a file from a URL and return the line of that file that contains the most words. Here is my code:

from urllib.request import urlopen

def wordiest_line(url):
    data = urlopen(url)

    if data:
        max_words = 0
        max_line = ""
        for line in data.readlines(): 
            #print(line)
            the_encoding = "utf-8"
            line = line.decode(the_encoding)
            line = line.rstrip()
            line_words = line.split()
            if len(line_words) > max_words:
                max_words = len(line_words)
                max_line = line

        #print("%s to RETURN\n" % max_line)
        return max_line

    else:
        return None

Here are some URLs for testing this function:

  1. http://math-info.hse.ru/f/2017-18/dj-prog/lines1.txt
  2. http://lib.ru/FOUNDATION/3laws.txt_Ascii.txt
  3. http://math-info.hse.ru/f/2017-18/dj-prog/lines2.txt
For links 1 and 3 it works fine, but wordiest_line("http://lib.ru/FOUNDATION/3laws.txt_Ascii.txt") does not work correctly because of the file's encoding: it contains some Cyrillic text.

I tried to detect what encoding each line has and decode it accordingly. Here is the code:

    from urllib.request import urlopen
    import chardet    
    
    def wordiest_line(url):
        data = urlopen(url)
    
        if data:
            max_words = 0
            max_line = ""
            for line in data.readlines(): 
                #print(line)
                the_encoding = chardet.detect(line)['encoding']
                line = line.decode(the_encoding)
                #print(the_encoding, line)
                line = line.rstrip()
                line_words = line.split()
                if len(line_words) > max_words:
                    max_words = len(line_words)
                    max_line = line
    
            #print("%s to RETURN\n" % max_line)
            return max_line
    
        else:
            return None
    

Now wordiest_line("http://lib.ru/FOUNDATION/3laws.txt_Ascii.txt") fails with the error: 'charmap' codec can't decode byte 0xdc in position 8: character maps to <undefined>.

The other URLs still work fine. Do you have any suggestions on how to fix this?

2 answers:

Answer 0 (score: 1)

The chardet library can be a life-saver if you have to guess or fix the encoding of messy input. However, in your case this information is given – at least for the lib.ru example. As expected from any well-behaving server, the charset of a plain-text response is specified in the "Content-Type" header:

    import codecs
    from urllib.request import urlopen

    def wordiest_line(url):
        resp = urlopen(url)
        charset = resp.headers.get_content_charset()
        textreader = codecs.getreader(charset)(resp)
        for line in textreader:
            line = line.rstrip()
            # continue with tokenising and counting...

Note: I assume that you are using Python 3; the above code won't work in Python 2. Also, I suggest you decode the content before iterating over the lines of the file, assuming you won't be given broken input like badly messed-up files with differently encoded lines.

Second note: the requests library will probably allow you to write less boiler-plate code for this task.
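
For illustration, a rough sketch of how that might look (my own sketch using requests, not code from the answer):

    # Sketch: requests uses the charset from the Content-Type header to decode
    # resp.text, so we never have to handle raw bytes ourselves.
    import requests

    def wordiest_line(url):
        resp = requests.get(url)
        resp.raise_for_status()              # fail loudly on HTTP errors
        max_line = ""
        for line in resp.text.splitlines():  # already-decoded text
            line = line.rstrip()
            if len(line.split()) > len(max_line.split()):
                max_line = line
        return max_line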

Third note: For counting words, line.split() is rather simplistic. For example, "argue," and "argue" will be considered different words, and you might even want to define "arguing" and "argued" as belonging to the same word. In that case, you'll have to use an NLP library, such as NLTK or SpaCy.
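
For example, a tiny sketch with NLTK (assuming nltk is installed and its tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

    import nltk

    line = "I argue, you argued, they keep arguing."
    print(line.split())              # whitespace split: "argue," keeps its trailing comma
    print(nltk.word_tokenize(line))  # punctuation is split off into separate tokens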

Answer 1 (score: 0)

Python allows you to do fault-tolerant decoding with `decode(encoding, 'replace')`, which replaces any problematic byte with the official U+FFFD REPLACEMENT CHARACTER.

If you are not sure of the encoding (and if the solution proposed by @lenz is inconvenient), you could use:

        line = line.decode(the_encoding, 'replace')

Even with the utf8 encoding it will still identify the correct line, though of course it won't be able to decode it correctly.
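
A quick illustration of what the 'replace' handler does (the byte values below are just made up for the example):

        raw = b'\xdc\xc5\xcc\xd8'                # example bytes that are not valid UTF-8
        text = raw.decode('utf-8', 'replace')    # undecodable bytes become U+FFFD
        print(text, len(text.split()))           # still fine for splitting and counting words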

Or you can use the fact that the Latin1 encoding converts any byte into the Unicode character with the same code value. Here you could do:

        try:
            line = line.decode(the_encoding)
        except UnicodeDecodeError:
            line = line.decode('Latin1')

Not only does this correctly identify the right line, but also, with:

        line = wordiest_line("http://lib.ru/FOUNDATION/3laws.txt_Ascii.txt")
        orig = line.encode('Latin1')

you get the original line back with its original bytes, and can then work out how to decode it correctly.

BTW, the correct encoding of the file is KOI8-R.
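
Putting those two points together (my own sketch, assuming the line came out of the Latin1 fallback shown above):

        line = wordiest_line("http://lib.ru/FOUNDATION/3laws.txt_Ascii.txt")
        raw = line.encode('Latin1')    # Latin1 round-trips each character back to its original byte
        print(raw.decode('koi8-r'))    # decode with the file's real encoding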