In Python, what is the most efficient way to chunk a UTF-8 string for REST delivery?

Asked: 2014-03-28 16:54:49

Tags: python string rest unicode utf-8

  1. Let me start by saying that I only sort of understand what 'UTF-8' encoding is — it is basically, but not exactly, Unicode, and ASCII is a smaller character set. I also understand that if I have:

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)              #will return the number of characters in the string, in my case '1500'
    print sys.getsizeof(se_body)    #will return the number of bytes, which will be 3050
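To make the distinction between the two counts concrete, here is a minimal illustration (the string is made up for the example, not my real data; 'Α' is a Greek capital alpha, which takes two bytes in UTF-8):

```python
# -*- coding: utf-8 -*-
s = u'abΑ'                       # two ASCII letters plus one Greek letter
print(len(s))                    # 3 -- codepoints in the (unicode) string
print(len(s.encode('utf-8')))    # 4 -- bytes once UTF-8 encoded
```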
    
  2. My code is consuming a RESTful API that I don't control. The RESTful API's job is to parse Bible references out of the text passed to it, and it has an interesting quirk — it only accepts 2000 characters at a time. If more than 2000 characters are sent, my API call returns a 404. Again, I'm consuming someone else's API, so please don't tell me to "fix the server side". I can't. :)

  3. My solution is to take the string and chunk it into pieces smaller than 2000 characters, have the service scan each chunk, and then reassemble and tag as needed. I'd like to be kind to the service and pass as few chunks as possible, which means each chunk should be large.

  4. My problem arises when I pass a string containing Hebrew or Greek characters. (Yes, Bible answers frequently use Greek and Hebrew!) If I set the chunk size as low as 1000 characters, it always passes safely, but that seems really small. In most cases I should be able to make it larger.

  5. My question is: without resorting to too much heroism, what is the most efficient way to break UTF-8 into correctly sized chunks?

  6. Here is the code:

    # -*- coding: utf-8 -*-
    import requests
    import json
    
    biblia_apikey = '************'
    refparser_url = "http://api.biblia.com/v1/bible/scan/?"
    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"
    
    se_body = se_body.decode('utf-8')
    
    nchunk_start=0
    nchunk_size=1500
    found_refs = []
    
    while nchunk_start < len(se_body):
        body_chunk = se_body[nchunk_start:nchunk_start+nchunk_size]
        if len(body_chunk.strip()) < 4:
            break
    
        refparser_params = {'text': body_chunk, 'key': biblia_apikey }
        headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
        refparse = requests.get(refparser_url, params = refparser_params, headers=headers)
    
        if (refparse.status_code == 200):
            foundrefs = json.loads(refparse.text)
            for foundref in foundrefs['results']:
                foundref['textIndex'] += nchunk_start
                found_refs.append( foundref ) 
        else:
            print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
            print "  returned text is: =>{0}<=".format(refparse.text)
    
        nchunk_start += (nchunk_size-50)
        #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks
    
    
    for ref in found_refs:
        print ref
        print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]
    

    I know how to slice the string by character positions, but I'm not sure how to slice it based on the length of its UTF-8 bytes.

    When I'm done, I need to pull out the selected references (I'm actually going to add SPAN tags). This is what the output looks like for now:

    {u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
    Genesis 2:2
    {u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
    Genesis 8:9
    

1 Answer:

Answer 0 (score: 2):

There are several different sizes that could be meant here:

  1. The size in memory returned by sys.getsizeof(), e.g.

    >>> import sys
    >>> sys.getsizeof(b'a')
    38
    >>> sys.getsizeof(u'Α')
    56
    

    i.e., a bytestring that contains a single byte b'a' may require 38 bytes in memory.
    You shouldn't care about this number unless your local machine has memory problems.

  2. The number of bytes in the text encoded as utf-8:

    >>> unicode_text = u'Α' # greek letter
    >>> bytestring = unicode_text.encode('utf-8')
    >>> len(bytestring)
    2
    
  3. The number of Unicode codepoints in the text:

    >>> unicode_text = u'Α' # greek letter
    >>> len(unicode_text)
    1
    

    In general, you might also be interested in the number of grapheme clusters ("visual characters") in the text:

    >>> unicode_text = u'ё' # cyrillic letter
    >>> len(unicode_text) # number of Unicode codepoints
    2
    >>> import regex # $ pip install regex
    >>> chars = regex.findall(u'\\X', unicode_text)
    >>> chars
    [u'\u0435\u0308']
    >>> len(chars) # number of "user-perceived characters"
    1
    
  4. If the API limit is defined in terms of item 2 (the number of bytes in the utf-8 encoded bytestring) then you can use the answers from the question linked by @Martijn Pieters: Truncating unicode so it fits a maximum size when encoded for wire transfer. The first answer should work:

    truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
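The same encode/decode trick extends naturally to the chunking problem in the question. A minimal sketch (chunk_by_bytes is a hypothetical name; it splits only at codepoint boundaries and omits the 50-character overlap from the original loop):

```python
def chunk_by_bytes(text, max_bytes):
    # Split a unicode string into pieces whose UTF-8 encoding is at most
    # max_bytes each, never cutting in the middle of a codepoint.
    chunks = []
    data = text.encode('utf-8')
    start = 0
    while start < len(data):
        # 'ignore' drops a trailing partial codepoint, if the cut split one
        piece = data[start:start + max_bytes].decode('utf-8', 'ignore')
        if not piece:  # max_bytes is smaller than a single codepoint
            raise ValueError('max_bytes too small')
        chunks.append(piece)
        start += len(piece.encode('utf-8'))  # advance by bytes actually taken
    return chunks

# Hebrew alef is 2 bytes in UTF-8, so a 2-byte budget yields 3 chunks:
print(chunk_by_bytes(u'a\u05d0b', 2))
```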
    

    It is also possible that the length is limited by the URL length:

    >>> import urllib
    >>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
    '%D0%B5%CC%88'
    

    To truncate it:

    import re
    import urllib
    
    urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
    # remove `%` or `%X` at the end
    urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded) 
    truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
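For what it's worth, here is a quick check of the same idea in Python 3 terms (there, urllib.parse.unquote returns text, so unquote_to_bytes is the closer equivalent of Python 2's urllib.unquote):

```python
import re
from urllib.parse import quote, unquote_to_bytes

encoded = quote(u'\u0435\u0308'.encode('utf-8'))   # '%D0%B5%CC%88'
clipped = encoded[:8]                              # '%D0%B5%C' -- cut mid-escape
clipped = re.sub(r'%[0-9a-fA-F]?$', '', clipped)   # strip the dangling '%C'
truncated = unquote_to_bytes(clipped).decode('utf-8', 'ignore')
print(repr(truncated))                             # the intact first codepoint survives
```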
    

    The URL-length issue could be worked around with the 'X-HTTP-Method-Override' HTTP header, which lets you convert a GET request into a POST request if the service supports it. Here's a code example that uses Google Translate API.

    If your case allows it, you could compress the HTML text by decoding the HTML entities and by combining some Unicode codepoints using the NFC Unicode normalization form:

    import unicodedata
    from HTMLParser import HTMLParser
    
    unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
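A quick sanity check of that step, reusing the cyrillic example from above (in current Python 3 the unescaping half lives in html.unescape rather than HTMLParser):

```python
import unicodedata
from html import unescape   # Python 3 home of HTMLParser().unescape

raw = u'&#x435;&#x308;'                        # HTML entities for 'е' + combining diaeresis
text = unescape(raw)                           # 2 codepoints after unescaping
composed = unicodedata.normalize('NFC', text)  # NFC folds them into the single codepoint 'ё'
print(len(raw), len(text), len(composed))      # 14 2 1
```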