Removing unicode-formatted Japanese strings from Python output

Time: 2014-05-02 17:58:44

Tags: python unicode substring

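I have a script that collects some text elements from the web; the content in question is machine-translated and leaves behind remnants that mix the original language with English. I want to strip out any non-Latin characters, but I have not been able to find a good substitution for it. Here is an example of the string and the desired output: I want to remove this:\u30e6\u30fc\u30ba\u30c9but leave everything else. >> I want to remove this:but leave everything else.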

Here is the code I am currently using to demonstrate the problem:

import requests
from lxml import html
from pprint import pprint
import os
import re
import logging

header = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language' : 'en-US,en;q=0.8', 'Cookie' : 'search_layout=grid; search.ab=test-A' }
# necessary to perform the HTTP GET request

def main():
    # get page content
    response = requests.get('http://global.rakuten.com/en/store/wanboo/item/w690-3/', headers=header)
    # return parsed body for the lxml module to process
    parsed_body = html.fromstring(response.text)
    # get the title tag
    dirtyname = unicode(parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()"))
    # test that this tag returns undesired unicode output for the japanese characters
    print dirtyname
    # attempt to clean the unicode using a custom filter to remove any characters in this particular range
    clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, unicode(dirtyname)))
    # output of the filter should return no unicode characters but currently does not
    print clean_name
    # the remainder of the script is unnecessary for the problem in question so I have removed it

if __name__ == '__main__':
    main()

1 Answer:

Answer 0 (score: 1)

''.join(filter(lambda character:ord(character) < 0x3000,my_unicode_string))

I think this should work...

Or you may want to restrict the output to byte-sized characters:

 ''.join(filter(lambda character:ord(character) < 0xff,my_unicode_string))

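Basically, it is easy to filter down to whatever range you want... (in fact, it is probably safer to filter on codepoint < 0x100, because a limit of < 0xff also drops U+00FF)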

For example:

>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character:ord(character) < 0x3000,test_text))
u'2XT'
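For readers on Python 3, where every str is already unicode and filter returns an iterator, the same idea can be sketched roughly as below. The helper names strip_above and strip_non_latin1 are made up for illustration, and the regular-expression variant is an alternative to the filter call rather than something taken from the answer.

```python
import re


def strip_above(text, limit=0x3000):
    """Keep only the characters whose code point is below `limit`."""
    return ''.join(ch for ch in text if ord(ch) < limit)


# Alternative: remove every run of characters outside U+0000..U+00FF.
_NON_LATIN1 = re.compile(r'[^\x00-\xff]+')


def strip_non_latin1(text):
    """Drop anything that does not fit in a single byte (Latin-1 range)."""
    return _NON_LATIN1.sub('', text)


# The sample string from the question.
sample = 'I want to remove this:\u30e6\u30fc\u30ba\u30c9but leave everything else.'
print(strip_above(sample))
print(strip_non_latin1(sample))
```

Both calls print the sample with the four katakana characters removed. The regex form is the stricter of the two, since it also drops characters between U+0100 and U+2FFF.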

Regarding the problem with the code in your question:

dirtyname = parsed_body.xpath() ... # this returns a list ... not a string, so we will put our own list in as a stand-in to demonstrate the issue


dirtyname = [u"hello\u2345world"]

Then you call unicode on that list:

dirtyname = unicode(dirtyname)

Now if you print the repr, as I suggested in the comments, you will see:

>>> print repr(dirtyname)
u"[u'hello\\u2345world']"
>>> for item in dirtyname:
...    print item
[
u
'
h
# and so on

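Notice that it is now just a string... it is not a list, and there are no non-ASCII unicode characters in the string, because the backslash is escaped.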
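The effect can be reproduced without the web page. The sketch below assumes Python 3, where ascii() applied to a list yields the same kind of escaped text that unicode() applied to a list yields in Python 2; the sample list is a made-up stand-in for the XPath result.

```python
# Stand-in for the list that xpath() returns (made-up content).
nodes = ['Used \u30e6\u30fc\u30ba\u30c9 item']

# Stringifying the whole list gives its repr, with each non-ASCII
# character spelled out as a literal backslash escape.
as_text = ascii(nodes)
print(as_text)

# Every character is now plain ASCII, so a filter on ord() < 0x3000
# has nothing to remove and returns the text unchanged.
filtered = ''.join(ch for ch in as_text if ord(ch) < 0x3000)
print(filtered == as_text)
```

This is why the filter in the question appeared to do nothing: by the time it ran, the Japanese characters had already been turned into ordinary backslash, letter, and digit characters.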

You can fix this easily by taking the element out of the array instead of using the whole array.... parsed_body.xpath(...)[0]

>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> # notice that we got the unicode element that is in the array
>>> print repr(dirtyname)
u'Hello\u30e6world'
>>> clean_name = ''.join(filter(lambda character: ord(character) < 0x3000, dirtyname))
>>> print repr(clean_name)
u'Helloworld'
>>> # notice that everything is correctly filtered
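Indexing with [0] raises IndexError when the XPath matches nothing, and it ignores any further text nodes. A slightly more defensive variant might join whatever nodes came back before filtering. This is only a sketch: clean_title is a hypothetical helper, it is written for Python 3, and the list passed in stands for the result of the xpath() call.

```python
def clean_title(nodes, limit=0x3000):
    """Join the text nodes and drop characters at or above `limit`."""
    joined = ''.join(nodes)  # an empty list simply gives ''
    kept = ''.join(ch for ch in joined if ord(ch) < limit)
    return kept.strip()      # trim whitespace left around the title


# Made-up inputs standing in for xpath() results.
print(clean_title(['  Hello\u30e6world ']))
print(repr(clean_title([])))
```

Whether an empty string is an acceptable result for a missing title depends on the rest of the script; raising an explicit error there may suit some callers better.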