Encoding Arabic results from Google search

Time: 2015-10-12 14:34:13

Tags: python encoding

I wrote this function to get the top 10 results from a Google search:

def google_search(self, query):
    """
    Return the URLs of the top 10 Google search results for a keyword.
    """
    params = {'q': query}
    url = 'https://www.google.com/search?' + urllib.urlencode(params)
    result = urlfetch.fetch(url=url)
    content = result.content
    soup = BeautifulSoup(content)
    # Each search result is an <li class="g"> element.
    items = soup.findAll("li", {'class': 'g'})
    urls = []
    for item in items:
        # The href is relative, so prefix the Google host.
        link = item.findAll('a')[0]
        url = 'https://www.google.com' + link['href']
        urls.append(url.encode('utf-8'))
    return urls

Then I wrote another function that finds related Wikipedia articles based on the Google search:

def wikipedia_search(self, query, language='en'):
    """
    Return a list of URLs and titles of the top Wikipedia search results for a keyword.
    """
    q = query + u' site:%s.wikipedia.org' % language
    urls = self.google_search(q.encode('utf-8'))
    results = []
    for url in urls:
        # The title is whatever sits between "/wiki/" and the next "&s" parameter.
        title = re.findall(r'/wiki/(.*)&s', url.encode('utf-8'))[0].replace("_", " ")
        link = re.findall(r'q=(.*)&s', url)[0]
        url_tag = {'url': link, 'title': title}
        results.append(url_tag)
    return results

But when I try to search in Arabic, I get results like the following: {'title':'%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9','url':'https://ar.wikipedia.org/wiki/%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9'},{'title':'%25D8%25A8%25D9%258A%25D8%25AA %25D9%2588%25D9%258A%25D9%2586%25D8%25AF%25D8%25B3%25D9%2588%25D8%25B1','url':'https://ar.wikipedia.org/wiki/%25D8%25A8%25D9%258A%25D8%25AA_%25D9%2588%25D9%258A%25D9%2586%25D8%25AF%25D8%25B3%25D9%2588%25D8%25B1'} Basically, I can't read or use them.

1 answer:

Answer 0 (score: 0)

The data is UTF-8 encoded bytes that have been escaped with URL quoting, so you need to unquote it and then decode it:

url = urllib.unquote(url).decode('utf8')

Demo:

>>> import urllib 
>>> url='example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> urllib.unquote(url).decode('utf8') 
u'example.com?title=\u043f\u0440\u0430\u0432\u043e\u0432\u0430\u044f+\u0437\u0430\u0449\u0438\u0442\u0430'
>>> print urllib.unquote(url).decode('utf8')
example.com?title=правовая+защита
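
One detail worth noting: the values in the question contain `%25`, which is itself an escaped percent sign, so they appear to have been quoted twice, and a single pass of unquoting would leave `%D8%AD...` rather than Arabic text. Below is a minimal sketch in Python 3, where the function lives at `urllib.parse.unquote` and decodes to UTF-8 text by default. The helper name `unquote_fully` and its `max_rounds` parameter are hypothetical and not part of any library.

```python
from urllib.parse import unquote


def unquote_fully(value, max_rounds=5):
    """Unquote repeatedly until the text stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


# First title from the question: every "%" was escaped again as "%25".
title = '%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9'

once = unquote(title)         # still percent-encoded: '%D8%AD%D9%83%D9%88%D9%85%D8%A9'
print(once)
print(unquote_fully(title))   # readable Arabic text
```

Unquoting until nothing changes is convenient but would over-decode a title that legitimately contains a literal percent sequence, so calling `unquote` exactly twice may be the safer choice when the number of encoding layers is known.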

(Quoted directly from "Url decode UTF-8 in Python", because I can't post a comment.)
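
As a possible alternative to the regular expressions in the question, the redirect link can be taken apart with the standard URL parser. This is only a sketch in Python 3, and it assumes the links have the shape `https://www.google.com/url?q=<target>&sa=...`, which is inferred from the `q=(.*)&s` pattern and not confirmed by the question. The helper name `wikipedia_entry` and the sample `&sa=U` parameter are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs, unquote


def wikipedia_entry(google_url):
    """Return {'url', 'title'} for a Google redirect link (hypothetical helper)."""
    query = parse_qs(urlsplit(google_url).query)
    target = query['q'][0]  # parse_qs already unquotes the value once
    path = urlsplit(target).path
    # The article title is the path segment after /wiki/, still percent-encoded.
    title = unquote(path.split('/wiki/', 1)[1]).replace('_', ' ')
    return {'url': target, 'title': title}


# Sample link built from the first result shown in the question; the trailing
# "&sa=U" is an assumption about what Google appends.
link = ('https://www.google.com/url?q=https://ar.wikipedia.org/wiki/'
        '%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9&sa=U')

entry = wikipedia_entry(link)
print(entry['url'])
print(entry['title'])
```

Because `parse_qs` removes one layer of quoting, `entry['url']` comes out as an ordinary percent-encoded Wikipedia URL, and one further `unquote` on the path yields the readable title.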