Question

I am trying to extract the substring 'foro.enfemenino.com' from the URL

str2 = 'http://foro.enfemenino.com/forum/f166/__f22092_f166-Servicio-tecnico-philips-en-castelldefells-tel-900-100-137.html#25144'

using this code:

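    # Python 2; url and x are defined earlier in the script
    import urllib
    from urlparse import urlparse
    from bs4 import BeautifulSoup

    result = BeautifulSoup(urllib.urlopen(url + str(x)))
    for link in result.find_all('a', class_="title"):
        # r is the regex compiled in the edit below
        m = r.search(link['onmouseover'])
        str1 = m.group(1)
        print str1
        str2 = str1.encode('utf8')
        parsed_uri = urlparse(str2)

        # parsed_uri[0] is the scheme, parsed_uri[1] is the netloc
        domain = '{}://{}/'.format(parsed_uri[0], parsed_uri[1])
        print domain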
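The domain-building step can be checked on its own, separately from the scraping. Below is a minimal sketch, assuming Python 3 (where urlparse lives in urllib.parse rather than in the urlparse module) and using the URL from this question as a hard-coded sample. The helper name base_url is made up for illustration.

```python
from urllib.parse import urlparse


def base_url(url):
    """Return 'scheme://netloc/' for a URL string (hypothetical helper)."""
    parsed = urlparse(url)
    # The named attributes are equivalent to parsed[0] and parsed[1].
    return '{}://{}/'.format(parsed.scheme, parsed.netloc)


# Sample URL copied from the question.
sample = ('http://foro.enfemenino.com/forum/f166/'
          '__f22092_f166-Servicio-tecnico-philips-en-castelldefells-'
          'tel-900-100-137.html#25144')
print(base_url(sample))
```

If this prints the expected domain while the scraped value does not, the difference presumably lies in the scraped string itself rather than in urlparse.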

When str2 is a unicode string, the output I get is ':///'. However, if I just copy the string from the terminal and paste it directly into the call

parsed_uri = urlparse('http://foro.enfemenino.com/forum/f166/__f22092_f166-Servicio-tecnico-philips-en-castelldefells-tel-900-100-137.html#25144')

I get exactly 'foro.enfemenino.com'.

The string comes in as input from another function, and I get the same result whether I call str2.encode('utf8') or codecs.encode(str2, 'utf-8').
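Encoding changes the byte representation but not the characters, so it may help to inspect the exact value with repr() before parsing it. Below is a minimal sketch, assuming Python 3. The helper name describe is hypothetical and the URL is a shortened sample. It suggests that a text string and its UTF-8 bytes parse to the same host, so re-encoding alone would not be expected to change the outcome.

```python
from urllib.parse import urlparse


def describe(value):
    """Return (repr of value, parsed netloc) for inspection (hypothetical helper)."""
    return repr(value), urlparse(value).netloc


text = 'http://foro.enfemenino.com/forum/f166/'

# repr() makes stray quotes or whitespace visible, which print() can hide.
print(describe(text))
print(describe(text.encode('utf-8')))
```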

How can I solve this?

Edit:

 r = re.compile('window.status=(.*?); return true;')
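One possible explanation, offered as an assumption because the post does not show the page's HTML: if the onmouseover attribute looks like window.status='http://...'; return true;, then group(1) also captures the surrounding quote characters, and a string that begins with a quote has no parsable scheme or host. Below is a minimal sketch in Python 3. The attribute value and the strip step are illustrative and are not taken from the original page.

```python
import re
from urllib.parse import urlparse

r = re.compile(r'window\.status=(.*?); return true;')

# Hypothetical attribute value; the real page markup is not shown in the post.
onmouseover = "window.status='http://foro.enfemenino.com/forum/f166/'; return true;"

captured = r.search(onmouseover).group(1)  # still wrapped in quote characters
parsed = urlparse(captured)
broken = '{}://{}/'.format(parsed.scheme, parsed.netloc)

cleaned = captured.strip('\'"')  # drop the surrounding quotes before parsing
parsed = urlparse(cleaned)
fixed = '{}://{}/'.format(parsed.scheme, parsed.netloc)

print(repr(captured), broken, fixed)
```

Under that assumption, the pasted literal works because the quotes seen in the terminal become the Python string delimiters instead of part of the value.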

Unicode string does not work with urlparse despite conversion to UTF-8

0 answers: