Question

I am scraping the text of the following tweet.

.@mikhailaleshin on drivers scared of the #Indy500: "They just have small **. ... That’s the only explanation." -

I am running a regular expression over the page's source code:

page = urllib.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
soup_string = str(soup)
tweet_text = re.search(ur'<title*?>(.*)</title>', soup_string).group(1)

But when I print it to my screen, I get this:

.@mikhailaleshin on drivers scared of the #Indy500: "They just have small **. ... ThatÔÇÖs the only explanation."

So the quotation mark ’ becomes ÔÇÖ. My best guess is that this is some kind of encoding problem, but I don't know how to fix it.

Answer 1

re.search(ur'<title*?>(.*)</title>', soup_string, re.U).group(1)

(or)

re.search(ur'<title*?>(.*)</title>', soup.encode('utf-8'), re.U).group(1)

If this is a Unicode error, one of the two calls above should resolve it.
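To see why this looks like an encoding problem, note that ÔÇÖ is exactly what the three UTF-8 bytes of ’ (U+2019) look like when they are displayed in code page 850. The question does not say which console or code page was used, so cp850 is an assumption here. A minimal Python 3 sketch that reproduces the garbling (the function names are made up for illustration):

```python
# Python 3 sketch; cp850 is an assumed console code page, not stated in the question.
RIGHT_QUOTE = "\u2019"  # the curly apostrophe in "That’s"


def as_seen_on_console(text, console_encoding="cp850"):
    """Encode text as UTF-8, then misread those bytes in the console's code page."""
    return text.encode("utf-8").decode(console_encoding)


def undo_misreading(garbled, console_encoding="cp850"):
    """Reverse the misreading: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode(console_encoding).decode("utf-8")


garbled = as_seen_on_console(RIGHT_QUOTE)  # 'ÔÇÖ' (bytes E2 80 99 read as cp850)
restored = undo_misreading(garbled)        # back to '’'
```

If that assumption holds, the regular expression is not altering the text at all. The bytes are correct UTF-8 and only the display step misinterprets them.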

Here is a workaround:

# Python 2
import re
import urllib
from bs4 import BeautifulSoup

url = "https://twitter.com/a_s12/status/865229374844481536"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
# Serialize only the <title> element, as UTF-8 bytes
soup_string = soup.findAll('title')[0].encode('utf-8')
tweet_text = re.search(ur'<title>(.*?)</title>', soup_string, re.U).group(1)
print tweet_text
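One alternative, sketched here in Python 3 rather than the Python 2 used above: decode the downloaded bytes once, run the regular expression on text, and deal with the console's encoding only at print time. The helper names `extract_title` and `console_safe` are hypothetical, the sample HTML is invented for the example, and the sketch assumes the page is served as UTF-8.

```python
import re


def extract_title(page_bytes, encoding="utf-8"):
    """Decode the raw page, then return the contents of the first <title> element."""
    html = page_bytes.decode(encoding)
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.DOTALL)
    return match.group(1) if match else None


def console_safe(text, console_encoding):
    """Replace characters the console cannot represent with '?' instead of garbling them."""
    return text.encode(console_encoding, errors="replace").decode(console_encoding)


# Invented sample standing in for the bytes returned by the HTTP request.
sample_page = "<html><head><title>That\u2019s the only explanation.</title></head></html>".encode("utf-8")

title = extract_title(sample_page)      # text, with the curly apostrophe intact
printable = console_safe(title, "cp850")  # the apostrophe becomes '?' on a cp850 console
```

Keeping the title as text until the last moment means the only lossy step is the final conversion for display, and that step degrades to a visible `?` rather than to three unrelated characters.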

Regular expression turns a quotation mark into strange symbols