Unable to remove escape characters from printed output

Asked: 2014-01-29 16:03:40

Tags: python escaping beautifulsoup html-escape-characters

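Hello, I'm trying to extract information into a list of plain-text strings, but I can't find a way to remove the escape characters.

I'm very new to Python and to programming. I've been trying to solve this for a while but haven't found a solution.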

Here is my code:

import urllib
import re
from bs4 import BeautifulSoup


x=1
while x<2:

    url = "http://search.insing.com/ts/food-drink/bars-pubs/bars-pubs?page=" +str(x)
    htmlfile = urllib.urlopen(url).read()
    soup = BeautifulSoup(htmlfile.decode('utf-8','ignore'))
    reshtml = soup.find("div", "results").find_all("h3")
    reslist = []
    for item in reshtml:
        res = item.get_text()
        reslist.append(res)

    print reslist
    x += 1

2 Answers:

Answer 0 (score: 1)

It looks like you are really after the anchor text there. Consider changing

reshtml = soup.find("div", "results").find_all("h3")

to:

reshtml = [h3.a for h3 in soup.find("div", "results").find_all("h3")]

Also change:

reslist.append(res)

to:

reslist.append(' '.join(res.split()))

This is what I get after those changes:

[u'Parco Caffe', u'AdstraGold Microbrewery & Bistro Bar', 
 u'Alkaff Mansion Ristorante', u'The Fat Cat Bistro', u'Gravity Bar', 
 u'The Wine Company (Evans Road)', u'Serenity Spanish Bar & Restaurant (VivoCity)', 
 u'The New Harbour Cafe & Bar', u'Indian Times', u'Sunset Bay Beach Bar',  
 u'Friends @ Jelita', u'Talk Cock Sing Song @ Thomson',  
 u'En Japanese Dining Bar (UE Square)', u'Magma German Wine Bistro',  
 u"Tam Kah Shark's Fin", u'Senso Ristorante & Bar',  
 u'Hard Rock Cafe (HPL House)', u'St. James Power Station',  
 u'The St. James', u'Brotzeit German Bier Bar & Restaurant (Vivocity)']
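To see why the `' '.join(res.split())` step cleans up the names, here is a minimal sketch that runs without fetching the page. The sample string is copied from the raw output for one entry, and `normalize_whitespace` is a hypothetical helper name used only for illustration.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, \n, \r, \t) and drops the empty pieces at both ends.
    return ' '.join(text.split())


raw = u'\n\r\n                The Wine Company\r\n                    (Evans Road)\r\n                \n'
print(normalize_whitespace(raw))
```

Because the split happens on whitespace runs rather than on individual characters, the indentation that follows each line break inside a name is reduced to a single space as well.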

Answer 1 (score: 0)

The current output looks like this:

[u'\n\r\n                Parco Caffe\n', 
 u'\n\r\n                AdstraGold Microbrewery & Bistro Bar\n', 
 u'\n\r\n                Alkaff Mansion Ristorante\n', 
 u'\n\r\n                The Fat Cat Bistro\n', 
 u'\n\r\n                Gravity Bar\n', 
 u'\n\r\n                The Wine Company\r\n                    (Evans Road)\r\n                \n', 
 u'\n\r\n                Serenity Spanish Bar & Restaurant\r\n                    (VivoCity)\r\n                \n', 
 u'\n\r\n                The New Harbour Cafe & Bar\n', 
 u'\n\r\n                Indian Times\n', 
 u'\n\r\n                Sunset Bay Beach Bar\n', 
 u'\n\r\n                Friends @ Jelita\n', 
 u'\n\r\n                Talk Cock Sing Song @ Thomson\n', 
 u'\n\r\n                En Japanese Dining Bar\r\n                    (UE Square)\r\n                \n', 
 u'\n\r\n                Magma German Wine Bistro\n', 
 u"\n\r\n                Tam Kah Shark's Fin\n", 
 u'\n\r\n                Senso Ristorante & Bar\n', 
 u'\n\r\n                Hard Rock Cafe\r\n                    (HPL House)\r\n                \n', 
 u'\n\r\n                St. James Power Station \n', 
 u'\n\r\n                The St. James\n', 
 u'\n\r\n                Brotzeit German Bier Bar & Restaurant\r\n                    (Vivocity)\r\n                \n']
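The `\n` and `\r` sequences in this listing are real newline and carriage-return characters inside the strings. Printing a list shows the `repr` of each element, which writes those characters as escape sequences. A small sketch using only the standard library (the sample string is taken from the output above; the variable names are illustrative):

```python
name = u'\n\r\n                Parco Caffe\n'

# repr() is what appears when the string is printed as part of a list:
# newline and carriage return are displayed as backslash escapes.
shown_in_list = repr([name])

# The string itself holds single control characters, not backslashes.
has_backslash = '\\' in name
first_char_is_newline = name[0] == '\n'

print(shown_in_list)
print(has_backslash, first_char_is_newline)
```

So there are no literal backslashes to delete; the task is to remove or collapse whitespace characters.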

Adding these lines before the print statement:

reslist = [y.replace('\n','').replace('\r','') for y in reslist]
reslist = [y.strip() for y in reslist]

gives me this output:

[u'Alkaff Mansion Ristorante', 
 u'Parco Caffe', 
 u'AdstraGold Microbrewery & Bistro Bar', 
 u'Gravity Bar', 
 u'The Fat Cat Bistro', 
 u'The Wine Company                    (Evans Road)', 
 u'Serenity Spanish Bar & Restaurant                    (VivoCity)', 
 u'The New Harbour Cafe & Bar', 
 u'Indian Times', 
 u'Sunset Bay Beach Bar', 
 u'Friends @ Jelita', 
 u'Talk Cock Sing Song @ Thomson', 
 u'En Japanese Dining Bar                    (UE Square)', 
 u'Magma German Wine Bistro', 
 u"Tam Kah Shark's Fin", 
 u'Senso Ristorante & Bar', 
 u'Hard Rock Cafe                    (HPL House)', 
 u'St. James Power Station', 
 u'The St. James', 
 u'Brotzeit German Bier Bar & Restaurant                    (Vivocity)']

Is this what you were looking for?

Guy's answer is much better, and makes fuller use of BeautifulSoup.
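The wide gaps that remain in names such as the Hard Rock Cafe entry come from the fact that `replace` removes only the line-break characters and `strip` trims only the ends, so the indentation that followed each line break stays in the middle of the string. One possible way to collapse it as well is a regular expression. The sketch below is an illustration, not part of the original answer; `strip_only` and `clean_name` are hypothetical helper names.

```python
import re


def strip_only(text):
    """Drop line-break characters and trim the ends, as in the two lines above."""
    return text.replace('\n', '').replace('\r', '').strip()


def clean_name(text):
    """Replace each run of whitespace with one space, then trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


raw = u'\n\r\n                Hard Rock Cafe\r\n                    (HPL House)\r\n                \n'
print(repr(strip_only(raw)))  # interior run of spaces is still present
print(repr(clean_name(raw)))  # interior run is reduced to one space
```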