Python strip function not working properly

Date: 2014-03-28 08:40:17

Tags: python web-scraping strip

I am scraping some data from a website with Python.

I want to do two things:

  1. I want to skip the first two words, "Dubai" and "UAE", which are common to every page's results.
  2. I want to save the last two words in two different variables, using strip, without the extra whitespace.

        try:
            # grab the <div class="location"> element and pull out its text
            area = soup.find('div', 'location')
            area_result = str(area.get_text().strip().encode("utf-8"))
            print "Area: ", area_result
        except StandardError as e:
            area_result = "Error was {0}".format(e)
            print area_result
    
  3. area_result contains the following data:

    'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'
    

    I would like the above result to appear as follows (note the > between Executive Towers and 1.4 km ...):

    Executive Towers > 1.4 km from Burj Khalifa Tower
    

2 answers:

Answer 0 (score: 2):

area_result = area_result.replace("UAE", "")    # drop the literal word "UAE"
area_result = area_result.replace("Dubai", "")  # drop the literal word "Dubai"
area_result = area_result.strip()               # strip() only trims whitespace at the two ends
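
Note that this only removes the two words themselves: the embedded newlines and the invisible \xe2\x80\xaa (U+202A) bytes are still in the middle of the string, since strip() only trims the ends.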

Using a regular expression:

import re
area_result = re.sub(r'\s+', ' ', area_result)  # collapse every run of whitespace into a single space
area_result = area_result.replace("UAE ‪>‪ Dubai ‪>‪", "")
area_result =  area_result.strip()
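
To also cover the second part of the question (the last two pieces in two separate variables), here is a minimal sketch building on the regex-cleaned area_result above; the names tower and distance are only illustrative. The key detail is that \xe2\x80\xaa is U+202A (LEFT-TO-RIGHT EMBEDDING), which is not whitespace, so a bare strip() does not remove it:

# assumes area_result is now
# 'Business Bay \xe2\x80\xaa>\xe2\x80\xaa Executive Towers ; 1.4 km from Burj Khalifa Tower'
before, after = area_result.split(';', 1)
distance = after.strip()                              # '1.4 km from Burj Khalifa Tower'
tower = before.split('>')[-1].strip('\xe2\x80\xaa ')  # strip the U+202A bytes and spaces explicitly
print "%s > %s" % (tower, distance)                   # Executive Towers > 1.4 km from Burj Khalifa Tower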

Answer 1 (score: 0):

import string
def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove: continue
        elif not c in string.printable: continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'): continue
        newString += c
    return newString

To clean things up, just throw that in there. The end result is:

>>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'
>>> cleanup(s)
'UAE > Dubai > Business Bay > Executive Towers 1.4 km from Burj Khalifa Tower'

Here is a good SO reference on the string library.

Coming back to the question: since the user does not want the first two blocks (between the > separators) to be there, it is quite simple:

area_result = cleanup(area_result).split('>')[3].replace(';', '>')  # keep the last '>'-separated block and turn the ';' back into '>'
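
If the two pieces are also wanted in separate variables, as the question asks, a small follow-up sketch; it assumes the cleaned string still contains the ';' separator (which the replace(';', '>') above implies), and the names tower and distance are only illustrative:

block = cleanup(area_result).split('>')[3]    # ' Executive Towers ; 1.4 km from Burj Khalifa Tower'
tower, distance = [part.strip() for part in block.split(';', 1)]
print "%s > %s" % (tower, distance)           # Executive Towers > 1.4 km from Burj Khalifa Tower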