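I am scraping some data from a website with Python.
I want to do two things.
I want to save the last two words in two different variables, stripped so that they carry no extra whitespace.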
try:
    area = soup.find('div', 'location')
    area_result = str(area.get_text().strip().encode("utf-8"))
    print "Area: ", area_result
except StandardError as e:
    area_result = "Error was {0}".format(e)
    print area_result
area_result contains the following data:
'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
I would like the result above to be displayed as follows (note the > between Executive Towers and 1.4 km):
Executive Towers > 1.4 km from Burj Khalifa Tower
Answer 0 (score: 2)
area_result = area_result.replace("UAE", "")
area_result = area_result.replace("Dubai", "")
area_result = area_result.strip()
Using a regular expression:
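import re
# Drop the \xe2\x80\xaa bytes (UTF-8 for U+202A) around each '>';
# otherwise the literal "UAE > Dubai >" below never matches.
area_result = area_result.replace('\xe2\x80\xaa', '')
# Collapse every run of whitespace into a single space.
area_result = re.sub(r'\s+', ' ', area_result)
area_result = area_result.replace("UAE > Dubai >", "")
area_result = area_result.strip()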
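On Python 3, where the scraped text is a str rather than a byte string, the same whitespace-collapsing idea could look roughly like this. It is only a sketch: the normalize name and the shortened sample are made up for illustration, and it assumes the \xe2\x80\xaa bytes in the question are the UTF-8 encoding of U+202A.

```python
import re

# Shortened sample, written as decoded text (assumption: the
# \xe2\x80\xaa bytes in the question decode to U+202A).
raw = (
    "UAE \u202a>\u202a\n \n Dubai \u202a>\u202a\n \n Business Bay \u202a>\u202a\n \n"
    " Executive Towers \n \n\n\t \n ;\n \n 1.4 km from Burj Khalifa Tower"
)


def normalize(text):
    """Drop U+202A marks and collapse each whitespace run to one space."""
    text = text.replace("\u202a", "")
    return re.sub(r"\s+", " ", text).strip()


normalized = normalize(raw)
print(normalized)
```

Removing the marks before collapsing whitespace matters, because they sit right next to each > and would otherwise keep a plain "UAE > Dubai >" literal from matching.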
Answer 1 (score: 0)
import string

def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove:
            continue
        elif c not in string.printable:
            continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'):
            continue
        newString += c
    return newString
For cleaning up the string, how about throwing something like this in there? The end result is:
>>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
>>> cleanup(s)
'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'
Here is a good SO reference on the string library.
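Coming back to the question: since the user does not want the leading blocks (the ones separated by >) to remain, this is very simple: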
area_result = cleanup(area_result).split('>')[3].replace(';', '>').strip()
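Since the question also asks for the last two pieces in two separate variables, here is a rough Python 3 sketch of that step. The last_two_parts helper and the shortened sample are hypothetical, and it assumes that > separates the breadcrumb levels and that a single ; separates the last level from the distance note, as in the sample data.

```python
import re


def last_two_parts(text):
    """Return (place, distance) from breadcrumb-style text.

    Assumes '>' separates the levels and ';' separates the last level
    from the distance note, as in the sample from the question.
    """
    # Drop U+202A marks, then collapse whitespace runs to single spaces.
    text = re.sub(r"\s+", " ", text.replace("\u202a", "")).strip()
    # Everything after the final '>' holds both pieces we want.
    last_level = text.split(">")[-1]
    place, _, distance = last_level.partition(";")
    return place.strip(), distance.strip()


sample = (
    "UAE \u202a>\u202a\n \n Dubai \u202a>\u202a\n Business Bay \u202a>\u202a\n"
    " Executive Towers \n\t \n ;\n \n 1.4 km from Burj Khalifa Tower"
)
place, distance = last_two_parts(sample)
print("{} > {}".format(place, distance))
```

Taking the last element with [-1] instead of a fixed index keeps this working if the breadcrumb has more or fewer levels, though it still depends on the ; being present.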