From a file

Time: 2018-03-24 16:55:02

Tags: python-2.7 ascii non-ascii-characters python-unicode non-unicode

I know this is a duplicate question, but I have tried all the solutions I could find without success. Can anyone help me remove characters like \xc3\xa2\xc2\x84\xc2\xa2 from a file?

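The content of the file I am trying to clean is: b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""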

I tried using re.sub('[^\x00-\x7F]+', '', whatevertext), but it does not seem to get me anywhere. I suspect the \ is not being treated as a special character here.

1 answer:

Answer 0: (score: 1)

You can do it like this:

>>> f = open("test.txt","r")
>>> whatevertext = f.read()
>>> print whatevertext
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""

>>> import re
>>> result = re.sub('\\\\x[a-f|0-9]+','',whatevertext)
>>> print result
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves Potato Chips for serving']"""

>>> 

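In the regular expression '\\\\x[a-f|0-9]+', each backslash is escaped with another backslash, so the pattern matches a literal backslash followed by x. After the x we know there can be digits 0-9 or letters a-f. Note that inside a character class the | is a literal character rather than alternation, so [a-f0-9] would do the same job here.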
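
As a rough sketch of why the pattern from the question had no effect while this one does: the file appears to hold the escape sequences as literal ASCII text (a backslash, an x, and two hex digits) rather than as real non-ASCII bytes, so '[^\x00-\x7F]+' finds nothing to match. The snippet below is written for Python 3 rather than the Python 2.7 used above, and the names strip_literal_hex_escapes, HEX_ESCAPE and sample are hypothetical, introduced only for illustration. It also uses a slightly stricter variant of the pattern (exactly two hex digits after the x), which is an assumption about the data and not part of the original answer.

```python
import re

# A literal backslash, then 'x', then exactly two hex digits, e.g. the
# four characters \xc3 as they appear in the file's text.
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')


def strip_literal_hex_escapes(text):
    """Remove hex escape sequences that are stored as plain ASCII text."""
    return HEX_ESCAPE.sub('', text)


# Raw string: the backslashes are real characters here, just as they
# would be after reading the text from the file.
sample = r"'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving'"

# Every character in sample is already ASCII, so the pattern from the
# question leaves the text unchanged.
ascii_only = re.sub(r'[^\x00-\x7F]+', '', sample)
print(ascii_only == sample)  # prints True

print(strip_literal_hex_escapes(sample))
# prints 'Cape Cod Waves Potato Chips for serving'
```

Limiting the match to two hex digits means an ordinary letter such as a or f that happens to follow an escape sequence is left alone, whereas a + quantifier would consume it as well.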