Python: how to get rid of non-ascii characters being read from a file

Time: 2017-07-12 07:56:46

Tags: python encoding replace data-processing digraphs

I am processing, with Python, a long list of data that looks like this:

[image: data screenshot]

The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved on this site.)

29/07/2016 04:00:12 0.125143    

Now, when I read such a file into a script using something like open and readlines, I get the following error:

SyntaxError: EOL while scanning string literal

I know (or can look up) how to use replace and the regex functions, but I cannot apply them in my script. The biggest problem is that wherever I include or read such a strange character, an error occurs that points to the very line where it is read, so I cannot do anything with these characters.

2 answers:

Answer 0 (score: 1)

Are you reading a file? If so, try extracting the values with regexps rather than removing the extra characters:

import re
re.search(r'^([\d/: ]{19})', line).group(1)  # date and time, e.g. '29/07/2016 04:00:12'
re.search(r'(\d+\.\d+)', line).group(1)  # full decimal value, e.g. '0.125143'
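
A minimal sketch of how this approach might be applied to a whole file. The names `parse_line` and `read_rows`, the UTF-8 encoding, and the slightly loosened patterns are assumptions for illustration, not part of the answer above. Opening the file with `errors="replace"` keeps undecodable bytes from raising `UnicodeDecodeError`, and the regexps then pick out only the fields of interest:

```python
import re

# Assumed patterns: a dd/mm/yyyy date, any run of non-digits, then hh:mm:ss,
# and a decimal number somewhere later on the line.
TIMESTAMP_RE = re.compile(r'(\d{2}/\d{2}/\d{4})\D+(\d{2}:\d{2}:\d{2})')
VALUE_RE = re.compile(r'(\d+\.\d+)')

def parse_line(line):
    """Return (timestamp, value) as strings, or None if the line is not a data row."""
    ts = TIMESTAMP_RE.search(line)
    val = VALUE_RE.search(line)
    if ts is None or val is None:
        return None
    return ts.group(1) + ' ' + ts.group(2), val.group(1)

def read_rows(path):
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded,
    # so stray bytes never abort the read.
    with open(path, encoding="utf-8", errors="replace") as f:
        return [row for row in map(parse_line, f) if row is not None]
```

Lines that do not contain both fields are skipped instead of raising `AttributeError` on a failed match.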

Answer 1 (score: 0)

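I found that re.findall works. (Sorry, I did not have time to test all the other approaches, because the work lost its purpose and I even forgot about this question.)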

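import re

def extract_numbers(str_i):
    # Groups: day, month, year, hour, minute, second, then the integer and
    # fractional parts of the value; \D* skips any stray characters in between
    pat = r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
    match_h = re.findall(pat, str_i)
    return match_h[0]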

# ....
# `f` is the handle of the file in question
lines = f.readlines()
for l in lines:
    ls_f = extract_numbers(l)
    # process them....
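
If the goal is literally to drop the non-ASCII characters, as the title asks, one option is to round-trip each line through an ASCII encode with `errors="ignore"`. The tuple of strings that the pattern captures can then be turned into typed values. This is only a sketch: the helper names `strip_non_ascii` and `to_record` are hypothetical, and the day/month/year order is an assumption based on the sample line in the question.

```python
import re
from datetime import datetime

PAT = re.compile(r"(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)")

def strip_non_ascii(text):
    # Characters outside the ASCII range are dropped, not replaced
    return text.encode("ascii", errors="ignore").decode("ascii")

def to_record(line):
    """Return (datetime, float) for a data line, or None if nothing matches."""
    match = PAT.search(strip_non_ascii(line))
    if match is None:
        return None
    day, month, year, hour, minute, second, whole, frac = match.groups()
    stamp = datetime(int(year), int(month), int(day),
                     int(hour), int(minute), int(second))
    return stamp, float(whole + "." + frac)
```

Returning `None` for non-matching lines avoids the `IndexError` that indexing an empty `re.findall` result would raise.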