Search and replace the cent sign in Python

Time: 2014-08-01 02:23:07

Tags: python python-2.7 sed centos6

OS: CentOS 6.5
Python version: 2.7.5

I have a file containing the sample data below. I want to search for the cent sign (¢) and replace it with $0.

Alpha $1.00
Beta  ¢55  <<<< note
Charlie $2.00
Delta  ¢23  <<<< note

I want it to look like this:

Alpha $1.00
Beta  $0.55  <<<< note
Charlie $2.00
Delta  $0.23  <<<< note

The command that works on the command line is:

sed 's/¢/$0./g' *file name*

However, doing the same thing from Python does not work:

import subprocess
hello = subprocess.call('cat datafile ' + '| sed "s/¢/$0./g"',shell=True)
print hello

It seems to fail whenever I try to paste the ¢ symbol.

Getting a little closer: when I print the Unicode code point for the cent sign in Python, it comes out as below:

print(u"\u00A2")
¢

When I cat my data file, it actually shows up as the ¢ symbol, without the Â. << Not sure if this helps at all.

I think that when I try to use Unicode, the extra symbol added before the ¢ is what stops my search and replace from matching.
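A stray Â in front of ¢ is what the UTF-8 encoding of the cent sign looks like when its two bytes are displayed as ISO-8859-1. The following is a minimal sketch of that effect, not code from the original post; the variable names are illustrative only, and it is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

cent = u"\u00a2"                            # the cent sign as a Unicode string

utf8_bytes = cent.encode("utf-8")           # two bytes: 0xC2 0xA2
latin1_bytes = cent.encode("iso-8859-1")    # one byte: 0xA2

# Reading the UTF-8 bytes as ISO-8859-1 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A2 (cent sign).
misread = utf8_bytes.decode("iso-8859-1")

print(repr(utf8_bytes), repr(latin1_bytes), repr(misread))
```

So a search pattern encoded as UTF-8 (two bytes) will not match a file that stores the cent sign as the single byte 0xA2.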

The error when trying Unicode:

hello = subprocess.call(u"cat datafile | sed 's/\uxA2/$0./g'",shell=True)
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 25-26: truncated \uXXXX escape

After fixing uxA2 to u00A2, I get:

sed: -e expression #1, char 7: unknown option to `s'
1

Any ideas or thoughts?

With both examples I get the following errors:

[root@centOS user]# python test2.py
Traceback (most recent call last):
  File "test2.py", line 3, in <module>
    data = data.decode('utf-8')             # decode immediately to Unicode
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte

[root@centOS user]# python test1.py
Traceback (most recent call last):
  File "test1.py", line 11, in <module>
    hello_unicode = hello_utf8.decode('utf-8')
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 6: invalid start byte
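Both tracebacks report the same condition: a lone byte 0xa2 is not a valid start of a UTF-8 sequence, although it is the cent sign in ISO-8859-1. A hedged sketch of decoding with a fallback follows; the helper name `decode_with_fallback` and the choice of ISO-8859-1 as the fallback are assumptions made for illustration, not part of the original scripts.

```python
from __future__ import print_function


def decode_with_fallback(raw, fallback="iso-8859-1"):
    """Decode bytes as UTF-8, falling back to another encoding on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte value,
        # so the default fallback always succeeds.
        return raw.decode(fallback)


# Same shape as the sample line: byte 0xa2 sits at position 6.
raw_line = b"alpha \xa279\n"
text = decode_with_fallback(raw_line)
fixed = text.replace(u"\u00a2", u"$0.")
print(fixed, end="")
```

Guessing an encoding this way is only a heuristic; when the real encoding of the file is known, passing it explicitly is the safer choice.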

Here is the cat of the file:

[root@centOS user]# cat datafile
alpha ¢79 

Here is the data file in Nano:

alpha �79

Here is the data file in Vim:

[root@centOS user]# vim fbasdf
alpha ¢79
~

Thanks again to everyone for the help.

ANSWER !!

The sed commands from Rob and Thomas work. The file had been saved with charset=iso-8859-1, so I could not search the document for a UTF-8 encoded character.

Identify the file's character set:

file -bi datafile
text/plain; charset=iso-8859-1

Convert the file with the following command:

iconv -f iso-8859-1 -t utf8 datafile > datafile1
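The same conversion can be approximated in Python. This is a rough sketch under stated assumptions rather than a replacement for iconv: the function name `transcode` is hypothetical, and the demo writes a throwaway file in a temporary directory instead of touching the real datafile.

```python
import io
import os
import tempfile


def transcode(src_path, dst_path,
              src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Rewrite src_path in dst_encoding, roughly like iconv -f ... -t ..."""
    # newline="" disables newline translation so line endings are preserved.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Demo on a throwaway ISO-8859-1 file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "datafile")
dst = os.path.join(workdir, "datafile1")
with open(src, "wb") as handle:
    handle.write(b"alpha \xa279\n")

transcode(src, dst)
with open(dst, "rb") as handle:
    converted = handle.read()
```

After the conversion the cent sign is stored as the two bytes 0xC2 0xA2, which is what a UTF-8 search pattern expects.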

2 answers:

Answer 0: (score: 1)

Stealing Thomas's answer and expanding on it:

import subprocess

# Keep all strings in unicode as long as you can.
cmd_unicode = u"sed 's/\u00A2/$0./g' < datafile"

# only convert them to encoded byte strings when you send them out
# also note the use of .check_output(), NOT .call()
cmd_utf8 = cmd_unicode.encode('utf-8')
hello_utf8 = subprocess.check_output(cmd_utf8, shell=True)

# Decode any incoming byte string to unicode immediately on receipt
hello_unicode = hello_utf8.decode('utf-8')

# And you have your answer
print hello_unicode

The code above demonstrates the "Unicode sandwich": bytes on the outside, Unicode on the inside. See http://nedbatchelder.com/text/unipain.html
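The note about .check_output() in the code above matters because subprocess.call() only returns the child's exit status, which is why printing its result shows a number such as 0 or 1 rather than the text. A small sketch of the difference follows; the child command is a harmless stand-in for the sed pipeline and is an assumption made for the demo.

```python
from __future__ import print_function
import subprocess
import sys

# A harmless child process standing in for the sed pipeline (demo assumption).
child = [sys.executable, "-c", "print('Beta  $0.55')"]

# call(): the child's output goes straight to our stdout; we get the exit status.
status = subprocess.call(child)

# check_output(): the child's output is captured and returned as bytes.
output = subprocess.check_output(child)

print(status)
print(output.decode("utf-8").strip())
```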

For this simple example, you could easily do everything in Python:

with open('datafile') as datafile:
    data = datafile.read()              # Read in bytes
data = data.decode('utf-8')             # decode immediately to Unicode
data = data.replace(u'\xa2', u'$0.')    # Do all operations in Unicode
print data                              # Implicit encode during output 

Answer 1: (score: 0)

Also, change your string to a unicode string and replace the cent sign with \u00A2

Here is the fixed code:

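import subprocess
# Single quotes around the sed script keep the shell from expanding $0.
# Note: subprocess.call() returns sed's exit status, not its output.
hello = subprocess.call(u"cat datafile | sed 's#\u00A2#$0.#g'", shell=True)
print hello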
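When a sed script containing $0 is passed through a shell, the kind of quotes around it matters: inside double quotes the shell expands $0 to its own name before sed ever sees the script, while single quotes pass it through untouched. The sketch below illustrates this with echo standing in for sed; it assumes a POSIX shell is available, and it is offered as one plausible explanation for a sed "unknown option to `s'" error, not a confirmed diagnosis of the original one.

```python
from __future__ import print_function
import subprocess

# echo stands in for sed; only the quoting differs between the two commands.
single = subprocess.check_output("echo 's/x/$0./g'", shell=True)
double = subprocess.check_output('echo "s/x/$0./g"', shell=True)

print(single.decode("utf-8").strip())   # the script exactly as written
print(double.decode("utf-8").strip())   # $0 replaced by the shell's own name
```

If the expanded name contains a slash (for example a path to the shell), it adds extra delimiters to an s/.../.../ command, which sed then rejects.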