I use the following command to reformat a file and create a new one:
sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto > toto.json
It works fine from the command line.
I tried to use it from a Python script, but it doesn't create the new file.
I tried:
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1], " > ",sys.argv[2]])
The problem is: it prints the output to stdout and raises this error:
sed: can't read >: No such file or directory
Traceback (most recent call last):
File "test.py", line 14, in <module>
subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/",
sys.argv[1], ">", sys.argv[2]])
File "C:\Users\Anaconda3\lib\subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sed', '-e', '1s/^/[/', '-e',
's/$/,/', '-e', '$s/,$/]/', 'toto.txt', '>', 'toto.json']' returned non-zero
exit status 2.
I've read other subprocess questions and tried other variants with the shell=True option, but that didn't work either. I'm using Python 3.6.
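For what it's worth, shell=True can work with redirection, but then the whole command must be passed as a single string so the shell gets to interpret the >. A sketch (the function name is mine; shlex.quote is used defensively for odd filenames, and a sed binary must be on PATH):

```python
import shlex
import subprocess
import sys

def sed_to_json(src, dst):
    # With shell=True, pass ONE string: the shell then interprets '>'.
    # shlex.quote protects against spaces or metacharacters in filenames.
    cmd = "sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' {} > {}".format(
        shlex.quote(src), shlex.quote(dst))
    return subprocess.call(cmd, shell=True)

if __name__ == "__main__":
    sed_to_json(sys.argv[1], sys.argv[2])
```

The list form from the traceback can never work this way: without a shell, sed receives > as a literal filename, which is exactly the "can't read >" error above.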
For reference: the command adds a bracket to the first and last lines and a comma to the end of every line except the last. So it turns:
a
b
c
into:
[a,
b,
c]
Answer 0 (score: 2)
On Linux and other Unix systems, the redirection character is not part of the command: it is interpreted by the shell, so passing it as an argument to the subprocess makes no sense.
Luckily, subprocess.call allows its stdout argument to be a file object. So you should do this:
subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", sys.argv[1]],
                stdout=open(sys.argv[2], "w"))
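A slightly more defensive variant of the same idea, sketched here (the function name is mine): check_call raises if sed exits non-zero, and the with block flushes and closes the output file deterministically.

```python
import subprocess
import sys

def bracketize_with_sed(src, dst):
    # The file object passed as stdout receives what sed would have
    # printed to the terminal; check_call raises CalledProcessError
    # if sed exits with a non-zero status.
    with open(dst, "w") as out:
        subprocess.check_call(
            ["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", src],
            stdout=out)

if __name__ == "__main__":
    bracketize_with_sed(sys.argv[1], sys.argv[2])
```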
Answer 1 (score: 0)
If you're using Python, write a pythonic Python script.
Something like:
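A minimal sketch of such a script (the function name and file-argument convention are mine; this reads the whole file at once, so it assumes the input fits comfortably in memory):

```python
import sys

def bracketize(in_path, out_path):
    # Strip the newline from each line, join with ",\n", wrap in brackets.
    with open(in_path) as f_in:
        lines = [line.rstrip("\n") for line in f_in]
    with open(out_path, "w") as f_out:
        f_out.write("[" + ",\n".join(lines) + "]")

if __name__ == "__main__":
    bracketize(sys.argv[1], sys.argv[2])
```

Run it the same way as the sed version, e.g. python script.py toto toto.json.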
It does basically the same thing as the command line.
Mind you, this code is a bit dirty and only meant as an example. I'd suggest writing it yourself and, as I do, avoiding one-liners.
Answer 2 (score: 0)
I had a hunch that Python could do this job faster than sed, but until now I hadn't had the time to check, so... Based on your comment on Arount's answer:

    My real file is actually very big; the command line is faster than a python script

That is not necessarily true. In fact, in your case I suspect Python can be many times faster than sed, because with Python you are neither limited to iterating over the file through a line buffer, nor do you need a full regex engine just to find the line separators.
I'm not sure how big your file is, but I generated my test sample as:
with open("example.txt", "w") as f:
    for i in range(10**8):  # I would consider 100M lines as "big" enough for testing
        print(i, file=f)
This creates a file 100M lines long (888.9MB), with a different number on each line.
Now, timing just the sed command, run at the highest priority (chrt -f 99), results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
User time (seconds): 56.89
System time (seconds): 1.74
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1044
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 313
Voluntary context switches: 7
Involuntary context switches: 29
Swaps: 0
File system inputs: 1140560
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
If you were actually calling it from Python, the result would be even worse, as it would also carry the subprocess and STDOUT redirection overhead.
But if we leave all the work to Python instead of sed:
import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from the second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        last_chunk = ""  # keep track of the last chunk so we can remove the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the previous chunk
                last_chunk = chunk.replace("\n", ",\n")  # process the new chunk
            else:  # EOF
                break
        last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
        if last_chunk.endswith(","):  # clear out the trailing comma (also guards the empty-file case)
            last_chunk = last_chunk[:-1]
        f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array
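One subtlety worth convincing yourself of: a chunk boundary can fall in the middle of a line, which is safe because replace() only rewrites complete "\n" characters, wherever they land. A reduced sketch of the same logic (function name and tiny chunk size are mine; the small chunk size forces boundaries to land mid-line):

```python
def chunked_bracketize(src, dst, chunk_size=3):
    # Same algorithm, with a tiny chunk size so boundaries land
    # mid-line and the trailing-comma fixup is exercised.
    with open(dst, "w") as f_out:
        f_out.write("[")
        with open(src) as f_in:
            last_chunk = ""
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:  # EOF
                    break
                f_out.write(last_chunk)  # flush the previous chunk
                last_chunk = chunk.replace("\n", ",\n")
            last_chunk = last_chunk.rstrip()  # drop trailing whitespace
            if last_chunk.endswith(","):  # drop the trailing comma
                last_chunk = last_chunk[:-1]
            f_out.write(last_chunk)
        f_out.write("]")
```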
Without ever touching a shell, this results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
Command being timed: "python process_file.py example.txt output.txt"
User time (seconds): 1.75
System time (seconds): 0.72
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4716
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 14835
Voluntary context switches: 16
Involuntary context switches: 0
Swaps: 0
File system inputs: 3120
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Given the utilization figures, the bottleneck here is actually I/O: left to its own devices (or reading from very fast storage instead of the virtualized HDD of my testbed), Python could do this even faster.
So sed needed 32.5 times as long as Python to perform the same task. Even if you optimized your sed invocation, Python would still work faster, because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the numbers in the benchmarks above), and there is no (easy) way around that.

Conclusion: for this particular task, Python is faster than sed.