Question

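I am trying to convert several HTML files to LaTeX using Python and pandoc, and I have run into a problem.

To let my Python script talk to pandoc, I use subprocess.Popen and redirect stdout into a file that I save so it can be included in a LaTeX template.

If I use Popen in the classic way: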

from subprocess import Popen, PIPE, STDOUT

filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
filedesc.write(outp)

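I get lines with an extra newline where there should not be one:

> \textbf{M. John Harrison} (Rugby, Warckwickshire, 1945) is a contemporary
>
> English writer.

This is easily solved by passing the open file as stdout instead of stdout=PIPE: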

from subprocess import Popen, PIPE, STDOUT

filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=filedesc, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
# not needed
# filedesc.write(outp)

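But if I want to use a string buffer instead, the same problem appears, because I cannot pass a string buffer as the stdout argument.

Any ideas on how to stop Popen/pandoc from doing this?

Thanks!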

Answer 1

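Well, it seems to be a kind of "bug" in Python's PIPE (???).

I am running this code on Windows. That means newlines are written in the CR+LF (\r\n) style rather than the (cleaner) Unix LF (\n) style.

When I feed in a large HTML text for pandoc to convert, the pipe returns the output as it would be sent to the command line. So every time the standard column width is reached, an ugly "newline" sequence is inserted; in my case, CR+LF. This is what makes my output look strange.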
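One plausible way the stray line ends up in the file, offered as a sketch under assumptions rather than a confirmed diagnosis: if the text coming back through the pipe already contains CR+LF and is then written to a file opened in text mode on Windows, each `\n` is translated to `\r\n` a second time, so the file ends up with `\r\r\n`, which many editors show as an extra line. The snippet below simulates that translation on any platform by passing `newline='\r\n'` to `open` (Python 3). The sample string and the file name `demo.tex` are made up for illustration.

```python
import os
import tempfile

# Hypothetical pandoc output that already uses Windows line endings.
pandoc_output = "is a contemporary\r\nEnglish writer.\r\n"

path = os.path.join(tempfile.mkdtemp(), "demo.tex")

# newline='\r\n' mimics Windows text mode: each '\n' written becomes '\r\n'.
with open(path, "w", newline="\r\n") as f:
    f.write(pandoc_output)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(repr(raw))
```

Each line in `raw` ends in `\r\r\n`, which matches the symptom of an unexpected blank line between wrapped lines.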

The dirty solution I implemented is to call replace('\r\n','\n') on the output before writing it, though I am not sure it is the most elegant one.

from subprocess import Popen, PIPE, STDOUT

html = '<p><b>Some random html code</b> longer than 80 columns ... </p>'
filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
filedesc.write(outp.replace('\r\n','\n'))
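
As a possible alternative to the manual replace, here is a minimal Python 3 sketch that lets subprocess normalize the line endings itself: with `text=True` (the newer spelling of `universal_newlines=True`), CR+LF in the child's output is read back as a plain `\n`. The helper name `run_filter` is hypothetical. The demo uses a tiny Python child process as a stand-in for pandoc so that it runs without pandoc installed; passing `['pandoc', '-f', 'html', '-t', 'latex']` instead is the assumed real use. The result is collected in an `io.StringIO` buffer, which is what the question asked for.

```python
import io
import subprocess
import sys

def run_filter(args, text_in):
    """Pipe text_in through the command in args and return its stdout as str.

    text=True opens the pipes in text mode with universal newlines,
    so CR+LF and bare CR in the child's output come back as LF.
    """
    result = subprocess.run(
        args,
        input=text_in,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout

# Stand-in for pandoc (assumption): reads stdin, then emits CR+LF line endings.
fake_pandoc = [
    sys.executable,
    "-c",
    "import sys; sys.stdin.read(); "
    "sys.stdout.buffer.write(b'line one\\r\\nline two\\r\\n')",
]

buffer = io.StringIO()
buffer.write(run_filter(fake_pandoc, "<p>ignored by the stand-in</p>"))
print(repr(buffer.getvalue()))
```

When the buffer is later written to disk, opening the file with `open('myfile.tex', 'w', newline='\n')` should keep Windows from translating the line endings again. If the line wrapping itself is unwanted, recent pandoc versions also accept `--wrap=none`.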
