I downloaded a book online in PDF format and want to use it in my iOS project. The format I need is XML, like this:
<q>question here</q>
<a>answer here</a>
<q>question2</q>
<a>answer2</a>
The PDF is laid out like this:
the question is centered
the answer has several paragraphs that start with 4 white space.
This is another paragraph
This is the second question and so on
This is the answer to the second question
The third question and there may be a blank line above
This is the 4th question and no blank line above
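I tried converting the PDF to txt with Word / Pages and reading the text line by line, but I cannot tell the questions from the answers. Another problem is that the conversion turns the PDF's automatic line wraps into hard line breaks.
Note: the process is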
pdf -> use word/pages -> txt -> python program -> xml -> python program -> sqlite database
The key part is how to convert the PDF into a correct XML file.
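The later xml -> sqlite step of this pipeline is the simpler part. A minimal sketch of it, assuming the XML file is a flat sequence of `<q>`/`<a>` elements as shown above (so it has no single root element and must be wrapped before parsing); the table name `qa`, its columns, and the function names are hypothetical:

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_pairs(xml_fragment):
    """Parse a flat <q>/<a> sequence into (question, answer) tuples."""
    # The fragment has no single root element, so wrap it in one first.
    root = ET.fromstring("<root>%s</root>" % xml_fragment)
    questions = [q.text or "" for q in root.findall("q")]
    answers = [a.text or "" for a in root.findall("a")]
    # Assumption: questions and answers alternate one-to-one.
    return list(zip(questions, answers))


def store_pairs(conn, pairs):
    """Insert the pairs into a hypothetical `qa` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa ("
        "id INTEGER PRIMARY KEY, "
        "question TEXT NOT NULL, "
        "answer TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO qa (question, answer) VALUES (?, ?)", pairs)
    conn.commit()
```

Usage would be something like `store_pairs(sqlite3.connect("book.db"), load_pairs(open("book.xml").read()))`, where both file names are placeholders.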
Answer 0: (score: 0)
IMHO, you can find a usable, open-source, friendly PDF viewer on github.com or elsewhere. Then you can parse the text it converts and generate the XML.
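The parsing step could look roughly like the sketch below. It is not a definitive implementation and rests on assumptions that are not confirmed in the question: the extracted text keeps its leading whitespace, centered question lines are indented by at least `QUESTION_MIN_INDENT` spaces (a made-up threshold), answer paragraphs start with a smaller indent such as 4 spaces, and unindented lines are wrapped continuations of the previous paragraph. The function names are hypothetical.

```python
from xml.sax.saxutils import escape

# Assumption: centered question lines are indented at least this much.
QUESTION_MIN_INDENT = 8


def parse_qa(text):
    """Return a list of (question, answer_paragraphs) tuples."""
    pairs = []
    question_parts = []
    paragraphs = []
    state = None  # "q" while reading a question, "a" while reading an answer

    def flush():
        # Store the finished question/answer pair, then reset the buffers.
        if question_parts:
            pairs.append((" ".join(question_parts), list(paragraphs)))
        question_parts.clear()
        paragraphs.clear()

    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # blank lines carry no reliable information here
        indent = len(line) - len(line.lstrip(" "))
        content = line.strip()
        if indent >= QUESTION_MIN_INDENT:
            if state != "q":
                flush()  # a new question closes the previous pair
                state = "q"
            question_parts.append(content)
        elif indent > 0:
            paragraphs.append(content)  # indented line starts a new paragraph
            state = "a"
        elif state == "a" and paragraphs:
            paragraphs[-1] += " " + content  # undo a hard line wrap
        elif state == "q":
            paragraphs.append(content)  # unindented first answer line
            state = "a"
    flush()
    return pairs


def to_xml(pairs):
    """Render the pairs in the flat <q>/<a> format from the question."""
    lines = []
    for question, paras in pairs:
        lines.append("<q>%s</q>" % escape(question))
        lines.append("<a>%s</a>" % escape("\n".join(paras)))
    return "\n".join(lines)
```

If the converter drops leading whitespace entirely, this heuristic has nothing to work with, so the choice of extraction tool matters more than the parser itself.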