How to copy text from a PDF file to a text file without preserving the original formatting

Date: 2018-07-18 16:37:08

Tags: r

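I have a PDF file that I want to extract text from. However, I do not want to keep the spacing of the PDF. I want the text to look as if I had manually copied and pasted the lines from the PDF. That would remove the tabs and spaces that look nice but only complicate my text file.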

For example, if I extract the text with R in the usual way, I get something formatted like this:

                             This is the title
                             of this document
1.0 Hello my name is John and blah balh blah blah blah.
        1.1 blah blah blah blah

If I just copy and paste manually, I get something like this:

This is the title of this document
1.0 Hello my name is John and blah balh blah blah blah.
1.1 blah blah blah blah blah

I would like to know whether there is a way to do this with code in R, rather than copying and pasting by hand.

A real example of such a PDF is https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf

If I manually copy and paste part of page 228 (the third page of the PDF), I get:

Oil and the Macroeconomy since World War 11
James D. Hamilton
University (f/' Virgiiwa
All but one of the U.S. recessions since World War II have been
preceded, typically with a lag of around three-fourths of a year, by a
dramatic increase in the price of crude petroleum. This does not
mean that oil shocks caused these recessions. Evidence is presented,
however, that even over the period 1948-72 this correlation is statistically
significant and nonspurlious, supporting the proposition that
oil shocks were a contributing factor in at least some of the U.S.
recessions prior to 1972. By extension, energy price increases may
account for much of post-OPEC macroeconomic performance.
I. Introduction
The poor performance of the U.S. economy since 1973 is well documented:

1. The rate of growth of real GNP has fallen from an average of
4.0 percent during 1960-72 to 2.4 percent for 1973-81.
2. The 7.6 percent average inflation rate during 1973-81 was
more than double the 3.1 percent realized for 1960-72.
3. The average unemployment rate over 1973-81 of 6.7 percent
was higher than in any year between 1948 and 1972 with the single
exception of the recession of 1958.
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of
California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF

This is completely different from the layout of the PDF.

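Bounty: I made a mistake in the example I posted. The output above is what I get when I copy and paste from the PDF in Google Chrome. If I copy and paste from Microsoft Edge, I get something like this: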

Oil and the Macroeconomy since World War 11 
James D. Hamilton 
University (f/' Virgiiwa 
All but one of the U.S. recessions since World War II have been preceded, typically with a lag of around three-fourths of a year, by a dramatic increase in the price of crude petroleum. This does not mean that oil shocks caused these recessions. Evidence is presented, however, that even over the period 1948-72 this correlation is statis- tically significant and nonspurlious, supporting the proposition that oil shocks were a contributing factor in at least some of the U.S. recessions prior to 1972. By extension, energy price increases may account for much of post-OPEC macroeconomic performance. 
I. Introduction 
The poor performance of the U.S. economy since 1973 is well docu- mented: 1. The rate of growth of real GNP has fallen from an average of 4.0 percent during 1960-72 to 2.4 percent for 1973-81. 2. The 7.6 percent average inflation rate during 1973-81 was more than double the 3.1 percent realized for 1960-72. 3. The average unemployment rate over 1973-81 of 6.7 percent was higher than in any year between 1948 and 1972 with the single exception of the recession of 1958. 
This paper is drawn from chap. 2 of my Ph.D. dissertation at the University of California, Berkeley. Earlier versions of this paper were presented at the NBER/NSF 

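Sorry for the mistake. The earlier answer was valid for the question as I asked it at the time, but this is the kind of output I am trying to get.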
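The Edge-style output differs from the Chrome-style output mainly in that hard-wrapped lines are joined into one line per paragraph. A rough sketch of that joining step in base R might look like the following. The helper name `join_wrapped_lines` is hypothetical, and the sketch assumes that paragraphs are separated by blank lines, which may not hold for text extracted from every PDF.

```r
# Hypothetical helper: join hard-wrapped lines into one line per paragraph.
# Assumption: paragraphs are separated by at least one blank line.
join_wrapped_lines <- function(x) {
  # Split into lines and trim leading/trailing whitespace on each one
  lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  # Every blank line starts a new paragraph id
  para_id <- cumsum(lines == "")
  keep <- lines != ""
  # Collapse the lines of each paragraph with single spaces
  paragraphs <- tapply(lines[keep], para_id[keep], paste, collapse = " ")
  unname(as.vector(paragraphs))
}

# Small made-up input standing in for extracted PDF text
sample_text <- paste(
  "   All but one of the U.S. recessions since World War II have been",
  "preceded, typically with a lag of around three-fourths of a year, by a",
  "dramatic increase in the price of crude petroleum.",
  "",
  "I. Introduction",
  sep = "\n"
)

paragraphs <- join_wrapped_lines(sample_text)
writeLines(paragraphs)
```

Words hyphenated across a line break are left as they are, so a break such as "statis-" / "tically" would come out as "statis- tically", much like the Edge output shown above.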

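1 answer:

Answer 0 (score: 1)

As far as I can see, the only difference is whether there is whitespace at the start of each line. You can remove it in R with gsub. For example: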

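library(pdftools)
# Source PDF
doc <- "https://www.researchgate.net/profile/James_Hamilton11/publication/24108242_Oil_and_the_Macroeconomy_since_World_War_II/links/0c9605252c0916e709000000.pdf"
# pdf_text() returns one string per page; take the third page
text <- pdf_text(doc)[[3]]
# Remove spaces and tabs at the start of every line, including the first
text_no_ws <- gsub("(^|\n)[ \t]+", "\\1", text)
cat(text_no_ws)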
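To try the idea without downloading anything, here is a minimal self-contained sketch that applies the same kind of leading-whitespace removal to a small made-up string and writes the result to a text file. The sample string, the helper name `strip_leading_ws`, and the temporary output path are all illustrative assumptions, not part of pdftools.

```r
# Made-up sample standing in for one page of extracted PDF text
page <- paste0(
  "                             This is the title\n",
  "                             of this document\n",
  "1.0 Hello my name is John and blah blah.\n",
  "        1.1 blah blah blah blah\n"
)

# Illustrative helper (not part of pdftools)
strip_leading_ws <- function(x) {
  # Drop spaces/tabs that follow the start of the string or any newline,
  # keeping the newline itself via the backreference
  gsub("(^|\n)[ \t]+", "\\1", x)
}

cleaned <- strip_leading_ws(page)
cat(cleaned)

# Write the cleaned text to a file (a temporary path is used here)
out_file <- tempfile(fileext = ".txt")
writeLines(cleaned, out_file, sep = "")
```

This only removes indentation; lines that the PDF wraps mid-sentence remain separate lines.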