I am trying to convert two columns of text into one.
I am using the pdftools package to extract the text of a Bible in PDF format (source):
library(pdftools)
text <- pdf_text("NewKingJamesBible.pdf")
> length(text) #Total number of pages of document
[1] 1815
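This gives me a character vector with one element per page of the PDF. To access page 60 and save it to a txt file (which I later copy into Word), I can do the following: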
fileConn <- file("page_60.txt")
writeLines(text[60], fileConn)
close(fileConn)
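If more than one page is needed, the same three lines could be wrapped in a small helper. This is only a sketch: `write_page` and the `page_%d.txt` naming scheme are made up here, and `pages` is assumed to be a character vector with one element per page, such as the result of `pdf_text()` above.

```r
# Hypothetical helper: write one page of a pdf_text() result to a text file.
# `pages` is assumed to be a character vector with one element per page.
write_page <- function(pages, page_no, dir = ".") {
  stopifnot(page_no >= 1, page_no <= length(pages))
  path <- file.path(dir, sprintf("page_%d.txt", page_no))
  con <- file(path, open = "w")
  on.exit(close(con))  # close the connection even if writeLines() fails
  writeLines(pages[page_no], con)
  invisible(path)      # return the file path without printing it
}
```

With the vector from the question, `write_page(text, 60)` would write `page_60.txt` to the working directory.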
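Each element of text is a page. Each page is laid out in two columns: you read the left column first, then the right column.
However, I need all the text of each page in a single column. Can this be done in R?
The output text for page 60 looks like this: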
16
Then they journeyed from Bethel. And These were the sons of Jacob who were
when there was but a little distance to born to him in Padan Aram.
go to Ephrath, Rachel labored in
childbirth, and she had hard labor. 27
Then Jacob came to his father Isaac at
Mamre, or Kirjath Arba (that is, Hebron),
17
Now it came to pass, when she was in where Abraham and Isaac had dwelt.
hard labor, that the midwife said to her,
"Do not fear; you will have this son 28
Now the days of Isaac were one
also." hundred and eighty years.
18
And so it was, as her soul was 29
So Isaac breathed his last and died,
departing (for she died), that she called and was gathered to his people, being
his name Ben-Oni; but his father called old and full of days. And his sons Esau
him Benjamin. and Jacob buried him.
19
So Rachel died and was buried on the
way to Ephrath (that is, Bethlehem). 36Now this is the genealogy of Esau,
20 who is Edom.
And Jacob set a pillar on her grave,
which is the pillar of Rachel's grave to 2
this day. Esau took his wives from the daughters
of Canaan: Adah the daughter of Elon
21 the Hittite; Aholibamah the daughter of
Then Israel journeyed and pitched his Anah, the daughter of Zibeon the Hivite;
tent beyond the tower of Eder.
3
22 and Basemath, Ishmael's daughter,
And it happened, when Israel dwelt in sister of Nebajoth.
that land, that Reuben went and lay with
Bilhah his father's concubine; and Israel 4
heard about it. Now the sons of Jacob Now Adah bore Eliphaz to Esau, and
were twelve: Basemath bore Reuel.
5
23
the sons of Leah were Reuben, And Aholibamah bore Jeush, Jaalam,
Jacob's firstborn, and Simeon, Levi, and Korah. These were the sons of
Judah, Issachar, and Zebulun; Esau who were born to him in the land
of Canaan.
24
the sons of Rachel were Joseph and 6
Benjamin; Then Esau took his wives, his sons, his
daughters, and all the persons of his
25 household, his cattle and all his animals,
the sons of Bilhah, Rachel's and all his goods which he had gained
maidservant, were Dan and Naphtali; in the land of Canaan, and went to a
26
country away from the presence of his
and the sons of Zilpah, Leah's brother Jacob.
maidservant, were Gad and Asher.
The single-column layout I am after would look like this, with one verse per line:
16 Then they journeyed from Bethel; and when there was still some distance to go to Ephrath, Rachel began to give birth and she [r]suffered severe labor.
17 When she was in severe labor the midwife said to her, “Do not fear, for now you have another son.”
18 It came about as her soul was departing (for she died), that she named him [s]Ben-oni; but his father called him [t]Benjamin.
19 So Rachel died and was buried on the way to Ephrath (that is, Bethlehem).
20 Jacob set up a pillar over her grave; that is the pillar of Rachel’s grave to this day.
21 Then Israel journeyed on and pitched his tent beyond the [u]tower of [v]Eder.
22 It came about while Israel was dwelling in that land, that Reuben went and lay with Bilhah his father’s concubine, and Israel heard of it.
The Sons of Israel
Now there were twelve sons of Jacob—
23 the sons of Leah: Reuben, Jacob’s firstborn, then Simeon and Levi and Judah and Issachar and Zebulun;
24 the sons of Rachel: Joseph and Benjamin;
25 and the sons of Bilhah, Rachel’s maid: Dan and Naphtali;
26 and the sons of Zilpah, Leah’s maid: Gad and Asher. These are the sons of Jacob who were born to him in Paddan-aram.
27 Jacob came to his father Isaac at Mamre of Kiriath-arba (that is, Hebron), where Abraham and Isaac had sojourned.
28 Now the days of Isaac were one hundred and eighty years.
29 Isaac breathed his last and died and was gathered to his people, an old man [w]of ripe age; and his sons Esau and Jacob buried him.
Esau Moves
36 Now these are the records of the generations of Esau (that is, Edom).
2 Esau took his wives from the daughters of Canaan: Adah the daughter of Elon the Hittite, and Oholibamah the daughter of Anah and the granddaughter of Zibeon the Hivite;
3 also Basemath, Ishmael’s daughter, the sister of Nebaioth.
4 Adah bore Eliphaz to Esau, and Basemath bore Reuel,
5 and Oholibamah bore Jeush and Jalam and Korah. These are the sons of Esau who were born to him in the land of Canaan.
6 Then Esau took his wives and his sons and his daughters and all [x]his household, and his livestock and all his cattle and all his goods which he had acquired in the land of Canaan, and went to another land away from his brother Jacob.
The dput output for page 60:
"16\r\n Then they journeyed from Bethel. And These were the sons of Jacob who were\r\nwhen there was but a little distance to born to him in Padan Aram.\r\ngo to Ephrath, Rachel labored in\r\nchildbirth, and she had hard labor. 27\r\n Then Jacob came to his father Isaac at\r\n Mamre, or Kirjath Arba (that is, Hebron),\r\n17\r\n Now it came to pass, when she was in where Abraham and Isaac had dwelt.\r\nhard labor, that the midwife said to her,\r\n\"Do not fear; you will have this son 28\r\n Now the days of Isaac were one\r\nalso.\" hundred and eighty years.\r\n18\r\n And so it was, as her soul was 29\r\n So Isaac breathed his last and died,\r\ndeparting (for she died), that she called and was gathered to his people, being\r\nhis name Ben-Oni; but his father called old and full of days. And his sons Esau\r\nhim Benjamin. and Jacob buried him.\r\n19\r\n So Rachel died and was buried on the\r\nway to Ephrath (that is, Bethlehem). 36Now this is the genealogy of Esau,\r\n20 who is Edom.\r\n And Jacob set a pillar on her grave,\r\nwhich is the pillar of Rachel's grave to 2\r\nthis day. Esau took his wives from the daughters\r\n of Canaan: Adah the daughter of Elon\r\n21 the Hittite; Aholibamah the daughter of\r\n Then Israel journeyed and pitched his Anah, the daughter of Zibeon the Hivite;\r\ntent beyond the tower of Eder.\r\n 3\r\n22 and Basemath, Ishmael's daughter,\r\n And it happened, when Israel dwelt in sister of Nebajoth.\r\nthat land, that Reuben went and lay with\r\nBilhah his father's concubine; and Israel 4\r\nheard about it. Now the sons of Jacob Now Adah bore Eliphaz to Esau, and\r\nwere twelve: Basemath bore Reuel.\r\n 5\r\n23\r\n the sons of Leah were Reuben, And Aholibamah bore Jeush, Jaalam,\r\nJacob's firstborn, and Simeon, Levi, and Korah. 
These were the sons of\r\nJudah, Issachar, and Zebulun; Esau who were born to him in the land\r\n of Canaan.\r\n24\r\n the sons of Rachel were Joseph and 6\r\nBenjamin; Then Esau took his wives, his sons, his\r\n daughters, and all the persons of his\r\n25 household, his cattle and all his animals,\r\n the sons of Bilhah, Rachel's and all his goods which he had gained\r\nmaidservant, were Dan and Naphtali; in the land of Canaan, and went to a\r\n26\r\n country away from the presence of his\r\n and the sons of Zilpah, Leah's brother Jacob.\r\nmaidservant, were Gad and Asher.\r\n"
Bonus:
I tried to download the PDF with:
download.file("http://www.apuritansmind.com/wp-content/uploads/PDF/NewKingJamesBible.pdf",
"new_king_james_bible.pdf")
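The downloaded file is 6.5 MB, but when I open it, it is completely blank. Why?
I had to download it manually and read it from my PC in order to extract the text with pdftools.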
Answer 0: (score: 2)
Does the following approach work for you?
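# `output` is assumed to hold the text of one page, e.g. output <- text[60]
# split character string on line breaks (the page uses "\r\n" line endings)
output.by.line <- strsplit(output, "\r?\n")[[1]]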
# treat the first 42 characters of each line as column 1 and the rest as column 2
output.by.line <- c(substring(output.by.line, 1, 42), # column 1
                    substring(output.by.line, 43))    # column 2
# remove leading / trailing whitespace
output.by.line <- trimws(output.by.line)
# remove blank lines
output.by.line <- output.by.line[nchar(output.by.line) > 0]
# preface each section number with \n to facilitate splitting
# (may require some manual check as not every section number appears to be in its own line)
output.by.line <- ifelse(grepl("^[0-9]{1,2}$", output.by.line),
                         paste0("\n", output.by.line),
                         output.by.line)
# join all lines together & split by section, dropping empty lines if any
output.by.section <- strsplit(paste(output.by.line, collapse = " "), "\n")[[1]]
output.by.section <- output.by.section[nchar(output.by.section) > 0]
# remove repeated white space inside each section, if any
output.by.section <- stringr::str_squish(output.by.section)
Result:
> output.by.section
[1] "16 Then they journeyed from Bethel. And when there was but a little distance to go to Ephrath, Rachel labored in childbirth, and she had hard labor."
[2] "17 Now it came to pass, when she was in hard labor, that the midwife said to her, \"Do not fear; you will have this son also.\""
[3] "18 And so it was, as her soul was departing (for she died), that she called his name Ben-Oni; but his father called him Benjamin."
[4] "19 So Rachel died and was buried on the way to Ephrath (that is, Bethlehem)."
[5] "20 And Jacob set a pillar on her grave, which is the pillar of Rachel's grave to this day."
[6] "21 Then Israel journeyed and pitched his tent beyond the tower of Eder."
[7] "22 And it happened, when Israel dwelt in that land, that Reuben went and lay with Bilhah his father's concubine; and Israel heard about it. Now the sons of Jacob were twelve:"
[8] "23 the sons of Leah were Reuben, Jacob's firstborn, and Simeon, Levi, Judah, Issachar, and Zebulun;"
[9] "24 the sons of Rachel were Joseph and Benjamin;"
[10] "25 the sons of Bilhah, Rachel's maidservant, were Dan and Naphtali;"
[11] "26 and the sons of Zilpah, Leah's maidservant, were Gad and Asher. These were the sons of Jacob who were born to him in Padan Aram."
[12] "27 Then Jacob came to his father Isaac at Mamre, or Kirjath Arba (that is, Hebron), where Abraham and Isaac had dwelt."
[13] "28 Now the days of Isaac were one hundred and eighty years."
[14] "29 So Isaac breathed his last and died, and was gathered to his people, being old and full of days. And his sons Esau and Jacob buried him. 36Now this is the genealogy of Esau, who is Edom."
[15] "2 Esau took his wives from the daughters of Canaan: Adah the daughter of Elon the Hittite; Aholibamah the daughter of Anah, the daughter of Zibeon the Hivite;"
[16] "3 and Basemath, Ishmael's daughter, sister of Nebajoth."
[17] "4 Now Adah bore Eliphaz to Esau, and Basemath bore Reuel."
[18] "5 And Aholibamah bore Jeush, Jaalam, and Korah. These were the sons of Esau who were born to him in the land of Canaan."
[19] "6 Then Esau took his wives, his sons, his daughters, and all the persons of his household, his cattle and all his animals, and all his goods which he had gained in the land of Canaan, and went to a country away from the presence of his brother Jacob."
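(Note: yes, the line starting with 36 is not identified as a new section here, because the number is not on a line of its own. I am not sure what the best way to handle this is. If it is only a few pages, some manual checking and correction may be reasonable. Otherwise, it would depend on the numbering logic throughout the text, and may be worth a question of its own.)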
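One possible way to catch a number that is fused to the following word (as in "36Now") is to treat any line that starts with digits immediately followed by a capital letter as a section start as well. The sketch below rests on an assumption about the numbering that has not been checked against the whole PDF. `mark_fused_numbers` is a made-up name, and it would be applied to `output.by.line` after the blank lines are removed and before the lines are joined.

```r
# Hypothetical helper: prefix "\n" to lines that begin with a 1-3 digit number
# glued to a capitalised word (e.g. "36Now ..."), and put a space between them.
# Assumption: in this text such a pattern always marks a new chapter or verse.
mark_fused_numbers <- function(lines) {
  sub("^([0-9]{1,3})([A-Z])", "\n\\1 \\2", lines)
}

# Small illustration using three lines taken from the page-60 text
lines <- c("and Jacob buried him.",
           "36Now this is the genealogy of Esau,",
           "who is Edom.")
marked <- mark_fused_numbers(lines)

# Join and split on "\n", as in the answer's code
sections <- strsplit(paste(marked, collapse = " "), "\n")[[1]]
sections <- trimws(sections)
print(sections)
```

A line that legitimately starts with a number followed by a capital letter would be split wrongly by this rule, so a manual check of the result is still advisable.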
For the file download part, you may want to try the solution in this question (i.e. specify mode = "wb" as one of the arguments to download.file).
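A minimal sketch of that suggestion follows. `download_binary` is a made-up wrapper name. The only substantive change from the call in the question is `mode = "wb"`, which asks `download.file` to write the file in binary mode rather than the default text mode (`"w"`), a difference that matters on Windows.

```r
# Hypothetical wrapper around download.file() that always writes in binary mode.
download_binary <- function(url, destfile) {
  status <- download.file(url, destfile, mode = "wb")
  # download.file() returns 0 (invisibly) on success
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}
```

It would be called as `download_binary("http://www.apuritansmind.com/wp-content/uploads/PDF/NewKingJamesBible.pdf", "new_king_james_bible.pdf")`. If the blank pages were indeed caused by text-mode corruption, `pdf_text("new_king_james_bible.pdf")` should then be able to read the file.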