Get specific sections of a text file and their contents

Time: 2018-07-25 15:24:49

Tags: r

I have a text file that looks like this:

1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last

I also have another, similar text file:

1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last

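I would like to know how to read the data in the first text file and use it to find each section in the second file, together with everything that follows it up to the next section. Basically, I am trying to get something like this: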

Section      Content
1 Hello      My name is John. It was nice to meet you.
1.1 Hi       Hi again. My last name is Doe. 1.1.1 Bye
1.2 Hey      Greetings.

...and so on.

How would I go about doing this?

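2 answers:

Answer 0 (score: 1)

The solution below can certainly be improved, but it should give you an idea of how to approach the problem. Depending on the size and structure of the files you need to process, this approach may work as is, or the section detection and the speed may need further tuning.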

file1 = 
"1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last"

file2 = 
"1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last"

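# Split both documents into character vectors, one element per line.
file1 = unlist(strsplit(file1, "\n", fixed = TRUE))
file2 = unlist(strsplit(file2, "\n", fixed = TRUE))
# For each heading in file1, find the line of file2 that matches it (ignoring case).
positions = unlist(sapply(file1, function(x) grep(paste0("^", x, "$"), file2, ignore.case = TRUE)))
# A section ends on the line before the next heading; the last one ends with file2.
positions = cbind(positions, c(positions[-1]-1, length(file2)))
# Collect the lines of every section, then drop the heading line itself.
text = mapply(function(x, y) file2[x:y], positions[,1], positions[,2])
text = lapply(text, function(x) x[-1])
result = cbind(positions, text)
result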
# positions    text                                              
# 1 Hello         1         2  "My name is John. It was nice to meet you."       
# 1.1 Hi          3         5  Character,2                                       
# 1.2 Hey         6         7  "Greetings."                                      
# 2 Next section  8         9  "This is the second section. I am majoring in CS."
# 2.1 New section 10        15 Character,5                                       
# 4 last          16        16 Character,0  

# Note that the text column contains lists storing the individual lines.
# e.g. for "2.1 New section":
class(result[5, "text"])
# list
result[5, "text"]
# [[1]]
# [1] "Welcome. I am an undergraduate student." "3 third"  #<< note the different spelling of third                              
# [3] "1. hi"                                   "2. hello"                               
# [5] "3. hey"  
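
To get closer to the two-column Section / Content layout asked for in the question, the lines of each section can be collapsed into a single string. The following is only a minimal sketch under a few assumptions, not part of the original answer: `extract_sections`, `toc_path` and `doc_path` are hypothetical names, the headings are assumed to appear in the second file in the same order as in the first, and each heading is assumed to occupy a line of its own. It compares whole lines with `match()` instead of building a regular expression from each heading, so characters such as "." in "1.1 Hi" are treated literally.

```r
# Hypothetical helper: pair every heading listed in `toc_path` with the
# lines that follow it in `doc_path`, up to the next heading that was found.
extract_sections <- function(toc_path, doc_path) {
  headings <- trimws(readLines(toc_path))
  doc <- readLines(doc_path)

  # Exact, case-insensitive comparison of whole lines (no regex involved).
  start <- match(tolower(headings), tolower(trimws(doc)))
  found <- !is.na(start)
  if (!any(found)) {
    return(data.frame(Section = character(0), Content = character(0),
                      stringsAsFactors = FALSE))
  }
  headings <- headings[found]
  start <- start[found]

  # Assumption: headings occur in the document in the same order as in the
  # heading file, so a section ends one line before the next heading.
  end <- c(start[-1] - 1L, length(doc))

  # Join the body lines of each section; a heading with no body gives "".
  content <- mapply(function(s, e) {
    if (e > s) paste(doc[(s + 1L):e], collapse = " ") else ""
  }, start, end)

  data.frame(Section = headings, Content = unname(content),
             stringsAsFactors = FALSE)
}

# Usage with two temporary files standing in for the real ones.
toc_path <- tempfile(fileext = ".txt")
doc_path <- tempfile(fileext = ".txt")
writeLines(c("1 Hello", "1.1 Hi", "1.2 Hey"), toc_path)
writeLines(c("1 Hello", "My name is John. It was nice to meet you.",
             "1.1 Hi", "Hi again. My last name is Doe.", "1.1.1 Bye",
             "1.2 Hey", "Greetings."), doc_path)
sections <- extract_sections(toc_path, doc_path)
print(sections)
```

As with the code above, a heading that is spelled differently in the two files (such as "3 thrid" versus "3 third") is not found, and its text ends up in the content of the preceding section.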

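Answer 1 (score: 0)

The answer to this question is yes, it can be done. The implementation will vary greatly depending on the programming language you use for the task. A high-level overview would be:

  1. Split the original file by line into an array of strings. This is the list of keys you will use to search the second document.
  2. Read the second file into a string variable.
  3. Iterate over all the keys (iterator x) and find their index in the second document. Something like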
  

  int start = seconddocument.indexof(keys[x]);
  int end = seconddocument.indexof(keys[x + 1]);

  4. Then, using those start and end positions, you can extract the content with a substring() function.
  

  string matchedContent = seconddocument.substring(start, end);

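This works until you reach the last match, because keys[x + 1] will not exist when x is the last key. In that case you must set end to the position of the last character in the document, or use a substring method that takes only the starting point.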
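
Since the question is tagged r, here is a rough sketch of how the steps above might look in R. It is an illustration under stated assumptions, not a definitive implementation: `extract_by_offset` is a hypothetical name, keys are assumed to occur in the document in the same order as in the key list, and each key is assumed not to appear earlier inside ordinary body text (a plain substring search cannot tell a heading from a sentence that happens to contain the same characters). Keys that are not found at all are skipped, and the last key runs to the end of the document, as described above.

```r
# Hypothetical helper: character-offset version of the steps above.
extract_by_offset <- function(keys, document) {
  # Step 3: find where each key starts. fixed = TRUE searches for the literal
  # text; lower-casing both sides makes the search case-insensitive
  # (assumes lower-casing does not change the length of the text).
  start <- vapply(keys, function(k) {
    as.integer(regexpr(tolower(k), tolower(document), fixed = TRUE))
  }, integer(1))

  # regexpr() returns -1 for a key that does not occur; drop those keys.
  keep <- start > 0
  stopifnot(any(keep))
  keys <- keys[keep]
  start <- unname(start[keep])

  # A section ends just before the next key starts. The last key has no
  # successor, so its section runs to the last character of the document.
  end <- c(start[-1] - 1L, nchar(document))

  # Step 4: cut out the text between the end of a key and the next key.
  content <- trimws(substring(document, start + nchar(keys), end))
  data.frame(Section = keys, Content = content, stringsAsFactors = FALSE)
}

# Usage: the second document is held in one string, as in step 2.
keys <- c("1 Hello", "1.1 Hi", "3 thrid", "1.2 Hey")
document <- paste(c("1 Hello", "My name is John.",
                    "1.1 Hi", "Hi again.",
                    "1.2 Hey", "Greetings."), collapse = "\n")
res <- extract_by_offset(keys, document)
print(res)
```

Working on line numbers rather than character offsets, as the other answer does, avoids the risk of matching a key in the middle of a line.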

HTH