Question

# 1
preliminari report intern algebra languag cacm decemb 1958 perli a
j samelson k ca581203 jb march 22 1978 8 28
pm 100 5 1 123 5 1 164 5 1
1 5 1 1 5 1 1 5 1 205
5 1 210 5 1 214 5 1 1982 5
1 398 5 1 642 5 1 669 5 1
1 6 1 1 6 1 1 6 1 1
6 1 1 6 1 1 6 1 1 6
1 1 6 1 1 6 1 1 6 1
165 6 1 196 6 1 196 6 1 1273
6 1 1883 6 1 324 6 1 43 6
1 53 6 1 91 6 1 410 6 1
3184 6 1 
# 2
extract of root by repeat subtract for digit comput cacm
decemb 1958 sugai i ca581202 jb march 22 1978 8
29 pm 2 5 2 2 5 2 2 5
2 
# 3
techniqu depart on matrix program scheme cacm decemb 1958 friedman
m d ca581201 jb march 22 1978 8 30 pm
3 5 3 3 5 3 3 5 3 
# 4
glossari of comput engin and program terminolog cacm novemb 1958
ca581103 jb march 22 1978 8 32 pm 4 5
4 4 5 4 4 5 4 
# 5
two squar root approxim cacm novemb 1958 wadei w g
ca581102 jb march 22 1978 8 33 pm 5 5
5 5 5 5 5 5 5 
# 6
the us of comput in inspect procedur cacm novemb 1958
muller m e ca581101 jb march 22 1978 8 33
pm 6 5 6 6 5 6 6 5 6
477 5 6 6 6 6 
# 7
glossari of comput engin and program terminolog cacm octob 1958
ca581003 jb march 22 1978 8 35 pm 7 5
7 7 5 7 7 5 7 
# 8
on the equival and transform of program scheme cacm octob
1958 friedman m d ca581002 jb march 22 1978 8
36 pm 8 5 8 8 5 8 8 5
8 
# 9
propos for a...

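I have this corpus, which contains about 3000 documents to retrieve. I want to create a new folder where I can save these documents as 1.txt, 2.txt, and so on. Each document starts with #. For example, 1.txt would contain everything from # 1 up to # 2, 2.txt would contain everything from # 2 up to # 3, and so on. Any help is much appreciated.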

Answer 1

Let's assume your corpus is in a file named corpus.txt that looks like this:

# 1
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua.
# 2
At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est 
Lorem ipsum dolor sit amet.
# 3
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et 
accusam et justo duo dolores et ea rebum.

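You can import the data with readLines, which returns one vector element per line, and then extract the indices of the elements that start with #. Then split the text vector at those indices and write the individual text files in a loop. Note that the header lines themselves (# 1, # 2, ...) are not written to the output files.

Example: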

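## Import the corpus, one element per line:
textVec <- readLines("corpus.txt")

## Find the indices of the header lines (those starting with '#');
## the extra last entry marks the end of the final document:
indexVec <- c(grep("^#", textVec), length(textVec) + 1)

## Split the corpus into one character vector per document, dropping the
## header line; seq_len() keeps an empty document from indexing backwards:
textList <- lapply(seq_len(length(indexVec) - 1),
    function(ii) textVec[indexVec[ii] + seq_len(indexVec[ii + 1] - indexVec[ii] - 1)])

## Write each document to its own file (1.txt, 2.txt, ...):
for (ii in seq_along(textList)) writeLines(textList[[ii]], con = paste0(ii, ".txt"))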
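
The question also asks for the files to go into a new folder and describes each file as running from one # line to the next. A minimal sketch of that variant follows. The function name split_corpus, its arguments, and the folder name "docs" in the usage note are made up for illustration and are not part of the answer above. Files are numbered by the position of the document in the corpus, not by the number written on its header line.

```r
# Split a corpus file into one file per document, written to out_dir.
# split_corpus, corpus_file, out_dir and keep_header are illustrative names.
split_corpus <- function(corpus_file, out_dir, keep_header = TRUE) {
  textVec <- readLines(corpus_file)

  # TRUE for header lines such as "# 1"
  isHeader <- grepl("^#", textVec)

  # Running count of headers: each line gets the position of its document.
  # Lines before the first header get 0 and are dropped.
  docId <- cumsum(isHeader)
  keep <- docId > 0
  if (!keep_header) keep <- keep & !isHeader

  # Group the lines by document; explicit factor levels keep empty documents
  docs <- split(textVec[keep],
                factor(docId[keep], levels = seq_len(sum(isHeader))))

  # Create the output folder if it does not exist yet
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  # Write 1.txt, 2.txt, ... into the output folder
  for (ii in seq_along(docs)) {
    writeLines(docs[[ii]], con = file.path(out_dir, paste0(ii, ".txt")))
  }
  invisible(length(docs))
}
```

Calling split_corpus("corpus.txt", "docs") would then write docs/1.txt, docs/2.txt, and so on, each starting with its # line. Passing keep_header = FALSE drops the header lines, as in the loop above.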

How do I extract documents from a whole corpus in R?
