How do I split multi-line text in R?

Time: 2013-03-20 04:26:04

Tags: r split

I have an input file that contains one paragraph. I need to split that paragraph into two sub-paragraphs.

paragraph.xml

 <Text>
      This is first line.
      This is second line.
      \delemiter\new\one
      This is third line.
      This is fourth line.
 </Text>

R code:

library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]

I need to split this paragraph into two paragraphs.

Paragraph 1

 This is first line.
 This is second line.

Paragraph 2

  This is third line.
  This is fourth line.

I found the strsplit function very useful, but it never seems to split multi-line text.

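2 answers:

Answer 0: (score: 2)

Since you have an XML file, it is better to use the tools from the XML package. I see you have already started using it, so here is a continuation of what you began.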

library(XML)
## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
doc <- xmlParse('paragraph.xml')
## extract the text of the node Text
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan;
## strip.white = TRUE drops the indentation around each line
## so that the exact comparison below can match
lines <- scan(text = mytext, sep = '\n', what = 'character', strip.white = TRUE)
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]

You can then use paste or write to rebuild the paragraph structure. For example, with write:

write(p1,"",sep='\n')

This is first line.
This is second line.
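
The paste route mentioned above is not shown, so here is a minimal sketch of it. It assumes the lines have already been read into a character vector; the sample vector and the names p1_text and p2_text are illustrative and not part of the original answer.

```r
# Sample lines standing in for the result of scan() (assumed values)
lines <- c("This is first line.", "This is second line.",
           "\\delemiter\\new\\one",
           "This is third line.", "This is fourth line.")

# Position of the delimiter line
delim <- which(lines == "\\delemiter\\new\\one")

# Collapse each half into a single newline-separated string
p1_text <- paste(lines[seq_len(delim - 1)], collapse = "\n")
p2_text <- paste(lines[seq(delim + 1, length(lines))], collapse = "\n")

# Print the first paragraph
cat(p1_text, "\n")
```

Collapsing with paste gives one string per paragraph, which may be more convenient than a vector of lines if the paragraphs are to be written back into XML or passed to other text functions.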

Answer 1: (score: 1)

Here is a roundabout possibility, using split, grepl, and cumsum.

Some sample data:

temp <- c("This is first line.", "This is second line.", 
          "\\delimiter\\new\\one", "This is third line.", 
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")
# [1] "This is first line."   "This is second line."  "\\delimiter\\new\\one"
# [4] "This is third line."   "This is fourth line."  "\\delimiter\\new\\one"
# [7] "This is fifth line"   

Use split after generating "groups" by applying cumsum to the result of grepl:

temp1 <- split(temp, cumsum(grepl("delimiter", temp)))
temp1
# $`0`
# [1] "This is first line."  "This is second line."
# 
# $`1`
# [1] "\\delimiter\\new\\one" "This is third line."   "This is fourth line." 
# 
# $`2`
# [1] "\\delimiter\\new\\one" "This is fifth line"  
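
To see why the grouping works, it can help to look at the intermediate vectors. The sketch below rebuilds the same sample data; the names is_delim and groups are illustrative only.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# TRUE wherever a line contains the delimiter
is_delim <- grepl("delimiter", temp)

# The running count of delimiters seen so far becomes the group id,
# so the id increases by one at every delimiter line
groups <- cumsum(is_delim)
groups
# [1] 0 0 1 1 1 2 2
```

Each delimiter line starts a new group, which is why the delimiter itself appears as the first element of groups 1 and 2 in the split output.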

If further cleanup is required, here is one option:

lapply(temp1, function(x) {
  x[grep("delimiter", x)] <- NA
  x[complete.cases(x)]
})
# $`0`
# [1] "This is first line."  "This is second line."
# 
# $`1`
# [1] "This is third line."  "This is fourth line."
# 
# $`2`
# [1] "This is fifth line"
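
For comparison, strsplit can also handle multi-line text, provided the whole text is held in a single string and the delimiter is matched literally. The following is only a sketch under that assumption; the variable names txt, parts, and paras are made up for illustration.

```r
# The whole paragraph as one string, with lines separated by "\n" (assumed input)
txt <- paste("This is first line.", "This is second line.",
             "\\delimiter\\new\\one",
             "This is third line.", "This is fourth line.",
             sep = "\n")

# Split on the literal delimiter; fixed = TRUE keeps the backslashes
# from being interpreted as regular-expression escapes
parts <- strsplit(txt, "\\delimiter\\new\\one", fixed = TRUE)[[1]]

# Break each part into lines and drop the empty strings
# left behind by the newlines around the delimiter
paras <- lapply(strsplit(parts, "\n", fixed = TRUE), function(x) x[nzchar(x)])
paras
```

Without fixed = TRUE the pattern is treated as a regular expression, in which the backslashes have special meaning, and that may be one reason strsplit appears not to work on this kind of delimiter.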