I have an input file that contains one paragraph. I need to split that paragraph into two sub-paragraphs.
paragraph.xml
<Text>
This is first line.
This is second line.
\delemiter\new\one
This is third line.
This is fourth line.
</Text>
R code:
library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]
I need to split this paragraph into two paragraphs.
Paragraph 1
This is first line.
This is second line.
Paragraph 2
This is third line.
This is fourth line.
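I found the strsplit function very useful, but it never splits the multi-line text.
Answer 0 (score: 2)
Since you have an XML file, it is better to use the tools from the XML package. I see that you have already started using it, so here is a continuation of what you began.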
library(XML)
doc <- xmlParse('paragraph.xml') ## equivalent xmlTreeParse (...,useInternalNodes =TRUE)
## extract the text of the node Text
mytext = xpathSApply(doc,'//Text/text()',xmlValue)
## convert it to a list of lines using scan
lines <- scan(text=mytext,sep='\n',what='character')
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs (seq_len also copes with a delimiter on the first line)
p1 <- lines[seq_len(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]
You can then use paste or write to recover the paragraph structure. For example, using write:
write(p1,"",sep='\n')
This is first line.
This is second line.
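The paste route mentioned above could look like the following minimal sketch. It assumes p1 and p2 hold the line vectors produced by the code above; they are redefined by hand here only so that the snippet runs on its own, and the names para1 and para2 are illustrative.

```r
# Stand-ins for the vectors computed above (redefined so the snippet is self-contained)
p1 <- c("This is first line.", "This is second line.")
p2 <- c("This is third line.", "This is fourth line.")

# Collapse each vector of lines into a single multi-line string
para1 <- paste(p1, collapse = "\n")
para2 <- paste(p2, collapse = "\n")

# cat() prints the embedded newlines as real line breaks,
# with a blank line between the two paragraphs
cat(para1, "\n\n", para2, "\n", sep = "")
```

Unlike write, which prints or saves the lines directly, paste returns each paragraph as one character string that can be stored or processed further.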
Answer 1 (score: 1)
Here is a roundabout possibility, using split, grepl, and cumsum.
Some sample data:
temp <- c("This is first line.", "This is second line.",
"\\delimiter\\new\\one", "This is third line.",
"This is fourth line.", "\\delimiter\\new\\one",
"This is fifth line")
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one"
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one"
# [7] "This is fifth line"
Use split after generating "groups" by applying cumsum to the output of grepl:
temp1 <- split(temp, cumsum(grepl("delimiter", temp)))
temp1
# $`0`
# [1] "This is first line." "This is second line."
#
# $`1`
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line."
#
# $`2`
# [1] "\\delimiter\\new\\one" "This is fifth line"
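To see why this grouping works, it may help to look at the intermediate values. The sketch below repeats the sample data so it runs on its own; the variable names is_delim and group are illustrative and not part of the code above.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# TRUE wherever a line contains the delimiter text
is_delim <- grepl("delimiter", temp)
is_delim
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

# The running total goes up by one at each delimiter, so every line
# between two delimiters ends up with the same group number
group <- cumsum(is_delim)
group
# [1] 0 0 1 1 1 2 2
```

Because the counter increases on the delimiter line itself, each delimiter lands at the start of the group that follows it, which matches the output shown above.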
If further cleanup is needed, here is one option:
lapply(temp1, function(x) {
x[grep("delimiter", x)] <- NA
x[complete.cases(x)]
})
# $`0`
# [1] "This is first line." "This is second line."
#
# $`1`
# [1] "This is third line." "This is fourth line."
#
# $`2`
# [1] "This is fifth line"
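As a possible variation on the cleanup step above, the delimiter lines can be dropped before splitting, so no second pass over the list is needed. This is only a sketch on the same sample data; is_delim and paragraphs are illustrative names.

```r
temp <- c("This is first line.", "This is second line.",
          "\\delimiter\\new\\one", "This is third line.",
          "This is fourth line.", "\\delimiter\\new\\one",
          "This is fifth line")

# fixed = TRUE matches the text literally rather than as a regular expression
is_delim <- grepl("delimiter", temp, fixed = TRUE)

# Keep only the non-delimiter lines, grouped by the running delimiter count
paragraphs <- split(temp[!is_delim], cumsum(is_delim)[!is_delim])
paragraphs
# $`0`
# [1] "This is first line."  "This is second line."
#
# $`1`
# [1] "This is third line."  "This is fourth line."
#
# $`2`
# [1] "This is fifth line"
```

Note that the file in the question spells the marker \delemiter\new\one, so the pattern would need to be "delemiter" when working with that file.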