I hope this isn't a stupid question. I have data in this format:
# id 001
# This is a sentence.
# text_sentence 001_01
1 This
2 is
3 a
4 sentence
# id 002
# This is also a sentence.
# text_sentence 001_02
1 This
2 is
3 also
4 a
5 sentence
# id 003
# This too is a sentence.
# text_sentence 002_01
1 This
2 too
3 is
4 a
5 sentence
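Each numbered line has 8 columns. Lines starting with # are a single column. I need to transform the data so that the text number and the sentence number each become a new column. I will also drop the lines starting with # (which I think is easy with filter()), but I can't figure out how to target the numbers in the third 'comment' line, so that the data ends up looking like this: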
001 01 1 This
001 01 2 is
001 01 3 a
001 01 4 sentence
001 02 1 This
001 02 2 is
001 02 3 also
001 02 4 a
001 02 5 sentence
002 01 1 This
002 01 2 too
002 01 3 is
002 01 4 a
002 01 5 sentence
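I have about 50 texts like this ready for analysis, each with hundreds or thousands of lines, so this isn't something to do by hand. I'm fairly new to this kind of data analysis, and nothing I've read describes a transformation like this. It's probably very simple, but if anyone has an idea I'd be very grateful.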
Answer 0 (score: 1)
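This uses a single regular-expression function, grepl, and the strsplit function to extract the relevant values. It's a bit rough and ready; for example, it assumes that no header lines are missing from the file. It should be enough to give you an idea of how to approach something like this. I've written it as if you were writing the output to a file, but it could just as easily be put into a data frame.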
# your sample was saved in a file
# read the entire file into a vector
d <- readLines('test.txt')
# start with line 1
countLines <- 1
while (countLines <= length(d)) {
  # skip down to the third header line
  countLines <- countLines + 2
  # process it
  currLine <- d[countLines]
  # split on spaces then on underscore to extract
  # the i.d. and the sentence number
  idset <- strsplit(strsplit(currLine, ' ')[[1]][3], '_')
  countLines <- countLines + 1
  # write out the words until the next header line
  # that is, a line without a digit in the first column
  while (grepl('^\\d', d[countLines])) {
    print(sprintf('%s %s %s', idset[[1]][1], idset[[1]][2], d[countLines]))
    countLines <- countLines + 1
  }
}
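The strsplit function returns a list of 1 item - a character vector. idset[[1]] is the only list element; idset[[1]][1] is the first item in that character vector. So a line like # text_sentence 002_01 is split into three chunks, and then the third chunk is split into two.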
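To make the nested strsplit call concrete, here is a small sketch run on one header line from the sample data. The variable names (header, parts, ids, text_id, sentence_id) are only for illustration and do not appear in the answer's code.

```r
# One "text_sentence" header line from the sample data
header <- "# text_sentence 002_01"

# First split on spaces: strsplit returns a list holding one character vector,
# so [[1]] is needed to get at the vector itself
parts <- strsplit(header, " ")[[1]]
# parts is c("#", "text_sentence", "002_01")

# Then split the third chunk on the underscore
ids <- strsplit(parts[3], "_")[[1]]

text_id     <- ids[1]  # "002"
sentence_id <- ids[2]  # "01"
```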
The pattern the grepl function looks for is a digit (\\d) that is the first character in the line (^).
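Since grepl is vectorised, the same test can be applied to every line at once, which is one way to collect the rows in a data frame instead of printing them. The following is only a sketch under the same assumptions as the loop: every token line follows a "# text_sentence" header. The helper name extract_tokens and the column names are made up for illustration.

```r
# Illustrative helper (hypothetical name): gather rows in a data frame
extract_tokens <- function(lines) {
  is_token <- grepl("^\\d", lines)               # token lines start with a digit
  is_ts    <- grepl("^# text_sentence ", lines)  # third header line of each block

  # Take "001_01" off each text_sentence line and split it on the underscore
  ids <- strsplit(sub("^# text_sentence ", "", lines[is_ts]), "_")

  # cumsum() tells each token line which text_sentence header it follows
  block <- cumsum(is_ts)[is_token]

  data.frame(
    text     = vapply(ids, `[`, character(1), 1)[block],
    sentence = vapply(ids, `[`, character(1), 2)[block],
    line     = lines[is_token],
    stringsAsFactors = FALSE
  )
}

# A shortened copy of the sample data
d <- c("# id 001", "# This is a sentence.", "# text_sentence 001_01",
       "1 This", "2 is",
       "# id 002", "# This is also a sentence.", "# text_sentence 001_02",
       "1 This")
res <- extract_tokens(d)
```

The line column still holds the whole token line; splitting it into its own columns is a separate step that depends on the real column separator.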
Answer 1 (score: 0)
You could try something like this:
Code:
# data is stored in data.frame df
# Find lines with id (each one marks the start of a block)
idLine <- grep("^# id ", df$V1)
# [1] 1 8 16
# Iterate over all id's
result <- lapply(seq_along(idLine), function(i) {
  # Extract line in which id was found
  x <- idLine[i]
  # Extract text number (before the underscore) from id line + 2
  ID1 <- sub(".* (\\d+)_.*", "\\1", df$V1[x + 2])
  # Extract sentence number (after the underscore) from id line + 2
  ID2 <- sub(".*_", "", df$V1[x + 2])
  # Extract text between current and following id
  sentence <- df[(x + 3):(ifelse(i != length(idLine), idLine[i + 1] - 1, nrow(df))), ]
  data.frame(ID1, ID2,
             num = as.numeric(sub(" .*", "", sentence)),
             text = sub(".* ", "", sentence))
})
do.call(rbind, result)
Result:
# ID1 ID2 num text
# 001 01 1 This
# 001 01 2 is
# 001 01 3 a
# 001 01 4 sentence
# 001 02 1 This
# 001 02 2 is
# 001 02 3 also
# 001 02 4 a
# 001 02 5 sentence
# 002 01 1 This
# 002 01 2 too
# 002 01 3 is
# 002 01 4 a
# 002 01 5 sentence
Data (df):
df <- structure(list(V1 = c("# id 001", "# This is a sentence.", "# text_sentence 001_01",
"1 This", "2 is", "3 a", "4 sentence", "# id 002", "# This is also a sentence.",
"# text_sentence 001_02", "1 This", "2 is", "3 also", "4 a",
"5 sentence", "# id 003", "# This too is a sentence.", "# text_sentence 002_01",
"1 This", "2 too", "3 is", "4 a", "5 sentence")), .Names = "V1", row.names = c(NA,
-23L), class = "data.frame")
You can read your own file into this shape with data.frame(V1 = readLines("file.txt")) or data.table::fread("file.txt", sep = "\n").
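As a rough sketch of the readLines route, the lines below write a few sample rows to a temporary file (standing in for the real file, whose name is not given here) and read them back into a one-column data frame named df with column V1, the layout the code above expects.

```r
# Write a few sample lines to a temporary file (stand-in for the real file)
path <- tempfile(fileext = ".txt")
writeLines(c("# id 001", "# This is a sentence.", "# text_sentence 001_01",
             "1 This", "2 is"), path)

# One row per line, in a single character column named V1
df <- data.frame(V1 = readLines(path), stringsAsFactors = FALSE)
```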