Question

I hope this isn't a silly question. I have data in this format:

# id 001
# This is a sentence.
# text_sentence 001_01
1   This
2   is
3   a
4   sentence
# id 002
# This is also a sentence.
# text_sentence 001_02
1   This
2   is
3   also
4   a
5   sentence
# id 003
# This too is a sentence.
# text_sentence 002_01
1   This
2   too
3   is
4   a
5   sentence

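Each numbered line has 8 columns. Lines starting with # are just a single column. I need to transform the data so that the text number and the sentence number each form a new column. I will also remove the lines starting with # (I think that is easy with filter()), but I can't figure out how to target the numbers in the third 'comment' line, so that the data ends up looking like this: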

001   01   1   This
001   01   2   is
001   01   3   a
001   01   4   sentence
001   02   1   This
001   02   2   is
001   02   3   also
001   02   4   a
001   02   5   sentence
002   01   1   This
002   01   2   too
002   01   3   is
002   01   4   a
002   01   5   sentence

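I have about 50 texts like this ready to analyze, each with hundreds or thousands of lines, so this is not something to do by hand. I'm fairly new to this type of data analysis, and nothing I've read describes a transformation like this. It may be very simple, but if anyone has an idea I would be very grateful.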

Answer 1

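This uses a single regular-expression function, grepl, plus the strsplit function to extract the relevant values. It's a bit rough and ready; for example, it assumes the file has no missing header lines. It should be enough to give you an idea of how to approach this sort of thing. I've done it as though you were writing the result out to a file, but it could just as easily be put into a data frame.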

# your sample was saved in a file
# read the entire file into a vector
d <- readLines('test.txt')
# start with line 1
countLines <- 1
while (countLines <= length(d)) {
    # skip down to the third header line
    countLines <- countLines + 2
    # process it
    currLine <- d[countLines]
    # split on spaces then on underscore to extract
    # the i.d. and the sentence number
    idset <- strsplit(strsplit(currLine, ' ')[[1]][3], '_')
    countLines <- countLines + 1
    # write out the words until the next header line
    # that is, a line without a digit in the first column
    while(grepl('^\\d',d[countLines])) {
        print(sprintf('%s   %s   %s', idset[[1]][1], idset[[1]][2], d[countLines]))
        countLines <- countLines + 1
    }
}

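The strsplit function returns a list of 1 item: a character vector. idset[[1]] is the only list element; idset[[1]][1] is the first item in that character vector. So a line like # text_sentence 002_01 is split into three chunks on spaces, and then the third chunk is split into two on the underscore.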
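To make the two-step split concrete, here is a minimal sketch run on a single header line. The variable names chunks, text_id and sentence_id are only illustrative and do not appear in the loop above.

```r
# One header line in the same shape as the sample data.
currLine <- "# text_sentence 002_01"

# First split on spaces; strsplit returns a list, so take element 1.
chunks <- strsplit(currLine, " ")[[1]]
# chunks is c("#", "text_sentence", "002_01")

# Then split the third chunk on the underscore.
idset <- strsplit(chunks[3], "_")

text_id <- idset[[1]][1]      # "002"
sentence_id <- idset[[1]][2]  # "01"
```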

The pattern that grepl looks for is a digit (\\d) that is the first character of the line (^).
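A quick way to check what that pattern matches is to run it on a few made-up lines. The vector below is invented for illustration; the last call shows that grepl returns FALSE for NA, which is what lets the inner while loop stop once countLines runs past the end of d.

```r
# Made-up lines: a header, two token lines, and an indented line.
sample_lines <- c("# id 001", "1   This", "12  word", " 3  indented")

# TRUE only where the very first character is a digit.
is_token <- grepl("^\\d", sample_lines)
# is_token is c(FALSE, TRUE, TRUE, FALSE)

# Indexing past the end of a vector gives NA, and grepl treats NA as no match.
past_end <- grepl("^\\d", sample_lines[length(sample_lines) + 1])
# past_end is FALSE
```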

Answer 2

You could try something like this:

Code:

# data is stored in data.frame df

# Find lines with id
idLine <- grep("^# id ", df$V1)
# [1]  1  8 16
# Iterate over all id's
result <- lapply(seq_along(idLine), function(i) {
    # Extract line in which id was found
    x <- idLine[i]
    # Extract ID from id line
    ID1 <- sub(".* ", "", df$V1[x])
    # Extract second ID from id line + 2
    ID2 <- sub(".*_", "", df$V1[x + 2])
    # Extract text between current and following id
    sentence <- df[(x + 3):(ifelse(i != length(idLine), idLine[i + 1] - 1, nrow(df))), ]
    data.frame(ID1, ID2, 
               num = as.numeric(sub(" .*", "" ,sentence)), 
               text = sub(".* ", "" ,sentence))
})
do.call(rbind, result)

Result:

# ID1 ID2 num     text
# 001  01   1     This
# 001  01   2       is
# 001  01   3        a
# 001  01   4 sentence
# 002  02   1     This
# 002  02   2       is
# 002  02   3     also
# 002  02   4        a
# 002  02   5 sentence
# 003  01   1     This
# 003  01   2      too
# 003  01   3       is
# 003  01   4        a
# 003  01   5 sentence
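Note that the code above takes ID1 from the "# id" line, so the first column reads 001, 002, 003, whereas the output requested in the question takes both numbers from the text_sentence line (001 01, 001 02, 002 01). A minimal vectorized sketch of that variant follows. It assumes every token line is preceded by a text_sentence header and keeps only the first two fields of each token line; the names lines, block and result are illustrative.

```r
# The sample from the question, one string per line.
lines <- c("# id 001", "# This is a sentence.", "# text_sentence 001_01",
           "1   This", "2   is", "3   a", "4   sentence",
           "# id 002", "# This is also a sentence.", "# text_sentence 001_02",
           "1   This", "2   is", "3   also", "4   a", "5   sentence",
           "# id 003", "# This too is a sentence.", "# text_sentence 002_01",
           "1   This", "2   too", "3   is", "4   a", "5   sentence")

is_header <- grepl("^# text_sentence ", lines)
is_token <- grepl("^\\d", lines)

# Running count of headers: each token line gets the index of its header.
block <- cumsum(is_header)

# Pull "001_01" etc. out of the header lines and split on the underscore.
ids <- sub("^# text_sentence ", "", lines[is_header])
parts <- strsplit(ids, "_")
text_id <- vapply(parts, `[`, character(1), 1)
sentence_id <- vapply(parts, `[`, character(1), 2)

# Split token lines on runs of whitespace (assumes at least two fields each).
fields <- strsplit(lines[is_token], "\\s+")

result <- data.frame(
  text = text_id[block[is_token]],
  sentence = sentence_id[block[is_token]],
  num = as.integer(vapply(fields, `[`, character(1), 1)),
  word = vapply(fields, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```

Because the comment lines are simply never selected, no separate filtering step is needed. For the real 8-column rows, the remaining fields would have to be extracted in the same way.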

Data (df):

df <- structure(list(V1 = c("# id 001", "# This is a sentence.", "# text_sentence 001_01", 
"1   This", "2   is", "3   a", "4   sentence", "# id 002", "# This is also a sentence.", 
"# text_sentence 001_02", "1   This", "2   is", "3   also", "4   a", 
"5   sentence", "# id 003", "# This too is a sentence.", "# text_sentence 002_01", 
"1   This", "2   too", "3   is", "4   a", "5   sentence")), .Names = "V1", row.names = c(NA, 
-23L), class = "data.frame")

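You can read your data with readLines("file.txt"), wrapped as data.frame(V1 = readLines("file.txt")) so that the V1 column used above exists, or with data.table::fread("file.txt", sep = "\n", header = FALSE).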
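Since the question mentions roughly 50 such files, the read step can be wrapped in a small function and applied to every file in a folder. This is only a sketch: read_text is a hypothetical helper, and the two files written to a temporary directory stand in for the real texts.

```r
# Hypothetical helper: one file becomes a one-column data frame (column V1).
read_text <- function(path) {
  data.frame(V1 = readLines(path), stringsAsFactors = FALSE)
}

# Demo set-up only: two tiny stand-in files in a fresh temporary directory.
demo_dir <- tempfile("texts_")
dir.create(demo_dir)
writeLines(c("# id 001", "# This is a sentence.", "# text_sentence 001_01",
             "1   This", "2   is"),
           file.path(demo_dir, "a.txt"))
writeLines(c("# id 001", "# This too is a sentence.", "# text_sentence 002_01",
             "1   This"),
           file.path(demo_dir, "b.txt"))

# Read every .txt file in the folder into a named list of data frames.
paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
dfs <- lapply(paths, read_text)
names(dfs) <- basename(paths)
```

Each element of dfs can then be passed through the transformation, and the pieces combined with do.call(rbind, ...).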

R - Add text & sentence ID columns from line contents
