Reading a text file into a list in R using a multi-character separator

Time: 2016-10-23 23:32:05

Tags: r io

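I have a text file containing text data (I mean actual English sentences, not tables or numbers). Every few sentences there is a triple asterisk (***) that separates the preceding group of sentences from the next one. I need each group as an element of a list. I have tried readLines, readChar, and strsplit, but could not get any of them to work. Here is an example: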

Hello Everyone.
My name is James.

***

Hello James!
My name is Amy.
Nice to meet you.

***

Hi Amy!
My name is Sue.

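So I need a list with three elements, where each element is a vector containing one group. Note that the sentences within a group are separated by line breaks.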

2 answers:

Answer 0 (score: 2)

Suppose data.txt holds your text. This does what you want (in base R):

# Read the file as a character vector, one element per line
data <- readLines("data.txt")

# Optionally remove empty lines
data <- data[data != ""]

# Split based on triple asterisk entries
lst <- split(data, cumsum(data == "***"))

# Remove triple asterisk entries
lst <- lapply(lst, function(x) x[x != "***"])
print(lst)

$`0`
[1] "Hello Everyone."   "My name is James."

$`1`
[1] "Hello James!"      "My name is Amy."   "Nice to meet you."

$`2`
[1] "Hi Amy!"         "My name is Sue."
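
The approach above can be tried end to end without a real data.txt. The following is a minimal sketch, not part of the original answer: the temporary file, the sample lines, and the names sample_text, path, is_sep, and groups are all illustrative assumptions. It also adds two optional touches, trimws() to tolerate stray spaces around the separator and unname() to drop the "0", "1", "2" names that cumsum() produces.

```r
# Sample lines copied from the question, written to a temporary file
# (a hypothetical stand-in for data.txt)
sample_text <- c(
  "Hello Everyone.", "My name is James.", "", "***", "",
  "Hello James!", "My name is Amy.", "Nice to meet you.", "", "***", "",
  "Hi Amy!", "My name is Sue."
)
path <- tempfile(fileext = ".txt")
writeLines(sample_text, path)

# Read the lines, strip surrounding whitespace, and drop empty lines
lines <- trimws(readLines(path))
lines <- lines[lines != ""]

# Mark separator lines; the running count of separators is the group id
is_sep <- lines == "***"
group_id <- cumsum(is_sep)

# Split only the non-separator lines and drop the "0", "1", "2" names
groups <- unname(split(lines[!is_sep], group_id[!is_sep]))

unlink(path)
str(groups)
```

Filtering the separators out before split() avoids the extra lapply() pass, at the cost of one more intermediate vector.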

Answer 1 (score: 1)

Try this. If your text is in a file, replace textConnection(Lines) with something like "myfile.txt".

Lines <- "Hello Everyone.
My name is James.

***

Hello James!
My name is Amy.
Nice to meet you.

***

Hi Amy!
My name is Sue."

# L <- paste(readLines("myfile.txt"), collapse = "\n")
L <- paste(readLines(textConnection(Lines)), collapse = "\n")
v <- strsplit(L, "\n\n***\n\n", fixed = TRUE)[[1]]

This gives the following character vector of length 3:

> v
[1] "Hello Everyone.\nMy name is James."              
[2] "Hello James!\nMy name is Amy.\nNice to meet you."
[3] "Hi Amy!\nMy name is Sue."
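
Since the question mentions readChar, here is a hedged sketch of reading the whole file in a single call instead of pasting the lines back together. It is not part of the original answer; the temporary file and the names path, raw_text, and v2 are assumptions made for illustration, and using file.size() as the character count assumes a single-byte encoding such as plain ASCII.

```r
# Hypothetical input file with two groups
path <- tempfile(fileext = ".txt")
writeLines(c("Hello Everyone.", "My name is James.", "", "***", "",
             "Hi Amy!", "My name is Sue."), path)

# Read the entire file as one string
raw_text <- readChar(path, nchars = file.size(path))

# Normalise Windows line endings, then drop the final newline
raw_text <- gsub("\r\n", "\n", raw_text, fixed = TRUE)
raw_text <- sub("\n$", "", raw_text)

# Split on the exact separator, as in the answer
v2 <- strsplit(raw_text, "\n\n***\n\n", fixed = TRUE)[[1]]

unlink(path)
print(v2)
```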

If you want a list of character vectors, with one element per line, rather than a single character vector, apply strsplit again:

strsplit(v, "\n")

Or, if you only want to coerce v to a list:

as.list(v)
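
The fixed separator "\n\n***\n\n" only matches when exactly one blank line sits on each side of the asterisks. If the file is less regular, a regular expression may be more forgiving. The following is a sketch under that assumption, not part of the original answer; the input text and the names txt, sep_pattern, blocks, and groups are illustrative.

```r
# Hypothetical input with irregular spacing around the separators:
# no blank line before the second "***", trailing spaces after it,
# and two blank lines following it
txt <- paste(c(
  "Hello Everyone.", "My name is James.", "", "***", "",
  "Hello James!", "My name is Amy.", "Nice to meet you.", "***  ", "", "",
  "Hi Amy!", "My name is Sue."
), collapse = "\n")

# A line holding only "***" (optionally padded with spaces or tabs),
# together with any whitespace and blank lines around it
sep_pattern <- "\\s*\\n[ \\t]*\\*{3}[ \\t]*\\n\\s*"

# One string per group, then one character vector of lines per group
blocks <- strsplit(txt, sep_pattern, perl = TRUE)[[1]]
groups <- strsplit(blocks, "\n", fixed = TRUE)

str(groups)
```

Because the pattern requires the asterisks to occupy a line of their own, a "***" that appears inside a sentence is left alone.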