Question

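I am using the read.delim function, but the lines of text I am reading contain user comments that themselves include commas (","), so a comment gets split into two or more columns.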

Here are two rows from the dataset:

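@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0

I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1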

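The first row is read correctly: the text lands in the first column and "0" in the next. The second row is split into three columns, with "1" in the last one, because the comment itself contains a comma.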

dataset_original = read.delim('TrainingData.csv', 
                              quote = "",
                              row.names = NULL, 
                              stringsAsFactors = FALSE,
                              header = F, as.is = F,
                              colClasses = "character",
                              blank.lines.skip = T,
                              sep = ",")

Answer 1

Read every line as a single string first, then separate the text from the target column.

Try this:

# sep = "\n" makes each whole line a single field, so commas are not split on
df <- read.delim('TrainingData.csv',
                 quote = "",
                 row.names = NULL,
                 stringsAsFactors = FALSE,
                 header = FALSE, as.is = FALSE,
                 colClasses = "character",
                 blank.lines.skip = TRUE,
                 sep = "\n")


df$target = regmatches(df$V1, regexpr(pattern = "[^,]*$", text = df$V1))
df$V1 = sub(pattern = ",[^,]*$", replacement = "", x = df$V1)

Here df stands for dataset_original.

Example:

The file contains:

hello,0
world,1
not,right,1
this,one,is,even,worse,0

This approach returns:

> df
                      V1 target
1                  hello      0
2                  world      1
3              not,right      1
4 this,one,is,even,worse      0
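
To check the approach end to end, the following sketch writes the four sample lines to a temporary file and applies the same two regular-expression steps. The temporary file and the final integer conversion are assumptions added for illustration and are not part of the original answer. Note that a line with no comma at all would end up with its whole text in target, so such lines may need to be checked separately.

```r
# Write the four sample lines to a temporary file (illustrative path).
path <- tempfile(fileext = ".csv")
writeLines(c("hello,0", "world,1", "not,right,1", "this,one,is,even,worse,0"), path)

# Read each line into a single column by using the newline as the separator.
df <- read.delim(path, header = FALSE, sep = "\n", quote = "",
                 colClasses = "character", stringsAsFactors = FALSE)

# The label is everything after the last comma ...
df$target <- regmatches(df$V1, regexpr("[^,]*$", df$V1))
# ... and the text is everything before that last comma.
df$V1 <- sub(",[^,]*$", "", df$V1)

# Assumption: the labels are whole numbers, so store them as integers.
df$target <- as.integer(df$target)
str(df)
```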

Answer 2

If we read the file with readLines(), we can split each line at its last comma.

write(x="@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0

I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1", 
file="file.txt")

gg <- readLines("file.txt")

spl <- strsplit(gg, ",(?=[^,]+$)", perl=TRUE)
dtf <- as.data.frame(do.call(rbind, spl), stringsAsFactors=FALSE)

dtf
#                                                     V1  V2
# 1 @Zillaman u just (...) didnt even think about me!!!!   0
# 2 I must have been (...) family, I believe on Sun...     1
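
The same idea can be wrapped in a small function that locates the last comma with regexpr() and cuts the line with substr() instead of strsplit(). This is only a sketch: the name split_last_comma and the column names text and label are hypothetical, and dropping blank lines and stopping on lines without a comma are assumptions about how the input should be treated.

```r
# Hypothetical helper: split each line at its last comma.
split_last_comma <- function(lines) {
  # Assumption: blank lines carry no data and can be dropped.
  lines <- lines[nzchar(trimws(lines))]
  # Start position of the last comma in each line (-1 if there is none).
  pos <- as.integer(regexpr(",[^,]*$", lines))
  if (any(pos < 0)) {
    stop("no comma found in line(s): ", paste(which(pos < 0), collapse = ", "))
  }
  data.frame(text  = substr(lines, 1, pos - 1),
             label = substr(lines, pos + 1, nchar(lines)),
             stringsAsFactors = FALSE)
}

# The two rows from the question, with a blank line in between.
lines <- c("@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0",
           "",
           "I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1")
dtf <- split_last_comma(lines)
dtf$label
```

In practice lines would come from readLines("file.txt"), as in the answer above.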
