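I am using the read.delim function, but the text lines I am reading also contain user comments that include commas (","), so a comment gets split into two or more columns.
Below are two rows from the dataset: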
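@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0
I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1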
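The first row is read correctly, with "0" landing in the next column. The second row is split into three columns, and the last column contains "1".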
dataset_original = read.delim('TrainingData.csv',
                              quote = "",
                              row.names = NULL,
                              stringsAsFactors = FALSE,
                              header = FALSE, as.is = FALSE,
                              colClasses = "character",
                              blank.lines.skip = TRUE,
                              sep = ",")
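One way to confirm which rows are affected is to count the comma-separated fields on each line. This is only a minimal sketch: it assumes the two sample rows above stand in for the file, and the names `sample_lines` and `n_fields` are made up for illustration.

```r
# Two sample rows from the question, used in place of TrainingData.csv.
sample_lines <- c(
  "@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0",
  "I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1"
)

# count.fields() reports how many fields each line has for a given separator.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = ",", quote = "", comment.char = "")
close(con)

n_fields
# Rows with more than 2 fields have commas inside the comment text.
which(n_fields > 2)
```

On a real file, passing the file name instead of the text connection should give the same kind of per-line count.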
Answer 0 (score: 2)
Try reading every line whole, then separate the text column from the target column.
Try this:
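# With sep = "\n" no line is split on commas,
# so each whole row lands in a single column, V1.
df = read.delim('TrainingData.csv',
                quote = "",
                row.names = NULL,
                stringsAsFactors = FALSE,
                header = FALSE, as.is = FALSE,
                colClasses = "character",
                blank.lines.skip = TRUE,
                sep = "\n")
# The target is whatever follows the last comma.
df$target = regmatches(df$V1, regexpr(pattern = "[^,]*$", text = df$V1))
# Drop the last comma and the target from the text column.
df$V1 = sub(pattern = ",[^,]*$", replacement = "", x = df$V1)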
Here df stands for dataset_original.
With a file containing:
hello,0
world,1
not,right,1
this,one,is,even,worse,0
This approach returns:
> df
                      V1 target
1                  hello      0
2                  world      1
3              not,right      1
4 this,one,is,even,worse      0
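The same idea can be wrapped in a small helper that works on any character vector, which makes it easy to try without a file. This is a sketch only: `split_last_comma` is a hypothetical name, not part of base R, and it assumes every line contains at least one comma and that the labels are whole numbers.

```r
# Hypothetical helper: split each line at its LAST comma.
split_last_comma <- function(lines) {
  data.frame(
    text   = sub(",[^,]*$", "", lines),  # everything before the last comma
    target = sub("^.*,", "", lines),     # everything after the last comma (greedy .*)
    stringsAsFactors = FALSE
  )
}

lines <- c("hello,0", "world,1", "not,right,1", "this,one,is,even,worse,0")
res <- split_last_comma(lines)
res$target <- as.integer(res$target)  # assumption: labels are integers
res
```

For a real file, the input vector could come from readLines() instead of the literal vector used here.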
Answer 1 (score: 1)
If we read the file with readLines(), we can split each line at its last comma.
write(x="@Zillaman u just aite all types of food at Zina crib and didnt even think about me!!!!,0
I must have been only 11 when Mr Peepers started. It was a must see for the whole family, I believe on Sun...,1",
file="file.txt")
gg <- readLines("file.txt")
spl <- strsplit(gg, ",(?=[^,]+$)", perl=TRUE)
dtf <- as.data.frame(do.call(rbind, spl), stringsAsFactors=FALSE)
dtf
# V1 V2
# 1 @Zillaman u just (...) didnt even think about me!!!! 0
# 2 I must have been (...) family, I believe on Sun... 1
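To carry this through to a typed data frame, the columns can be named and the label converted to a number. This is a minimal sketch, assuming each line has a non-empty label after its last comma; the column names `text` and `label` are illustrative choices, and the literal vector stands in for the result of readLines().

```r
# Stand-in for readLines("file.txt").
gg <- c("not,right,1", "this,one,is,even,worse,0")

# Split only at a comma that is followed by no further commas.
spl <- strsplit(gg, ",(?=[^,]+$)", perl = TRUE)

# Guard: every line should yield exactly two pieces (comment, label).
stopifnot(all(lengths(spl) == 2))

dtf <- as.data.frame(do.call(rbind, spl), stringsAsFactors = FALSE)
names(dtf) <- c("text", "label")
dtf$label <- as.integer(dtf$label)  # assumption: labels are integers
str(dtf)
```

The guard matters because a line without any comma, or one ending in a comma, would produce a single piece and rbind would silently recycle it.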