When I try to read the following table into a data frame (data100) like this:
data100 <- read.table(header=TRUE, text='
verb_object SESSION_ID
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373
6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
')
I get the following error:
Error in read.table(header = TRUE, text = "\n verb_object SESSION_ID\n BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370\n BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371\n 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282\n 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282\n 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373\n 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368\n") :
duplicate 'row.names' are not allowed
How can I read it?
After using
lines <- readLines(textConnection(" verb_object SESSION_ID
> data100<-read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='', fill=TRUE)
the result is as follows:
> data100
V1 V2 V3 V4 V5 V6 V7
1 verb_object SESSION_ID NA NA
2 1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370 2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
3 3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282 4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
4 5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373 6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
>
Answer 0 (score: 1)
We can read the lines with readLines, insert quotes with gsub, and then read the result with read.table:
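lines <- readLines(textConnection("verb_object SESSION_ID
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373
6: 6D944D54C54ED75D487288FE1505BB59 INSERT   41595368"))
# Insert a quote after "N:" and another inside the padding before SESSION_ID,
# so that the hash and the verb are read as a single quoted field.
# The lookahead \\s+(?=\\s[0-9]) only matches when at least two whitespace
# characters precede the number, as in the column-aligned printed table.
read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='')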
# verb_object SESSION_ID
#1: BA31C1CC63E5043483FAE25F085E25E5 INSERT 41595370
#2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE 41595371
#3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
#4: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
#5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373
#6: 6D944D54C54ED75D487288FE1505BB59 INSERT 41595368
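As background on the original error, here is a minimal sketch using two of the rows as they appear in the error message (the names txt, n_fields and msg are illustrative and not from the question). When the header has exactly one fewer field than the data rows, read.table takes the first column as row names. Here that column is the hash, and the hash 26D695C8CA82CAFFDF985201F3AA44D7 occurs twice.

```r
# Two rows as they appear in the error message (no "1:" style prefix).
txt <- "verb_object SESSION_ID
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282
26D695C8CA82CAFFDF985201F3AA44D7 UPDATE 41595282"

# Count whitespace-separated fields on each line:
# the header has 2 fields, each data row has 3.
con <- textConnection(txt)
n_fields <- count.fields(con)
close(con)
print(n_fields)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    read.table(text = txt, header = TRUE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Quoting the hash and the verb together, as in the answer above, changes the field count per row and keeps the hash out of the row names.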
The OP's new dataset can be read with readLines as before:
lines <- readLines(textConnection("items newitem
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3
4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4
5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"))
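Note that the pattern we matched for the earlier dataset (\\s+(?=\\s[0-9])) will not work here, because the values in 'SESSION_ID' start with a digit, whereas the values in 'newitem' start with an uppercase letter. Instead, we match one or more characters that are not a : from the start of the string (^[^:]+), followed by a :, followed by one or more spaces (\\s+). We then use parentheses to capture, as one group, one or more non-space characters, followed by one or more spaces, followed by one or more non-space characters (([^ ]+\\s+[^ ]+)). After that we match one or more spaces (\\s+), followed by the remaining characters up to the end of the string as a second capture group ((.*)$). In the replacement, we put quotes around the first capture group ('\\1'), followed by a space and the second capture group.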
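The pattern described above can be assembled into a single sub call. This is a sketch put together from that description, not a verbatim copy of the answer's code. The variable names quoted and newdata are illustrative, and the lines are given as a character vector so that the block runs on its own.

```r
# The new dataset, one element per line (same content as the readLines call).
lines <- c(
  "items newitem",
  "1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1",
  "2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2",
  "3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3",
  "4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4",
  "5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"
)

# ^[^:]+:\\s+           the "N:" prefix and the spaces after it (dropped)
# ([^ ]+\\s+[^ ]+)      group 1: hash, spaces, verb
# \\s+(.*)$             spaces, then group 2: the rest of the line
# The header line has no ":", so it does not match and is left unchanged.
quoted <- sub("^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$", "'\\1' \\2", lines)

# Header and data rows now both have two fields, so no column is taken
# as row names.
newdata <- read.table(text = quoted, header = TRUE, stringsAsFactors = FALSE)
print(newdata)
```

Because the "N:" prefix is part of the match but not of the replacement, the printed row index is dropped and read.table assigns its default row numbers.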