read.table fails with duplicate row.names

Time: 2015-11-01 12:32:58

Tags: r

When I try to read the following table into a data frame (data100) like this:

data100 <- read.table(header=TRUE, text='
                                 verb_object SESSION_ID
1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371
3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373
6:   6D944D54C54ED75D487288FE1505BB59 INSERT   41595368
')

I get the following error:
Error in read.table(header = TRUE, text = "\n                               verb_object SESSION_ID\n   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370\n                      BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371\n                         26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282\n                         26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282\n                     2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373\n                         6D944D54C54ED75D487288FE1505BB59 INSERT   41595368\n") : 
  duplicate 'row.names' are not allowed

How can I read it?

Using

lines <- readLines(textConnection("       verb_object SESSION_ID



> data100<-read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='', fill=TRUE)

the result is as follows:

> data100
           V1                               V2       V3       V4 V5                                         V6       V7
1 verb_object                       SESSION_ID                NA                                                     NA
2         1:  BA31C1CC63E5043483FAE25F085E25E5   INSERT 41595370 2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE  41595371
3         3:  26D695C8CA82CAFFDF985201F3AA44D7   UPDATE 41595282 4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE  41595282
4         5:  2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE 41595373 6:   6D944D54C54ED75D487288FE1505BB59 INSERT  41595368
> 

1 answer:

Answer 0: (score: 1)

We can read the text with readLines, insert quotes with gsub, and then parse the result with read.table:

lines <- readLines(textConnection("verb_object SESSION_ID
1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT   41595370
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE   41595371
3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282
5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE   41595373
6:   6D944D54C54ED75D487288FE1505BB59 INSERT   41595368"))



read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='')
#                                  verb_object SESSION_ID
#1:   BA31C1CC63E5043483FAE25F085E25E5 INSERT    41595370
#2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE    41595371
#3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE    41595282
#4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE    41595282
#5: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE    41595373
#6:   6D944D54C54ED75D487288FE1505BB59 INSERT    41595368
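
To see why this works, it can help to look at the intermediate strings that gsub produces before read.table parses them. The following is a minimal sketch, not part of the original answer: it uses only the header and the two rows that share a hash, and the names `quoted` and `res` are chosen here for illustration.

```r
# Header plus rows 3 and 4 from the example above (both rows share one hash)
lines <- c(
  "verb_object SESSION_ID",
  "3:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282",
  "4:   26D695C8CA82CAFFDF985201F3AA44D7 UPDATE   41595282"
)

# Same pattern and replacement as in the answer: the whitespace after the
# colon, and the whitespace before the trailing number, are each replaced
# by a space plus a single quote
quoted <- gsub("(?<=\\:)\\s+|\\s+(?=\\s[0-9])", " '", lines, perl = TRUE)
print(quoted)

# Each data line now has three fields (label, quoted text, number) while the
# header has two, so read.table takes the first field as row names
res <- read.table(text = quoted, sep = "", stringsAsFactors = FALSE)
str(res)
rownames(res)
```

Because the header has exactly one field fewer than the data rows, read.table treats the first field as row names. After quoting, that field is the unique label ("3:", "4:") rather than the repeated hash, so the duplicate 'row.names' error should no longer occur.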

Update

The OP's new dataset can be read with readLines as before:

lines <- readLines(textConnection("items newitem
1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1
2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2
3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3
4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4
5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"))   

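Note that the pattern we matched in the earlier dataset (\\s+(?=\\s[0-9])) will not work here, because the values in 'SESSION_ID' start with a digit, whereas the values in 'newitem' start with an uppercase letter. Instead, we match one or more characters that are not a : from the beginning of the string (^[^:]+), followed by :, followed by one or more spaces (\\s+). We then use parentheses to capture a group consisting of one or more non-space characters, one or more spaces, and again one or more non-space characters (([^ ]+\\s+[^ ]+)). Next we match one or more spaces (\\s+), followed by any characters up to the end of the string as a second capture group ((.*)$). In the replacement, we put quotes around the first capture group ('\\1'), followed by a space and the second capture group.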
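
The call that goes with this description is not shown above, so the following is only a sketch reconstructed from the description. The assembled pattern, the replacement string "'\\1' \\2", the header = TRUE and stringsAsFactors = FALSE arguments, and the names `quoted` and `data_new` are assumptions made here, not confirmed by the original answer.

```r
# The new dataset as a character vector, one element per line
# (the same content that readLines returns above)
lines <- c(
  "items newitem",
  "1: BA31C1CC63E5043483FAE25F085E25E5 INSERT OV1",
  "2: BECE6374D91D47E6285EFDEBA6D65BB9 DATABASE OV2",
  "3: 26D695C8CA82CAFFDF985201F3AA44D7 UPDATE OV3",
  "4: 2BC5A4199A0DDA16FA17A9CA1AA17C02 DATABASE OV4",
  "5: 6D944D54C54ED75D487288FE1505BB59 INSERT OV5"
)

# Assumed assembly of the pieces described above:
#   ^[^:]+:            row label up to and including the colon (not kept)
#   \\s+               spaces after the colon
#   ([^ ]+\\s+[^ ]+)   group 1: hash, spaces, verb
#   \\s+               spaces before the last field
#   (.*)$              group 2: the rest of the line
# The header line contains no colon, so it is left unchanged.
quoted <- sub("^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$", "'\\1' \\2", lines)
print(quoted)

# The header and the data rows now both have two fields, so header = TRUE
# has to be given explicitly
data_new <- read.table(text = quoted, header = TRUE, stringsAsFactors = FALSE)
str(data_new)
```

In this variant the row labels ("1:", "2:", ...) are matched but not captured, so they are dropped and read.table assigns default row names.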