Extracting a sample of 1,000,000 rows from a large, dirty dataset in R

Time: 2019-02-21 20:13:55

Tags: r

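I am trying to extract a sample of 1 million rows from an 825 MB .CSV file with 15 columns (too large to export in full).

A sample of the data looks like this: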

2016-10-31,2016-12-31,OEM,GRILLE,Grille,Grille,F231062J00,NISS-GRP,G20 AUTOMATIC W/TOURING PKG,7+ YEARS,AZ,Western,GRILLE SET-RADIATOR,1,255.09,255.09
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,TN,South Central,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,WA,Western,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,ME,Eastern,SERVICE FILE  FENDER-REAR,LH,1,1108.79,1108.79
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,CA,Western,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,MO,Central,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,OH,Central,SERVICE FILE  FENDER-REAR,LH,1,1022.67,1022.67
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER SE 4WD,7+ YEARS,CT,Eastern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER SE 4WD,7+ YEARS,NJ,Eastern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER SE 4WD,7+ YEARS,PA,Eastern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER SE 4WD,7+ YEARS,OR,Western,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,MA,Eastern,SERVICE FILE  FENDER-REAR,LH,3,3261.77,1087.26
2017-01-31,2017-03-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER SE 4WD,7+ YEARS,MA,Eastern,SERVICE FILE  FENDER-REAR,LH,2,2152.98,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,NJ,Eastern,SERVICE FILE  FENDER-REAR,LH,3,3229.47,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,GA,Southern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,FL,Southern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-10-31,2016-12-31,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,TX,South Central,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,CA,Western,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,NY,Eastern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,CO,Western,SERVICE FILE  FENDER-REAR,LH,2,2152.98,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,UT,Western,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,GA,Southern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-04-30,2016-06-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,CA,Western,SERVICE FILE  FENDER-REAR,LH,1,1108.79,1108.79
2016-07-31,2016-09-30,OEM,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,VA,Southern,SERVICE FILE  FENDER-REAR,LH,1,1076.49,1076.49
2016-10-31,2016-12-31,RECYCLED,TRUNK LID,Spoiler assy,Spoiler assy graphite,K6030AM817,NISS-GRP,G35 COUPE AUTOMATIC W/LEATHER,7+ YEARS,CA,Western,G35 COUPE REAR SPOILER-WV2,1,200.00,200.00
2016-10-31,2016-12-31,RECYCLED,WHEELS,Wheel, alloy,Wheel, alloy type 1 17" wheel,D03004Y91A,NISS-GRP,MAXIMA SE 20TH ANNIVERSARY EDITION AUTOMATIC,7+ YEARS,MA,Eastern,ALUMINUM WHEEL,3,318.75,106.25
2017-01-31,2017-03-31,RECYCLED,WHEELS,Wheel, alloy,Wheel, alloy type 1 17" wheel,D03004Y91A,NISS-GRP,MAXIMA SE 20TH ANNIVERSARY EDITION AUTOMATIC,7+ YEARS,FL,Southern,ALUMINUM WHEEL,1,375.00,375.00
2016-08-31,2016-10-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,MD,Southern,SERVICE FILE  FENDER-REAR,LH,1,312.50,312.50
2016-05-31,2016-07-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,VA,Southern,SERVICE FILE  FENDER-REAR,LH,1,468.75,468.75
2016-08-31,2016-10-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,WA,Western,SERVICE FILE  FENDER-REAR,LH,1,268.75,268.75
2016-05-31,2016-07-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,MA,Eastern,SERVICE FILE  FENDER-REAR,LH,1,312.50,312.50
2017-02-28,2017-04-30,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,FL,Southern,SERVICE FILE  FENDER-REAR,LH,1,625.00,625.00
2016-11-30,2017-01-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,FL,Southern,SERVICE FILE  FENDER-REAR,LH,1,300.00,300.00
2016-10-31,2016-12-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,CO,Western,SERVICE FILE  FENDER-REAR,LH,1,287.50,287.50
2016-10-31,2016-12-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,WV,Southern,SERVICE FILE  FENDER-REAR,LH,1,375.00,375.00
2016-10-31,2016-12-31,RECYCLED,QUARTER PANEL,Quarter panel,Quarter panel w/o rear spare carrier SE & LE,G81012W730,NISS-GRP,PATHFINDER,7+ YEARS,NY,Eastern,SERVICE FILE  FENDER-REAR,LH,1,437.50,437.50

The main problem is that the data is dirty: some rows have more than 15 columns. I am using the following lines of code:

library(sqldf)
DF <- read.csv.sql("CCC_Data.csv", sql = "select * from file order by random() limit 1000000")

But I get this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : line 24 did not have 17 elements

How can I fix this?

Thanks!

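3 answers:

Answer 0: (score: 0)

You can try filling the missing columns with NA.

For the sample data provided (which has no header), the code below reads the data into a data.table.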

library( data.table )
dt <- data.table::fread( "./test.csv", header = FALSE, fill = TRUE )

Use nrows = 1000000 to limit the read to one million rows.
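Note that nrows takes the first million rows of the file, not a random sample. If a random sample is needed and the padded table fits in memory, one option is to read everything and sample afterwards. The following is a minimal sketch on a toy file, not the real CCC_Data.csv; the toy rows and the name n_sample are illustrative assumptions.

```r
library(data.table)

# Toy stand-in for the real file: rows 2 and 4 carry one extra field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "2016-10-31,OEM,GRILLE,255.09",
  "2016-10-31,OEM,FENDER-REAR,LH,1076.49",
  "2016-07-31,OEM,GRILLE,255.09",
  "2016-04-30,OEM,FENDER-REAR,LH,1108.79"
), tmp)

# fill = TRUE pads rows that are shorter than the widest row instead of failing.
dt <- fread(tmp, header = FALSE, sep = ",", fill = TRUE)

# Draw the random sample after reading; n_sample is an illustrative name.
set.seed(1)
n_sample <- min(2L, nrow(dt))
smp <- dt[sample(.N, n_sample)]
```

Padding only makes the table rectangular. It does not realign the shifted fields, so in the short rows the price still sits one column to the left of where it sits in the long rows.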

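Answer 1: (score: 0)

I read your data in with read_csv and it gave warnings rather than errors. You have missing delimiters in the data, which frame-shifts the columns. People on this forum who are more qualified than I am may give a better answer, but my approach would be:

  1. After importing, look for a column that is predictably inconsistent, meaning you can write some logic to pick out the inconsistent rows. From the data above, rows 1, 26 and 27 look like candidates.

  2. Filter out those rows, so that you can perhaps analyze the remaining data.

  3. Filter for only those rows, so that you can work out the logic for moving their values into the correct columns, then combine the result back with the data from step 2.

The tidyverse verbs filter, select and mutate should help you.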
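The three steps above could be sketched roughly as follows. This sketch uses base R instead of the tidyverse verbs to stay dependency-free, and it works on a toy file. The expected field count and the repair rule (merging the two fields produced by the stray comma in values like "FENDER-REAR,LH") are assumptions for illustration; the real file needs its own rules.

```r
# Toy stand-in for the real file; the expected field count (4) is an assumption.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "2016-10-31,OEM,GRILLE SET-RADIATOR,255.09",
  "2016-10-31,OEM,FENDER-REAR,LH,1076.49",
  "2016-07-31,OEM,ALUMINUM WHEEL,106.25",
  "2016-04-30,OEM,FENDER-REAR,LH,1108.79"
), tmp)
expected <- 4L

# Step 1: count comma-separated fields per raw line to find inconsistent rows.
raw <- readLines(tmp)
n_fields <- lengths(regmatches(raw, gregexpr(",", raw, fixed = TRUE))) + 1L

# Step 2: rows with the expected field count parse cleanly.
clean <- read.csv(text = raw[n_fields == expected], header = FALSE,
                  stringsAsFactors = FALSE)

# Step 3: repair rows with one extra field by merging fields 3 and 4
# (assumed rule), then parse them and combine with the clean rows.
bad <- strsplit(raw[n_fields == expected + 1L], ",", fixed = TRUE)
fixed <- vapply(bad, function(f) {
  paste(c(f[1:2], paste(f[3:4], collapse = " "), f[5]), collapse = ",")
}, character(1))
repaired <- read.csv(text = fixed, header = FALSE, stringsAsFactors = FALSE)

combined <- rbind(clean, repaired)
```

After this, combined holds all four toy rows in four columns, with the clean rows first and the repaired rows appended.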

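Answer 2: (score: 0)

Add a filter argument to keep only the rows that have 17 fields. This assumes gawk is on your PATH; if you have gawk but it is not on the PATH, you can give its absolute path instead. On Windows you may need to install RTools to get gawk.

If there is no header you may also need header = FALSE. If the line endings do not match your platform, you may need eol="\n" or eol="\r\n".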

DF <- read.csv.sql("CCC_Data.csv", 
  sql = "select * from file order by random() limit 1000000",
  filter = 'gawk "NF==17" FS=,')
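If gawk is not available, the same field-count filter can be approximated in base R by streaming the file in chunks and writing the matching lines to a cleaned copy, which can then be passed to read.csv.sql. This is only a sketch: keep_n_fields is a hypothetical helper, not part of sqldf, and like the gawk filter it splits on every comma and ignores quoting.

```r
# keep_n_fields() is a hypothetical helper, not part of sqldf.
# It copies to `outfile` only the lines of `infile` that have exactly `n`
# comma-separated fields, reading `chunk` lines at a time to limit memory use.
keep_n_fields <- function(infile, outfile, n, chunk = 100000L) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({
    close(con_in)
    close(con_out)
  })
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0L) break
    # Field count = number of commas + 1.
    nf <- lengths(regmatches(lines, gregexpr(",", lines, fixed = TRUE))) + 1L
    writeLines(lines[nf == n], con_out)
  }
  invisible(outfile)
}

# Toy usage: keep only the 3-field lines.
src <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "a,b,c,d", "e,f,g", "h,i"), src)
dst <- tempfile(fileext = ".csv")
keep_n_fields(src, dst, n = 3L, chunk = 2L)
kept <- readLines(dst)
```

For the real data the call would use n = 17 to match the filter above, and the cleaned file would take the place of "CCC_Data.csv" in the read.csv.sql call.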