Question

I am having trouble filtering a CSV file.

My dataset and its data structure look something like this:

ID    Date        Time       Product_no  Product_IM_no  Pro_Name
(num) (Date)     (time)        (num)       (num)         (Char)
1    2-Oct-01  00:40 to 1:30   2152.5     71213.4        Aspire
1    2-Oct-01    02:10         21547.9    7122.3         Pla and Aspire ##Remove the row because Pro_name can have either Pluto or Aspire or Pla.
1    2.10.01     02:50         21537.9    7157.8         pluto

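Problems in my CSV file:

The date format is not the DD-MM-YY that R expects: in some places in my dataset it is written as 01-Jan-01, and in others as 01.01.01 (instead of the 01-01-01 that R requires).
The time format in the CSV is also not the expected 00:00:00; in my dataset it is 00:00.
Sometimes my time is a range (00:00 to 10:01), which I want to reduce to 00:00, or else remove the whole row.
The problem with Product_no, Product_IM_no and Pro_Name is that they sometimes hold values other than those allowed by the data structure given above. In that case the whole row should be removed.

I have 20 CSV files like this, each with about 10k rows. I need to fix these issues. How can I solve them in R?

Thanks.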

Answer 1

You have to tidy these up manually. For example, to convert your times you can use the strsplit function to split on " to ":

gg <- strsplit("00:00 to 10:01", " to ", fixed = TRUE)
gg[[1]][1] # "00:00" as output
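
The same idea can be applied to a whole column at once. The following is a minimal sketch rather than a definitive solution; the `times` vector is a hypothetical sample, and it assumes every entry is either a single HH:MM value or a range written with "to".

```r
# Hypothetical sample of the Time column
times <- c("00:40 to 1:30", "02:10", "02:50")

# Keep only the part before "to"; values without a range are left unchanged
start <- sub("\\s+to\\s+.*$", "", times)

# Append seconds so that each value reads HH:MM:SS
clean <- paste0(start, ":00")
clean
```

Using sub instead of strsplit avoids looping over the list that strsplit returns.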

Edit: is the problem with reading the file?

# read.csv already returns a data frame, so no as.data.frame() call is needed
ff <- read.csv("test2.csv", header = TRUE)
for (i in seq_along(ff$Time)) {...}
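
Since the question mentions 20 files, one possible approach is to read them all in one go and clean the combined table. This is only a sketch under assumptions: the directory `csv_dir` and the two demo files are hypothetical stand-ins created in a temporary folder, and all files are assumed to share the same columns.

```r
# Hypothetical demo directory with two small CSV files
csv_dir <- tempfile("csvs")
dir.create(csv_dir)
write.csv(data.frame(ID = 1, Time = "00:40 to 1:30"),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, Time = "02:10"),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

# Read every CSV in the directory into a list of data frames
files  <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Stack them (requires identical column names) and clean Time in one step
all_rows <- do.call(rbind, tables)
all_rows$Time <- sub("\\s+to\\s+.*$", "", all_rows$Time)
all_rows
```

Because sub is vectorised, no explicit for loop over the rows is required.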

Answer 2

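We can subset the rows of the dataset that have only a single word in the 'Pro_Name' column using grepl, remove the substring after the first space in 'Time' using sub, paste the 'Date' and 'Time' columns together, and, using parse_date_time and guess_formats from lubridate, create a 'DateTime' column of class 'POSIXct'.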

library(lubridate)
#subset the dataset rows
#we use \\w+ to match a single word from the beginning of the string '^'
#to the end of the string '$'.  If there are multiple words with spaces
#this returns FALSE.  Coupling it with `subset` keeps only the TRUE rows.
df2 <- subset(df1, grepl('^\\w+$', Pro_Name))
#remove the substring after the first space including the space using sub
df2$Time <- sub('\\s+.*$', '', df2$Time)
#paste the columns
v1 <- do.call(paste, df2[2:3])
#create new columns
df2$DateTime <- parse_date_time(v1, guess_formats(v1, c('dBy hm', 'dmy hm')))
#if we don't want to keep the original 'Date' and 'Time'
#we can remove that as well using `setdiff` on the column names of 'df2'
#and the columns that we don't want in the output
df2[setdiff(names(df2), c('Date', 'Time'))]
#  ID Product_no Product_IM_no Pro_Name            DateTime
#1  1     2152.5       71213.4   Aspire 2001-10-02 00:40:00
#2  1     2152.5       71213.4   Aspire 2001-10-02 01:30:00
#4  1    21537.9        7157.8    pluto 2001-10-02 02:50:00
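
For readers who prefer to avoid the extra package, the date clean-up and the row validation asked about in the question could be sketched in base R roughly as follows. Everything here is an assumption for illustration: `rows` is a hypothetical sample with all columns read as text, `allowed` is an assumed list of valid product names matched case-insensitively, and `%b` relies on an English locale for month abbreviations.

```r
# Hypothetical sample; all columns read as text,
# e.g. via read.csv(..., colClasses = "character")
rows <- data.frame(
  Date       = c("2-Oct-01", "2.10.01", "2-Oct-01", "2.10.01"),
  Product_no = c("2152.5", "21537.9", "abc", "21547.9"),
  Pro_Name   = c("Aspire", "pluto", "Aspire", "Pla and Aspire"),
  stringsAsFactors = FALSE
)

# Dates: try each layout on every element; unmatched elements stay NA
d1 <- as.Date(rows$Date, format = "%d-%b-%y")  # e.g. 2-Oct-01
d2 <- as.Date(rows$Date, format = "%d.%m.%y")  # e.g. 2.10.01
rows$Date <- d1
rows$Date[is.na(d1)] <- d2[is.na(d1)]

# Validation: numeric-looking product number and an allowed product name
allowed <- c("pluto", "aspire", "pla")  # assumed list of valid names
is_num  <- function(x) grepl("^[0-9]+(\\.[0-9]+)?$", x)
keep <- !is.na(rows$Date) &
  is_num(rows$Product_no) &
  tolower(rows$Pro_Name) %in% allowed

# Drop invalid rows and write dates back as DD-MM-YY
valid <- rows[keep, ]
valid$Date <- format(valid$Date, "%d-%m-%y")
valid
```

Checking names against an explicit list is stricter than the single-word test above, which would also accept any other one-word value.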

Data

df1 <-structure(list(ID = c(1L, 1L, 1L, 1L), Date = c("2-Oct-01",
"2-Oct-01", 
"2-Oct-01", "2.10.01"), Time = c("00:40  to 1:30", "01:30", "02:10", 
"02:50"), Product_no = c(2152.5, 2152.5, 21547.9, 21537.9),
 Product_IM_no = c(71213.4, 
 71213.4, 7122.3, 7157.8), Pro_Name = c("Aspire", "Aspire",
 "Pla and Aspire", 
"pluto")), .Names = c("ID", "Date", "Time", "Product_no", 
"Product_IM_no", 
"Pro_Name"), class = "data.frame", row.names = c(NA, -4L))

Filtering/validating a CSV file in R
