In my data, observations exist for some IDs in some months but not in others, for example:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01'))
colnames(dat) <- c('id', 'value', 'date')
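For each id, I want to insert a row containing the id and an NA value for every month in which that id has no observation.
Is there a (somewhat) concise way to do this for all months in seq(min(as.Date(dat$date)), max(as.Date(dat$date)), by = 'months')? I mostly use tidyverse and data.table, but I'm open to any approach.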
Answer 0 (score: 6)
tidyr::complete() fills in the missing combinations. Pass id and date as the columns to expand over:
library(tidyverse)
complete(dat, id, date)
# A tibble: 16 x 3
id date value
<dbl> <date> <dbl>
1 1.00 2017-01-01 30.0
2 1.00 2017-02-01 30.0
3 1.00 2017-03-01 NA
4 1.00 2017-04-01 25.0
5 2.00 2017-01-01 NA
6 2.00 2017-02-01 25.0
7 2.00 2017-03-01 NA
8 2.00 2017-04-01 NA
9 3.00 2017-01-01 25.0
10 3.00 2017-02-01 25.0
11 3.00 2017-03-01 25.0
12 3.00 2017-04-01 NA
13 4.00 2017-01-01 20.0
14 4.00 2017-02-01 20.0
15 4.00 2017-03-01 NA
16 4.00 2017-04-01 20.0
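One thing to watch: complete(dat, id, date) only expands over dates that already occur somewhere in the data. It works here because every month from January to April is observed for at least one id. If a month could be missing for all ids, a sketch like the following passes the full monthly sequence explicitly. The name all_months is mine, and the data frame is rebuilt with date as a Date column so that seq() can be applied to it.

```r
library(tidyr)

# Example data from the question, with date stored as a Date column
dat <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01',
                   '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
                   '2017-02-01', '2017-04-01'))
)

# Every month between the first and last observed date (assumed helper name)
all_months <- seq(min(dat$date), max(dat$date), by = 'month')

# Expand over each id crossed with the full month sequence;
# id/month pairs that are not in dat get NA in value
result <- complete(dat, id, date = all_months)
```

With the example data this should give the same 16 rows as above, since no month is missing for all ids.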
Answer 1 (score: 2)
library(tidyverse)
dat %>%
group_by(id) %>%
complete(date) %>%
ungroup()
id date value
<dbl> <fct> <dbl>
1 1.00 2017-01-01 30.0
2 1.00 2017-02-01 30.0
3 1.00 2017-03-01 NA
4 1.00 2017-04-01 25.0
5 2.00 2017-01-01 NA
6 2.00 2017-02-01 25.0
7 2.00 2017-03-01 NA
8 2.00 2017-04-01 NA
9 3.00 2017-01-01 25.0
10 3.00 2017-02-01 25.0
11 3.00 2017-03-01 25.0
12 3.00 2017-04-01 NA
13 4.00 2017-01-01 20.0
14 4.00 2017-02-01 20.0
15 4.00 2017-03-01 NA
16 4.00 2017-04-01 20.0
This works directly off your example data.
Answer 2 (score: 1)
Here is an approach using expand.grid and merge:
dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), as.Date(c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01', '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
'2017-02-01', '2017-04-01')))
colnames(dat) <- c('id', 'value', 'date')
# Every month between the first and last observed date
date_range <- seq(min(dat$date), max(dat$date), by = 'months')
# Use unique ids so each id/month pair appears exactly once in the grid
dat_expanded <- expand.grid(date_range, unique(dat$id))
colnames(dat_expanded) <- c("date", "id")
# Keep every row of the grid; months without an observation get NA in value
result <- merge(dat, dat_expanded, by = c("id", "date"), all.y = TRUE)
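Since the question also mentions data.table, the same cross-join-then-merge idea can be sketched there with CJ() and a join. This is only an illustration under my own naming (dt, all_months and grid are not from the question), not a tested replacement for the base R version.

```r
library(data.table)

# Example data from the question as a data.table, with date as a Date column
dt <- data.table(
  id = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  value = c(rep(30, 2), rep(25, 5), rep(20, 3)),
  date = as.Date(c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01',
                   '2017-01-01', '2017-02-01', '2017-03-01', '2017-01-01',
                   '2017-02-01', '2017-04-01'))
)

# Every month between the first and last observed date
all_months <- seq(min(dt$date), max(dt$date), by = 'month')

# Cross join of unique ids and all months: one row per id/month pair
grid <- CJ(id = unique(dt$id), date = all_months)

# Join dt onto the grid; pairs with no observation get NA in value
result <- dt[grid, on = .(id, date)]
```

The join returns one row per row of grid, so ids keep a row for every month even where dt has no matching observation.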