I have a df that gives, for each ID, its create_date and (if any) delete_date.
Structure:

ID  create_date1  create_date2  delete_date1  delete_date2
1   01-01-2014    NA            NA            NA
2   01-04-2014    01-08-2014    01-05-2014    NA
Rules/logic:
Expected output:

ID  2014-01  2014-02  2014-03  2014-04  2014-05  2014-06  2014-07  2014-08
1   1        1        1        1        1        1        1        1
2   0        0        0        1        1        0        0        1

A 1 means the user was charged/active in that month; if there is no delete date, the 1s run up to the current date.
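The per-interval rule above can be sketched in base R. `active_months` is a hypothetical helper (not from the question), assuming dd-mm-yyyy input and that a missing delete date means the ID stays active through the last month of the window:

```r
# Hypothetical helper: 1 for each month an ID is active, 0 otherwise.
# `months` holds the first day of every month in the reporting window.
active_months <- function(create, delete, months) {
  create <- as.Date(create, format = "%d-%m-%Y")
  delete <- as.Date(delete, format = "%d-%m-%Y")
  # floor both dates to the first of their month so the comparison is inclusive
  m_create <- as.Date(format(create, "%Y-%m-01"))
  m_delete <- if (is.na(delete)) max(months) else as.Date(format(delete, "%Y-%m-01"))
  as.integer(months >= m_create & months <= m_delete)
}

months <- seq(as.Date("2014-01-01"), as.Date("2014-08-01"), by = "month")
active_months("01-01-2014", NA_character_, months)  # ID 1: active in all eight months
active_months("01-04-2014", "01-05-2014", months)   # ID 2, first pair: April and May only
```

Multiple create/delete pairs per ID would then be combined with `pmax()` over the per-pair flag vectors.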
Problem:
I have been struggling with this and cannot even work out how to approach it. My earlier attempt was also far too slow.

Previous solution:
- Reshape the dataset to long format
- Insert a date sequence for each ID as a new column
- Fill the remaining rows the same way via lag(status)

ID  create_date  delete_date  sequence  status
1   01-01-2014   NA           2014-01   1
1   01-01-2014   NA           2014-02   1
1   01-01-2014   NA           2014-03   1
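That long-format construction can be sketched with dplyr/tidyr. This is only a sketch under two assumptions not in the question: one create/delete pair per row, and 2014-08 hard-coded as the last month; column names are illustrative:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID          = c(1, 2),
  create_date = as.Date(c("2014-01-01", "2014-04-01")),
  delete_date = as.Date(c(NA, "2014-05-01"))
)

long <- df %>%
  rowwise() %>%
  # one row per month from create_date to delete_date (or the last month if NA)
  mutate(sequence = list(seq(create_date,
                             coalesce(delete_date, as.Date("2014-08-01")),
                             by = "month"))) %>%
  ungroup() %>%
  unnest(sequence) %>%
  mutate(sequence = format(sequence, "%Y-%m"), status = 1L)

# widen to one column per month, filling inactive months with 0
wide <- long %>%
  select(ID, sequence, status) %>%
  pivot_wider(names_from = sequence, values_from = status, values_fill = 0L)
```

With the sample data this yields 8 long rows for ID 1 and 2 for ID 2, and `wide` matches the expected-output shape (ID plus one 0/1 column per month).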
Answer 0 (score: 2)
Probably not the most efficient way. This assumes everything falls within a single year (it can easily be extended):
# convert all dates to Date format
df[, colnames(df[-1])] <- lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))

# extract the month
library(lubridate)
df[, colnames(df[-1])] <- lapply(colnames(df[-1]), function(x) month(df[[x]]))

# df
#   ID create_date1 create_date2 delete_date1 delete_date2
# 1  1            1           NA           NA           NA
# 2  2            4            8            5           NA

# get the current month
current.month <- month(Sys.Date())
# assume for now the current month is 9
current.month <- 9

flags <- rep(FALSE, current.month)

func <- function(x) {
  x[is.na(x)] <- current.month  # replace all NA with the current month (9)
  create.columns.indices <- x[grepl("create_date", colnames(df[-1]))]  # extract the create months
  delete.columns.indices <- x[grepl("delete_date", colnames(df[-1]))]  # extract the delete months
  flags <- pmin(1, colSums(t(sapply(seq_along(create.columns.indices),
                                    function(i) {
                                      flags[create.columns.indices[i]:delete.columns.indices[i]] <- TRUE
                                      flags
                                    }))))
  flags
}

df1 <- cbind(df$ID, t(apply(df[-1], 1, func)))
colnames(df1) <- c("ID", paste0("month", 1:current.month))

# df1
#      ID month1 month2 month3 month4 month5 month6 month7 month8 month9
# [1,]  1      1      1      1      1      1      1      1      1      1
# [2,]  2      0      0      0      1      1      0      0      1      1