Based on multiple columns

Time: 2017-01-06 04:54:27

Tags: r dataframe

I have a df that gives the create_date and delete_date (if any) for a given ID.

Structure:

ID create_date1 create_date2 delete_date1  delete_date2
1  01-01-2014   NA           NA            NA    
2  01-04-2014   01-08-2014   01-05-2014    NA
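  • The create_date and delete_date columns go up to 10, i.e. the columns create_date10 and delete_date10 exist

Rules/logic:

  • Users are charged by month: even if a user is created on the 30th of a month, the user is still counted as active for that whole month (the charges are very low)
  • If a user has a delete date in a given month (on whatever day), the user is not charged from the next month onward
  • If a user has only a create_date and no delete_date, the user is charged for every month, including the create month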
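
One way to read the rules above is sketched below in base R for a single user. The function name `active_months`, the helper `d`, and the assumption that the k-th create date pairs with the k-th delete date are all illustrative and not part of the original post.

```r
# Hypothetical helper: month-level activity flags for one user.
# creates, deletes: Date vectors of equal length (NA where absent), paired by position.
# months: Date vector holding the first day of each month to report.
active_months <- function(creates, deletes, months) {
  month_start <- function(x) as.Date(format(x, "%Y-%m-01"))
  flags <- integer(length(months))
  for (i in which(!is.na(creates))) {
    from <- month_start(creates[i])
    # No delete date: the user stays active through the last reported month.
    to <- if (is.na(deletes[i])) max(months) else month_start(deletes[i])
    # The create month and the delete month are both charged.
    flags[months >= from & months <= to] <- 1L
  }
  flags
}

months <- seq(as.Date("2014-01-01"), as.Date("2014-08-01"), by = "month")
d <- function(x) as.Date(x, format = "%d-%m-%Y")

# ID 1 and ID 2 from the structure shown above
id1 <- active_months(d(c("01-01-2014", NA)), d(c(NA_character_, NA_character_)), months)
id2 <- active_months(d(c("01-04-2014", "01-08-2014")), d(c("01-05-2014", NA)), months)
```

Under these assumptions `id1` is all ones and `id2` is 0 0 0 1 1 0 0 1, which agrees with the expected output in the question.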

Expected output

ID 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08
1  1       1       1       1       1       1       1       1
2  0       0       0       1       1       0       0       1
  • and so on, up to the current date

  • 1 means the user is charged/active in that month

Problem:

I have been trying to do this, but I can't even work out how to approach it. My previous approach was a bit too slow.

Previous solution:

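  1. Make the dataset tall (long format)

  2. Insert a sequence of dates for each ID as a new column

  3. Use a for loop to set the status of each row, per ID:
     • status is 1 if create_date equals sequence
     • status is 0 if lag(delete_date) equals sequence
     • otherwise status is the same as lag(status)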

    ID create_date  delete_date sequence  status?
    1  01-01-2014   NA          2014-01   1
    1  01-01-2014   NA          2014-02   1
    1  01-01-2014   NA          2014-03   1
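
Steps 1 and 2 of that approach might look roughly like the following in base R. This is only a sketch: the toy `df`, the value of `n_pairs`, and the names `long` and `long_seq` are assumptions made for illustration, and only two of the ten create/delete pairs are included.

```r
# Toy data in the question's wide layout (only two create/delete pairs shown).
df <- data.frame(
  ID = c(1, 2),
  create_date1 = c("01-01-2014", "01-04-2014"),
  create_date2 = c(NA, "01-08-2014"),
  delete_date1 = c(NA, "01-05-2014"),
  delete_date2 = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Step 1: wide to long, one row per (ID, create/delete pair).
n_pairs <- 2
long <- do.call(rbind, lapply(seq_len(n_pairs), function(k) {
  data.frame(
    ID = df$ID,
    create_date = as.Date(df[[paste0("create_date", k)]], format = "%d-%m-%Y"),
    delete_date = as.Date(df[[paste0("delete_date", k)]], format = "%d-%m-%Y")
  )
}))
# Pairs without a create date carry no information, so drop them.
long <- long[!is.na(long$create_date), ]

# Step 2: attach the month sequence to every remaining row.
# merge() with no shared column names returns the Cartesian product.
months <- format(seq(as.Date("2014-01-01"), as.Date("2014-08-01"), by = "month"), "%Y-%m")
long_seq <- merge(long, data.frame(sequence = months))
```

With the toy data, `long` has three rows (one for ID 1, two for ID 2) and `long_seq` has one row per combination of those rows and the eight months.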
    

2 Answers:

Answer 0: (score: 2)

This may not be efficient, and it assumes the data covers just one year (it can easily be extended):

# convert all dates to Date format
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) as.Date(df[[x]], format = "%d-%m-%Y"))
# extract the month
library(lubridate)
df[,colnames(df[-1])] = lapply(colnames(df[-1]), function(x) month(df[[x]]))
# df
#  ID create_date1 create_date2 delete_date1 delete_date2
#1  1            1           NA           NA           NA
#2  2            4            8            5           NA

# get the current month 
current.month <- month(Sys.Date())
# assume for now current month is 9
current.month <- 9

func <- function(x){
  create.months <- x[grepl("create_date", names(x))] # extract the create_months
  delete.months <- x[grepl("delete_date", names(x))] # extract the delete_months
  flags <- rep(0, current.month)                     # one flag per month, all inactive
  # only pairs that actually have a create date can switch months on
  for (i in which(!is.na(create.months))) {
    # no delete date: the user stays active through the current month
    end <- if (is.na(delete.months[i])) current.month else delete.months[i]
    flags[create.months[i]:end] <- 1
  }
  flags
}
df1 = cbind(df$ID , t(apply(df[-1], 1, func)))
colnames(df1) = c("ID", paste0("month",1:current.month))
# df1
#     ID month1 month2 month3 month4 month5 month6 month7 month8 month9
#[1,]  1      1      1      1      1      1      1      1      1      1
#[2,]  2      0      0      0      1      1      0      0      1      1
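
The answer says the single-year assumption can easily be extended. One possible way, sketched here as an assumption rather than as the answerer's own method, is to replace the calendar month with a running month index counted from a base month. The helper name `month_index` and the base date are hypothetical.

```r
# Hypothetical helper: months elapsed since a base month, counting the base month as 1.
month_index <- function(d, base = as.Date("2014-01-01")) {
  y <- as.integer(format(d, "%Y")) - as.integer(format(base, "%Y"))
  m <- as.integer(format(d, "%m")) - as.integer(format(base, "%m"))
  12L * y + m + 1L
}

idx <- month_index(as.Date(c("2014-01-15", "2014-12-31", "2015-03-01")))
# idx is 1, 12, 15
```

Using `month_index(df[[x]])` in place of `month(df[[x]])`, and `month_index(Sys.Date())` for `current.month`, would let the same flag-filling logic span several years, since the indices keep increasing across year boundaries.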

Answer 1: (score: 1)

Here is a rather long tidy approach:

import json
import scrapy


class TaleoSpider(scrapy.Spider):
    name = 'taleo'
    start_urls = ['https://ngc.taleo.net/careersection/ngc_pro/jobsearch.ftl?lang=en#']
    # baseform with base search values
    base_form = {'advancedSearchFiltersSelectionParam':
        {'searchFilterSelections': [
            {'id': 'ORGANIZATION', 'selectedValues': []},
            {'id': 'LOCATION', 'selectedValues': []},
            {'id': 'JOB_FIELD', 'selectedValues': []},
            {'id': 'URGENT_JOB', 'selectedValues': []},
            {'id': 'EMPLOYEE_STATUS', 'selectedValues': []},
            {'id': 'STUDY_LEVEL', 'selectedValues': []},
            {'id': 'WILL_TRAVEL', 'selectedValues': []},
            {'id': 'JOB_SHIFT', 'selectedValues': []},
            {'id': 'JOB_NUMBER', 'selectedValues': []}]},
        'fieldData': {'fields': {'JOB_TITLE': '', 'KEYWORD': '', 'LOCATION': ''},
                      'valid': True},
        'filterSelectionParam': {'searchFilterSelections': [{'id': 'POSTING_DATE',
                                                             'selectedValues': []},
                                                            {'id': 'LOCATION', 'selectedValues': []},
                                                            {'id': 'JOB_FIELD', 'selectedValues': []},
                                                            {'id': 'JOB_TYPE', 'selectedValues': []},
                                                            {'id': 'JOB_SCHEDULE', 'selectedValues': []},
                                                            {'id': 'JOB_LEVEL', 'selectedValues': []}]},
        'multilineEnabled': False,
        'pageNo': 1,  # <--- change this for pagination
        'sortingSelection': {'ascendingSortingOrder': 'false',
                             'sortBySelectionParam': '3'}}

    def parse(self, response):
        # we got cookies from first start url now lets request into the search api
        # copy base form for the first request
        form = self.base_form.copy()
        yield scrapy.Request('https://ngc.taleo.net/careersection/rest/jobboard/searchjobs?lang=en&portal=2160420105',
                             body=json.dumps(self.base_form),
                             # add headers to indicate we are sending a json package
                             headers={'Content-Type': 'application/json',
                                      'X-Requested-With': 'XMLHttpRequest'},
                             # scrapy.Request defaults to 'GET', but we want 'POST' here
                             method='POST',
                             # load our form into meta so we can reuse it later
                             meta={'form': form},
                             callback=self.parse_items)

    def parse_items(self, response):
        data = json.loads(response.body)
        # scrape data
        for item in data['requisitionList']:
            yield item

        # next page
        # get our form back and update the page number in it
        form = response.meta['form']
        form['pageNo'] += 1
        # check if paging is over: is our next page past the last page?
        # ceiling division, so a partially filled last page still counts
        paging = data['pagingData']
        max_page = -(-paging['totalCount'] // paging['pageSize'])
        if form['pageNo'] > max_page:
            return
        yield scrapy.Request('https://ngc.taleo.net/careersection/rest/jobboard/searchjobs?lang=en&portal=2160420105',
                             body=json.dumps(form),
                             headers={'Content-Type': 'application/json',
                                      'X-Requested-With': 'XMLHttpRequest'},
                             method='POST',
                             meta={'form': form},
                             callback=self.parse_items)