Question

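I have two datasets. FIRST is a list of a supplier's products and their daily prices, and SECOND is a list of start dates and end dates (along with other data that matter for the analysis). How can I tell Stata to pull the price on the start date and the price on the end date from FIRST into SECOND for the given dates? Note that if there is no exactly matching date, I want it to grab the last available date before it. For example, if SECOND has a date of 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ... it would grab the price from 12/31/2012.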

I would normally do this in Excel, but I have millions of observations, so that is not feasible.

I have included an example of FIRST and SECOND, along with the ideal result as the output POST_SECOND.

FIRST
 Product          Price              Date
   1               3                1/1/2010
   1               3                1/3/2010
   1               4                1/4/2010
   1               2                1/8/2010
   2               1                1/1/2010
   2               5                2/5/2010
   3               7                12/26/2009
   3               2                1/1/2010
   3               6                4/3/2010

SECOND
Product          Start Date          End Date
   1              1/3/2010            1/4/2010
   2              1/1/2010            1/1/2010
   3              12/26/2009          4/3/2010

POST_SECOND
 Product         Start Date          End Date      Price_Start     Price_End
   1              1/3/2010            1/4/2010          3             4
   2              1/1/2010            1/1/2010          1             1
   3              12/26/2009          4/3/2010          7             6

Answer 1

Here is a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.

/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product         byte Price            str12  str_date
   1               3                "1/1/2010"
   1               3                "1/3/2010"
   1               4                "1/4/2010"
   1               2                "1/8/2010"
   2               1                "1/1/2010"
   2               5                "2/5/2010"
   3               7                "12/26/2009"
   3               7                "12/28/2009"
   3               2                "1/1/2010"
   3               6                "4/3/2010"
   4               8                "12/30/2012"
   4               9                "12/31/2012"
   4               10               "1/2/2013"  
   4               10               "1/3/2013"  
 end

gen Date = date(str_date,"MDY")
format Date %td
drop str_date    
save "First.dta", replace

clear 
input byte Product          str12 str_Start_Date        str12  str_End_Date
   1              "1/3/2010"            "1/4/2010"
   2              "1/1/2010"            "1/1/2010"
   3              "12/27/2009"          "4/3/2010"
   4              "1/1/2013"            "1/2/2013"
end

gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace

/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen

bys Product: egen ads = min(abs(Start_Date-Date))
bys Product: egen ade = min(abs(End_Date - Date))
keep if (ads==abs(Date - Start_Date) & Date <= Start_Date) | (ade==abs(Date - End_Date) & Date <= End_Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs

*Some people are reshapers. Others are collapsers. Usually either gets the job done, but I think collapse is easier in this case.
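
One case worth checking in the code above: `min(abs(...))` takes the minimum over price dates on both sides of the target. If the nearest price date falls after the target while the last earlier one is further away, no row satisfies the `keep` condition for that target. The following is a minimal sketch of a variant that measures the gap only over price dates on or before the target. It is not part of the original answer; the product number, the toy prices, and the helper variable names (`gap_start`, `min_gap_start`, `p_start`, and so on) are made up for illustration.

```stata
* Toy data: the last price before 1/1/2013 is three days earlier,
* while the nearest price overall is one day later
clear
input byte Product byte Price str12 str_date
   5   8   "12/29/2012"
   5   9   "1/2/2013"
end
gen Date = date(str_date, "MDY")
format Date %td
drop str_date
tempfile first
save `first'

clear
input byte Product str12 str_start str12 str_end
   5   "1/1/2013"   "1/2/2013"
end
gen Start_Date = date(str_start, "MDY")
gen End_Date   = date(str_end, "MDY")
format Start_Date End_Date %td
drop str_*
tempfile second
save `second'

use `first', clear
merge m:1 Product using `second', keep(match) nogen

* Days from each price date up to the target; missing if the price date is later
gen gap_start = Start_Date - Date if Date <= Start_Date
gen gap_end   = End_Date   - Date if Date <= End_Date

* Smallest non-missing gap within each product (egen's min() ignores missings)
bysort Product: egen min_gap_start = min(gap_start)
bysort Product: egen min_gap_end   = min(gap_end)

* Keep the price only on the row that attains the smallest gap
gen p_start = Price if gap_start == min_gap_start & !missing(gap_start)
gen p_end   = Price if gap_end   == min_gap_end   & !missing(gap_end)

* One non-missing value per product and target, so max simply picks it up
collapse (max) Price_Start = p_start Price_End = p_end, by(Product Start_Date End_Date)
list, clean noobs
```

With this toy input the start price comes from 12/29/2012 and the end price from 1/2/2013. Like the original, the sketch assumes one row per product in SECOND and unique dates within each product in FIRST. A target with no earlier price date ends up with a missing price rather than a borrowed one.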

Answer 2

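In Stata, I have never been able to get this to work well in a single step (you can do it in SAS via an SQL call). In any case, I think you are best off creating an intermediate file from FIRST.dta and then merging twice, once on each of the StartDate and EndDate variables in SECOND.dta.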

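Suppose you have price-adjustment data from January 1, 2010 to December 31, 2013 (specified at irregular intervals, as shown above). I assume all date variables in FIRST.dta and SECOND.dta are already in date format, and that the variable names in SECOND contain no spaces.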

tempfile prod prices

use FIRST.dta, clear
keep Product
duplicates drop
save `prod'

clear
set obs 1461 /* one row per day, 1 Jan 2010 through 31 Dec 2013 */
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'

merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step  */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1] /* carry the last known price forward within each product */
save `prices'

use SECOND.dta, clear
foreach i in Start End {
    rename `i'Date Date
    merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
    rename Price Price_`i'
    rename Date `i'Date
}

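If I understand your data structure correctly, this should work, and it should address the issue discussed in the comments on @Dimitriy's answer. I am open to critiques on how to make this better, as I have had to do this several times and this is how I usually go about it.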
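
The daily calendar in the code above is sized by a hard-coded `set obs`. A possible alternative, sketched here under the assumption that the price file holds a numeric daily date variable named `Date`, is to read the range off the data with `summarize`. The toy input and the local macro names `first_day` and `n_days` are illustrative only.

```stata
* Toy price data standing in for FIRST.dta
clear
input byte Product byte Price str12 str_date
   1   3   "1/1/2010"
   1   4   "1/4/2010"
end
gen Date = date(str_date, "MDY")
format Date %td
drop str_date

* r(min) and r(max) hold the first and last observed price dates
summarize Date, meanonly
local first_day = r(min)
local n_days    = r(max) - r(min) + 1

* One row per calendar day across the observed range
clear
set obs `n_days'
gen Date = `first_day' + _n - 1
format Date %td
list, clean noobs
```

A calendar built this way stops at the last price date. If SECOND contains dates beyond that point, the range would need to be extended to cover them before the merge, otherwise those rows would have no match.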
