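I've been opening a .csv file (exported from MS SQL 2012) in Excel and using formulas. My data jumped from 300K rows to 3.5 million and no longer fits. (Cue laughter.)
I've been playing with R and looked closely at dplyr's mutate. However, what I need to do seems to go a step beyond R's awesome data manipulation.
I add new columns based on logical operations on the next row; sometimes they are numbers, sometimes strings.
I'm a Python newb and have a hunch it may be a better tool than R for this particular task, or maybe not.
I've searched all over but haven't found an example similar to the problem I'm facing.
I'd like to take this source.csv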
id,event,eventDate,direction
id1,apple,1977-06-26 00:00:00.000,positive
id1,apple,1980-07-01 00:00:00.000,positive
id1,candy,1980-05-01 00:00:00.000,negative
id1,apple,1980-11-21 00:00:00.000,positive
id2,fruit,1980-06-26 00:00:00.000,positive
id2,cookie,1990-06-26 00:00:00.000,negative
id2,cavity,1991-07-15 00:00:00.000,negative
id2,apple,1991-07-16 00:00:00.000,positive
id2,apple,1997-01-16 00:00:00.000,positive
id3,cookie,2010-04-20 00:00:00.000,negative
id4,cookie,2010-04-20 00:00:00.000,negative
id4,cookie,2010-04-20 00:00:01.000,negative
and create this output.csv
id,event,eventDate,direction,idEventNumber,nextEvent,daysUntilNextEvent
id1,apple,1977-06-26 00:00:00.000,positive,1000,negative,1040
id1,apple,1980-07-01 00:00:00.000,positive,1001,positive,143
id1,candy,1980-05-01 00:00:00.000,negative,1002,positive,61
id1,apple,1980-11-21 00:00:00.000,positive,1003,noFurtherEvent,-1
id2,fruit,1980-06-26 00:00:00.000,positive,1000,negative,3652
id2,cookie,1990-06-26 00:00:00.000,negative,1001,negative,384
id2,cavity,1991-07-15 00:00:00.000,negative,1002,positive,1
id2,apple,1991-07-16 00:00:00.000,positive,1003,positive,2011
id2,apple,1997-01-16 00:00:00.000,positive,1004,noFurtherEvent,-1
id3,cookie,2010-04-20 00:00:00.000,negative,1000,noFurtherEvent,-1
id4,cookie,2010-04-20 00:00:00.000,negative,1000,negative,0
id4,cookie,2010-04-20 00:00:01.000,negative,1001,noFurtherEvent,-1
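My new columns would be
- number the events (start at 1000; if the next row's id matches, add one, otherwise restart at 1000)
- copy the next event (its direction), if one exists
- count daysUntilNextEvent (arithmetic between the MSSQL datetime values, no fractional days, -1 for the last event)
How would you tackle this?
Thanks for your time|thoughts|encouragement|pointers|examples.
Correction: the original output.csv sample above contained errors. It has since been corrected, but only after many quick responses had come in, which is why their valid questions and comments may now look out of place.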
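For reference, the whole-day arithmetic described in the third bullet can be sketched in Python as follows. This is only a sketch: the helper name days_until and the constant MSSQL_FORMAT are made up for illustration, and it assumes every eventDate has exactly the shape shown in source.csv and that the following event is not earlier than the current one.

```python
from datetime import datetime

# Assumed shape of the eventDate strings in source.csv, e.g. "1980-05-01 00:00:00.000".
MSSQL_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def days_until(current, following):
    """Whole days from one eventDate string to the next (fractions are dropped)."""
    delta = datetime.strptime(following, MSSQL_FORMAT) - datetime.strptime(current, MSSQL_FORMAT)
    return delta.days

print(days_until("1977-06-26 00:00:00.000", "1980-05-01 00:00:00.000"))  # 1040
print(days_until("2010-04-20 00:00:00.000", "2010-04-20 00:00:01.000"))  # 0
```

The second call shows why the two id4 rows, one second apart, end up with 0 days between them.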
Answer 0 (score: 3)
Here's how I would do it using data.table:
require(data.table) ## 1.9.4+
DT = fread("input.csv")[, eventDate := as.Date(eventDate)] ## -(1)
DT[order(id, eventDate),                                    ## -(2)
   `:=`(idEventNumber = seq.int(1000L, length.out=.N),
        nextEvent = c(tail(direction, -1L), "noFurtherEvent"),
        daysUntilNextEvent = c(diff(eventDate), -1L)),
   by=id]
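1. First, we use fread, a fast file reader, to read the csv, and convert eventDate from character to Date format.
2. Then we order by id, eventDate so that the dates are in increasing order. In that order, we group by id and add three columns by reference, that is, the columns are added to DT in place.
- idEventNumber: we start from 1000 and keep incrementing for a length of .N, a special variable that holds the number of observations in each group.
- nextEvent: we take all values of direction except the first for that group, and append noFurtherEvent as the last value.
- daysUntilNextEvent: we take the differences between consecutive eventDate values for the group, and append -1L for the last observation.
Note that the input order is preserved, while the days are computed in date order.
Here's the output: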
# id event eventDate direction idEventNumber nextEvent daysUntilNextEvent
# 1: id1 apple 1977-06-26 positive 1000 negative 1040
# 2: id1 apple 1980-07-01 positive 1002 positive 143
# 3: id1 candy 1980-05-01 negative 1001 positive 61
# 4: id1 apple 1980-11-21 positive 1003 noFurtherEvent -1
# 5: id2 fruit 1980-06-26 positive 1000 negative 3652
# 6: id2 cookie 1990-06-26 negative 1001 negative 384
# 7: id2 cavity 1991-07-15 negative 1002 positive 1
# 8: id2 apple 1991-07-16 positive 1003 positive 2011
# 9: id2 apple 1997-01-16 positive 1004 noFurtherEvent -1
# 10: id3 cookie 2010-04-20 negative 1000 noFurtherEvent -1
# 11: id4 cookie 2010-04-20 negative 1000 negative 0
# 12: id4 cookie 2010-04-20 negative 1001 noFurtherEvent -1
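The same idea, visiting each id's rows in date order while leaving them where they are in the input, can be sketched in plain Python. This is not a translation of the data.table call, only an illustration of the logic; the function name add_columns and the dict-per-row layout are assumptions made for the example, and it compares full timestamps rather than dates.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # assumed eventDate format

def add_columns(rows):
    """rows: list of dicts with id, eventDate, direction. Adds three keys in place."""
    # Collect the input positions of every id, similar in spirit to by=id.
    positions = defaultdict(list)
    for pos, row in enumerate(rows):
        positions[row["id"]].append(pos)
    for group in positions.values():
        # Walk the group in date order without moving any row inside `rows`.
        group.sort(key=lambda pos: datetime.strptime(rows[pos]["eventDate"], FMT))
        for rank, pos in enumerate(group):
            row = rows[pos]
            row["idEventNumber"] = 1000 + rank
            if rank + 1 < len(group):
                nxt = rows[group[rank + 1]]
                gap = datetime.strptime(nxt["eventDate"], FMT) - datetime.strptime(row["eventDate"], FMT)
                row["nextEvent"] = nxt["direction"]
                row["daysUntilNextEvent"] = gap.days
            else:
                row["nextEvent"] = "noFurtherEvent"
                row["daysUntilNextEvent"] = -1
    return rows

sample = [
    {"id": "id1", "eventDate": "1977-06-26 00:00:00.000", "direction": "positive"},
    {"id": "id1", "eventDate": "1980-07-01 00:00:00.000", "direction": "positive"},
    {"id": "id1", "eventDate": "1980-05-01 00:00:00.000", "direction": "negative"},
    {"id": "id1", "eventDate": "1980-11-21 00:00:00.000", "direction": "positive"},
]
for row in add_columns(sample):
    print(row["eventDate"], row["idEventNumber"], row["nextEvent"], row["daysUntilNextEvent"])
```

For the four id1 rows this prints the numbers 1000, 1002, 1001, 1003 in input order, matching the first four rows of the table above.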
Answer 1 (score: 2)
You can do this with dplyr in R. If your data frame is called ana, you can try the following.
library(dplyr)
ana %>%
    mutate(group = cumsum(!duplicated(id)),
           eventDate = as.Date(eventDate, format = "%Y-%m-%d")) %>%
    arrange(id, eventDate) %>%
    group_by(group) %>%
    mutate(num = row_number() + 999,
           nextEvent = lead(direction, default = "noFurtherEvent"),
           daysUntilNextEvent = as.numeric(lead(eventDate) - eventDate),
           # the last row of each group has no successor, so its difference is NA
           daysUntilNextEvent = replace(daysUntilNextEvent, is.na(daysUntilNextEvent), -1))
# id event eventDate direction group num nextEvent daysUntilNextEvent
#1 id1 apple 1977-06-26 positive 1 1000 negative 1040
#2 id1 candy 1980-05-01 negative 1 1001 positive 61
#3 id1 apple 1980-07-01 positive 1 1002 positive 143
#4 id1 apple 1980-11-21 positive 1 1003 noFurtherEvent -1
#5 id2 fruit 1980-06-26 positive 2 1000 negative 3652
#6 id2 cookie 1990-06-26 negative 2 1001 negative 384
#7 id2 cavity 1991-07-15 negative 2 1002 positive 1
#8 id2 apple 1991-07-16 positive 2 1003 positive 2011
#9 id2 apple 1997-01-16 positive 2 1004 noFurtherEvent -1
#10 id3 cookie 2010-04-20 negative 3 1000 noFurtherEvent -1
#11 id4 cookie 2010-04-20 negative 4 1000 negative 0
#12 id4 cookie 2010-04-20 negative 4 1001 noFurtherEvent -1
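For readers coming from Python, what lead() with a default does within one group can be sketched as below. The Python helper named lead is hypothetical and written only for this illustration; it is not dplyr's implementation and handles only a shift of one position.

```python
def lead(values, default=None):
    """Return the values shifted one step forward, padding the end with `default`."""
    values = list(values)
    if not values:
        return []
    return values[1:] + [default]

directions = ["positive", "negative", "positive", "positive"]
print(lead(directions, default="noFurtherEvent"))
# ['negative', 'positive', 'positive', 'noFurtherEvent']
```

Applied to the direction values of one id in date order, the result lines up with the nextEvent column shown above for id1.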
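Answer 2 (score: 1)
Your output sample is not correctly based on your input sample: "id1,apple,1980-07-01" is "positive" in the input and "negative" in the output. With that in mind, here is an example in PowerShell: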
$sInFile = "infile.csv"
$sOutFile = "outfile.csv"
$cInTable = Import-Csv -Path $sInFile `
    | Sort-Object -Property @("id", "eventDate")
$cOutTable = $cInTable
$oIdCounters = New-Object PSObject
for ($i = 0; $i -lt $cInTable.Count; $i++) {
    if ([Int]$oIdCounters.($cInTable[$i].id) -lt 1000) {
        $oIdCounters | Add-Member -MemberType "NoteProperty" `
            -Name $cInTable[$i].id -Value 1000
    } else {
        $oIdCounters.($cInTable[$i].id) += 1
    }
    $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
        -Name "idEventNumber" -Value $oIdCounters.($cInTable[$i].id)
}
for ($i = $cInTable.Count - 1; $i -ge 0; $i--) {
    if ($cOutTable[$i].idEventNumber -eq $oIdCounters.($cInTable[$i].id)) {
        $sNextEvent = "noFurtherEvent"
        $iDaysUntilNextEvent = -1
    } else {
        $sNextEvent = $cInTable[$i+1].direction
        $iDaysUntilNextEvent = ([DateTime]$cInTable[$i+1].eventDate -`
            [DateTime]$cInTable[$i].eventDate).Days
    }
    $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
        -Name "nextEvent" -Value $sNextEvent
    $cOutTable[$i] | Add-Member -MemberType "NoteProperty" `
        -Name "daysUntilNextEvent" -Value $iDaysUntilNextEvent
}
$cOutTable | Export-Csv -Path $sOutFile -NoTypeInformation
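The two-pass trick used here, numbering rows with a per-id counter and then recognising the last row of an id because its number equals the counter's final value, can be sketched in Python roughly like this. The function name number_and_flag is invented for the example, and it works on a bare list of ids rather than on full rows.

```python
def number_and_flag(ids):
    """ids: one id value per row. Returns (event numbers, last-row flags)."""
    counters = {}
    numbers = []
    # Pass 1: the first row of an id gets 1000, each later row one more.
    for key in ids:
        counters[key] = counters.get(key, 999) + 1
        numbers.append(counters[key])
    # Pass 2: a row is the last of its id when its number equals the final counter.
    is_last = [number == counters[key] for key, number in zip(ids, numbers)]
    return numbers, is_last

numbers, is_last = number_and_flag(["id1", "id1", "id2", "id3", "id3", "id3"])
print(numbers)  # [1000, 1001, 1000, 1000, 1001, 1002]
print(is_last)  # [False, True, True, False, False, True]
```

Rows flagged as last would get noFurtherEvent and -1; all others can safely look at the row that follows them in the sorted table.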
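Answer 3 (score: 1)
I went in a slightly different direction. I store the last entry in a variable, then modify it and pass it along while processing the next entry, and finally follow up with the last entry after the ForEach loop.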
$Results = @()
$IDCount=1000
$LastLine = $false
Import-CSV $InPath | sort id,eventdate | ForEach{
    If($LastLine -and $LastLine.ID -eq $_.ID){
        Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
        Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue $_.Direction
        $Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue ([datetime]$_.EventDate - [datetime]$LastLine.EventDate|Select -Expand Days) -PassThru
        $IDCount++
    }ElseIf($LastLine){
        # Number the final row of the previous id before resetting the counter
        Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
        Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue 'NoFurtherEvent'
        $Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue '-1' -PassThru
        $IDCount=1000
    }
    $LastLine = $_
}
Add-Member -InputObject $LastLine -NotePropertyName 'IDEventNumber' -NotePropertyValue $IDCount
Add-Member -InputObject $LastLine -NotePropertyName 'nextEvent' -NotePropertyValue 'NoFurtherEvent'
$Results += Add-Member -InputObject $LastLine -NotePropertyName 'daysUntilNextEvent' -NotePropertyValue '-1' -PassThru
$Results | Export-CSV $OutPath -NoTypeInformation
The output is:
"id","event","eventDate","direction","IDEventNumber","nextEvent","daysUntilNextEvent"
"id1","apple","1977-06-26 00:00:00.000","positive","1000","negative","1040"
"id1","candy","1980-05-01 00:00:00.000","negative","1001","positive","61"
"id1","apple","1980-07-01 00:00:00.000","positive","1002","positive","143"
"id1","apple","1980-11-21 00:00:00.000","positive","1003","NoFurtherEvent","-1"
"id2","fruit","1980-06-26 00:00:00.000","positive","1000","negative","3652"
"id2","cookie","1990-06-26 00:00:00.000","negative","1001","negative","384"
"id2","cavity","1991-07-15 00:00:00.000","negative","1002","positive","1"
"id2","apple","1991-07-16 00:00:00.000","positive","1003","positive","2011"
"id2","apple","1997-01-16 00:00:00.000","positive","1004","NoFurtherEvent","-1"
"id3","cookie","2010-04-20 00:00:00.000","negative","1000","NoFurtherEvent","-1"
"id4","cookie","2010-04-20 00:00:00.000","negative","1000","negative","0"
"id4","cookie","2010-04-20 00:00:01.000","negative","1001","NoFurtherEvent","-1"
Answer 4 (score: 1)
Here is my solution in Python:
from datetime import datetime, timedelta
_data = '''id1,apple,1977-06-26 00:00:00.000,positive
id1,apple,1980-07-01 00:00:00.000,positive
id1,candy,1980-05-01 00:00:00.000,negative
id1,apple,1980-11-21 00:00:00.000,positive
id2,fruit,1980-06-26 00:00:00.000,positive
id2,cookie,1990-06-26 00:00:00.000,negative
id2,cavity,1991-07-15 00:00:00.000,negative
id2,apple,1991-07-16 00:00:00.000,positive
id2,apple,1997-01-16 00:00:00.000,positive
id3,cookie,2010-04-20 00:00:00.000,negative
id4,cookie,2010-04-20 00:00:00.000,negative
id4,cookie,2010-04-20 00:00:01.000,negative'''
I first create a dictionary with the ids as keys, where each value is the list of items for that id:
data = {}
for line in _data.split('\n'):
    fields = line.split(',')
    data.setdefault(fields[0], []).append(fields[1:])
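I then iterate over this dict in sorted() order to keep the ids in order. For each id, I build a new list whose elements are either a pair of consecutive rows or a single row. For each id, I initialize it_id to 1000 and increment it for every row printed for that id.
Then I iterate over this list. Depending on whether we have a pair or a single row, I either compute the delta or not.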
for item in sorted(data):
    it_id = 1000
    for sub in [data[item][i:i+2] for i in range(len(data[item]))]:
        if len(sub) == 2:
            delta = datetime.strptime(sub[1][1][:-4], '%Y-%m-%d %H:%M:%S') - datetime.strptime(sub[0][1][:-4], '%Y-%m-%d %H:%M:%S')
            print('%s,%s,%d,%s,%d' % (item, ','.join(sub[0]), it_id, sub[1][2], delta.days))
            it_id += 1
        else:
            print('%s,%s,%d,%s,%d' % (item, ','.join(sub[0]), it_id, 'noFurtherEvent', -1))
Output:
id1,apple,1977-06-26 00:00:00.000,positive,1000,positive,1101
id1,apple,1980-07-01 00:00:00.000,positive,1001,negative,-61
id1,candy,1980-05-01 00:00:00.000,negative,1002,positive,204
id1,apple,1980-11-21 00:00:00.000,positive,1003,noFurtherEvent,-1
id2,fruit,1980-06-26 00:00:00.000,positive,1000,negative,3652
id2,cookie,1990-06-26 00:00:00.000,negative,1001,negative,384
id2,cavity,1991-07-15 00:00:00.000,negative,1002,positive,1
id2,apple,1991-07-16 00:00:00.000,positive,1003,positive,2011
id2,apple,1997-01-16 00:00:00.000,positive,1004,noFurtherEvent,-1
id3,cookie,2010-04-20 00:00:00.000,negative,1000,noFurtherEvent,-1
id4,cookie,2010-04-20 00:00:00.000,negative,1000,negative,0
id4,cookie,2010-04-20 00:00:01.000,negative,1001,noFurtherEvent,-1
As another post suggested, your sample output may be off where the deltas are concerned.
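The slicing expression that builds the pairs is easier to follow in isolation. The sketch below shows what it yields, and then, as a possible variation rather than part of the solution above, sorts one id's rows by parsed date before pairing so that no delta comes out negative (the -61 in the output above arises because rows are paired in file order). The helper name windows and the constant FMT are made up for the example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # assumed eventDate format

def windows(items):
    """Overlapping slices of length two, ending with a single-item slice."""
    return [items[i:i + 2] for i in range(len(items))]

print(windows(["a", "b", "c"]))
# [['a', 'b'], ['b', 'c'], ['c']]

# The id1 rows in file order, laid out like the lists stored in `data`.
id1 = [
    ["apple", "1977-06-26 00:00:00.000", "positive"],
    ["apple", "1980-07-01 00:00:00.000", "positive"],
    ["candy", "1980-05-01 00:00:00.000", "negative"],
    ["apple", "1980-11-21 00:00:00.000", "positive"],
]
# Sort by the parsed date first, then compute deltas only for real pairs.
id1.sort(key=lambda fields: datetime.strptime(fields[1], FMT))
deltas = [
    (datetime.strptime(pair[1][1], FMT) - datetime.strptime(pair[0][1], FMT)).days
    for pair in windows(id1)
    if len(pair) == 2
]
print(deltas)
# [1040, 61, 143]
```

Note that sorting changes the order in which rows are printed, so it reproduces the date-ordered numbers but not the original row order of source.csv.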