我有一个非常大的数据集,在
中看起来像这样temp_dataframe['PPI'].str.replace('PPI/','')
现在数据集中的总行数约为5468389,这对于excel来说非常大,可以导入一列中的所有内容,因此我尝试处理部分数据。
还有其他方法吗?我可以通过它处理所有数据?
我尝试直接读取和写入文本,但每当我尝试将其作为日期读取时,由于格式的原因,它会给我 Column A
Date
2016-02-29 15:59:59.674
2016-02-29 15:59:59.695
2016-02-29 15:59:59.716
2016-02-29 15:59:59.752
2016-02-29 15:59:59.804
2016-02-29 15:59:59.869
2016-02-29 15:59:59.888
2016-02-29 15:59:59.941
2016-02-29 16:00:00.081 <-- get closest date since .081 < .941
2016-02-29 16:00:00.168
2016-02-29 16:00:00.189
2016-02-29 16:00:00.198
2016-02-29 16:00:00.247
2016-02-29 16:00:00.311
2016-02-29 16:00:00.345
2016-02-29 16:00:00.357
and for the other half an hour
2016-02-29 16:29:58.628
2016-02-29 16:29:58.639
2016-02-29 16:29:58.689
2016-02-29 16:29:58.706
2016-02-29 16:29:58.761
2016-02-29 16:29:58.865
2016-02-29 16:29:59.142
2016-02-29 16:29:59.542
2016-02-29 16:29:59.578
2016-02-29 16:30:00.171 <-- Get this date since .171 < .578
2016-02-29 16:30:00.209
2016-02-29 16:30:00.217
2016-02-29 16:30:00.245
2016-02-29 16:30:00.254
2016-02-29 16:30:00.347
2016-02-29 16:30:00.422
2016-02-29 16:30:00.457
2016-02-29 16:30:00.491
2016-02-29 16:30:00.555
2016-02-29 16:30:00.557
2016-02-29 16:30:00.645
错误。出于同样的原因,我没有使用python来解决这个问题,因为我也不熟悉python,所以我想在Excel VBA中这样做。
我也不太确定这个逻辑,所以我需要一些帮助。
Type Mismatch
此外,我不确定如何从避免计算两个日期的差异的日期获得毫秒部分
如15:59:59.674我如何才能获得Option Explicit
Sub Get_Closest_Dates()
Application.ScreenUpdating = False
Dim WI As Worksheet, WO As Worksheet
Dim i As Long, ct As Long
Dim num1 As Integer, num2 As Integer, num3 As Integer
Dim df1, df2
Set WI = Sheet1 'INPUT SHEET
Set WO = Sheet2 'OUTPUT SHEET
WI.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WO.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WI.Range("B1") = "HOUR"
WI.Range("C1") = "MINUTE"
With WI
.Range("B2").Formula = "=HOUR(A2)"
.Range("B2:B" & Rows.Count).FillDown
.Range("C2").Formula = "=MINUTE(A2)"
.Range("C2:C" & Rows.Count).FillDown
ct = WO.Range("A" & Rows.Count).End(xlUp).Row + 1
For i = 2 To 10000
num1 = .Range("C" & i).Value 'get Minutes
num2 = .Range("C" & i + 1).Value
If (num1 = 29 And num2 = 30) Then
df1 = 0.5 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 0.5
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
If (num1 = 59 And num2 = 0) Then
df1 = 1 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 1
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
Next i
End With
Application.ScreenUpdating = True
MsgBox "Process Completed"
End Sub
?
答案 0 :(得分:1)
似乎您的第一个问题是将数据导入Excel。了解Excel可能不是处理如此大量数据的最佳程序(诸如Access之类的DB程序可能更好),您需要在多个列或工作表之间拆分数据;或者采集数据样本。
您选择了样本,因此我会在您阅读数据时进行抽样和测试。
您还必须在处理包含毫秒的日期/时间戳时处理Excel / VBA限制。
但是为了测试数据,没有必要关注毫秒。只要您的数据按升序排列,那么具有等于或高于30分钟增量的日期/时间戳的第一行将是最早的。
下面的代码应该只读取符合该条件的大文件行。请阅读评论以获取更多信息。
将线条收集到一个集合中;然后声明,填充结果数组,并将结果写入工作表。
如果每一行包含多个字段,而不仅仅是您显示的单行,那么在编写结果时,您将声明结果数组以保存所有列,并在此时填充它。
使用Collection / Array / write到工作表序列比在处理工作表时每行编写一行要快得多。
有加速代码的方法,以及处理可能的&#34;内存不足的方法&#34;错误,但这取决于您的真实数据以及使用这个简单代码的方式。
到目前为止转换我们需要的日期/时间戳,将Excel解释为字符串,转换为&#34; real&#34;日期/时间,取决于您对后续数据的处理方式。
==========================================
Option Explicit
'Set Reference to Microsoft Scripting Runtime
Sub GetBigData()
Dim FSO As FileSystemObject
Dim TS As TextStream
Dim vFileName As Variant
Dim sLine As String
Dim dtLineTime As Date
Dim dtNextTime As Date
Dim colLines As Collection
vFileName = Application.GetOpenFilename("Text Files(*.txt), *.txt")
If vFileName = False Then Exit Sub
Set FSO = New FileSystemObject
Set TS = FSO.OpenTextFile(vFileName, ForReading, False, TristateFalse)
Set colLines = New Collection
With TS
'Assumes date/time stamps are contiguous
'skip any header lines
Do
sLine = .ReadLine
Loop Until InStr(sLine, ".") > 0
'Compute first "NextTime"
' note that it might be the first entry
' comment line 3 below if want first entry
' but would need to add logic if using other time increments
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
dtNextTime = Int(dtLineTime) + TimeSerial(Hour(dtLineTime), Int(Minute(dtLineTime) / 30) * 30, 0)
If Not (Minute(dtLineTime) = 30 Or Minute(dtLineTime) = 60) Then dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
Do
'Due to IEEE rounding problems, need to test equality as a very small value
'Could use a value less than 1 second = 1/86400 or smaller
If Abs(dtLineTime - dtNextTime) < 0.00000001 Or _
dtLineTime > dtNextTime Then
colLines.Add sLine
dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
End If
If Not .AtEndOfStream Then
sLine = .ReadLine
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
End If
Loop Until .AtEndOfStream
.Close
End With
'Write the collection to the worksheet
Dim V As Variant
Dim wsResults As Worksheet, rResults As Range
Dim I As Long
Set wsResults = Worksheets("sheet1")
Set rResults = wsResults.Cells(1, 1)
ReDim V(1 To colLines.Count, 1 To 1)
Set rResults = rResults.Resize(UBound(V, 1), UBound(V, 2))
For I = 1 To UBound(V, 1)
V(I, 1) = CStr(colLines(I))
Next I
With rResults
.EntireColumn.Clear
.NumberFormat = "@"
.Value = V
.EntireColumn.AutoFit
End With
End Sub
==========================================
编辑添加了时间戳转换功能。 这可以在将数据从集合对象复制到变体数组的点处实现。 EG:
V(I, 1) = ConvertTimeStamp(colLines(I))
由于收到的值是Double数据类型,您还需要在工作表上正确格式化该列,而不是将其作为Text:
.NumberFormat = "yyyy-mm-dd hh:mm:ss.000"
我们必须将值作为Double返回,因为VBA Date类型数据不支持毫秒。
==============================
Private Function ConvertTimeStamp(sTmStmp As String) As Double
Dim dtPart As Date
Dim dMS As Double 'milliseconds
Dim V As Variant
'Convert the date and time
V = Split(sTmStmp, ".")
dtPart = CDate(V(0))
dMS = V(1)
ConvertTimeStamp = dtPart + dMS / 86400 / 1000
End Function
==============================
答案 1 :(得分:0)
如果您反转排序顺序,则可以使用匹配功能查找列表中的条目索引,该条目的索引大于(仅在特定时间之后)。 类似的东西:
= MATCH(HalfHourValue,RangeContainingTimes,-1)
你必须扭转秩序;并且它为您提供索引而不是实际值。
要获得刚刚找到的条目值的毫秒数,可以使用以下内容:
= RIGHT(TEXT(INDEX(RangeContainingTimes,IxFromAbove,1), “HH:MM:ss.000”),3)