Question

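I have a large CSV file (1.1 GB) exported from SQL Server that I want to preprocess in Python, but I am running into problems. The datetime values in the original CSV file look like 00:07.5, 00:08.3, 00:48.7, so I had to convert them manually in Excel by formatting the whole column as d/m/yy h:mm:ss, which gives 1/12/2015 12:00:07 am, 1/12/2015 12:00:08 am, 1/12/2015 12:00:49 am. However, I noticed that the file size then shrank from 1.1 GB to 36.6 MB. I also got this notification from Excel: Possible Data Lost: Some features might be lost if you save this notebook in the comma-delimited(.csv) format. To preserve these features, save it in an Excel file format.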

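I tried saving the file in both csv and xlsx format, but when I read the saved files in Python and check their shape, rows are missing from the dataframe in both cases: (26137666, 4) (original csv file) versus (1048575, 4) (xlsx file).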
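The second shape is consistent with Excel's worksheet limit of 1,048,576 rows (Excel 2007 and later): assuming the first row is a header, that leaves 1,048,575 data rows, and everything beyond that is dropped when the file is saved from Excel. A quick arithmetic check using the figures above:

```python
EXCEL_MAX_ROWS = 1_048_576   # worksheet row limit in Excel 2007 and later
original_rows = 26_137_666   # data rows reported for the original csv file

# Assumption: the first worksheet row holds the header
data_rows_kept = EXCEL_MAX_ROWS - 1
rows_lost = original_rows - data_rows_kept

print(data_rows_kept, rows_lost)
```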

The datetime data in my original Excel file looks similar to this.

My questions are:

How can I prevent the data loss?
Is it possible to convert the datetime column format in Python?
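On the second question, here is a minimal sketch of doing the conversion in Python without opening the file in Excel, using only the standard library so that rows are streamed and none are cut off. It assumes the raw file really contains mm:ss.f strings such as 00:07.5 and that 1/12/2015 means 1 December 2015. The column names and sample rows are hypothetical. Excel may only be displaying full timestamps in a shortened format, so it is worth looking at the raw file in a text editor first.

```python
import csv
import io
from datetime import datetime, timedelta

# Assumption: 1/12/2015 is read as d/m/yyyy, i.e. 1 December 2015
BASE = datetime(2015, 12, 1)

def to_datetime(value, base=BASE):
    """Turn an 'mm:ss.f' string such as '00:07.5' into a datetime on the base date."""
    minutes, seconds = value.split(":")
    return base + timedelta(minutes=int(minutes), seconds=float(seconds))

def convert(src, dst, column=0):
    """Stream rows from src to dst, rewriting one column; the header row is copied unchanged."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))
    for row in reader:
        stamp = to_datetime(row[column])
        row[column] = stamp.isoformat(sep=" ", timespec="milliseconds")
        writer.writerow(row)

# Hypothetical sample standing in for the real export
src = io.StringIO("ts,value\n00:07.5,1\n00:08.3,2\n00:48.7,3\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

For the real file, `src` and `dst` would be file objects opened with `newline=""`. Because only one row is held in memory at a time, the 1.1 GB size is not a problem.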

Answer 1

This approach works from Excel, not Python.

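I came across this while researching the topic. There is a way to copy all of this data into an Excel worksheet. (I previously had this problem with a 50-million-row CSV file.) If any formatting is needed, additional code can be added. Try this.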

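Sub ReadCSVFiles()

' Reads a text/CSV file line by line into the first worksheet.
' When a column reaches Excel's row limit, reading continues in the next column.
Const MAX_ROWS As Long = 1048576

Dim i As Long, j As Long
Dim UserFileName As Variant   ' GetOpenFilename returns False on cancel
Dim strTextLine As String
Dim iFile As Integer: iFile = FreeFile

UserFileName = Application.GetOpenFilename
If UserFileName = False Then Exit Sub   ' user cancelled the dialog

Open UserFileName For Input As #iFile
i = 1
j = 1

Do Until EOF(iFile)
    Line Input #iFile, strTextLine
    ' Move to the next column before writing, so that no line is dropped
    If i > MAX_ROWS Then
        i = 1
        j = j + 1
    End If
    Sheets(1).Cells(i, j).Value = strTextLine
    i = i + 1
Loop
Close #iFile
End Sub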

Afterwards, you will have all the data in a single, very heavy file, so just split it up.
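If the splitting is done on the CSV itself rather than inside Excel, it can be sketched in Python as below. This is only an illustration, not part of the macro above: the function names, output file naming and sample data are hypothetical, and it assumes the first row is a header that should be repeated in every part.

```python
import csv
import io
from itertools import islice

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in Excel 2007 and later

def split_rows(reader, rows_per_part):
    """Yield (header, rows) pairs, each holding at most rows_per_part data rows."""
    header = next(reader)
    while True:
        rows = list(islice(reader, rows_per_part))
        if not rows:
            break
        yield header, rows

def write_parts(path, prefix, rows_per_part=EXCEL_MAX_ROWS - 1):
    """Write prefix_1.csv, prefix_2.csv, ... so each part fits in one worksheet."""
    with open(path, newline="") as src:
        parts = split_rows(csv.reader(src), rows_per_part)
        for n, (header, rows) in enumerate(parts, start=1):
            with open(f"{prefix}_{n}.csv", "w", newline="") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(rows)

# Hypothetical 5-row sample; a tiny part size keeps the demonstration readable
sample = io.StringIO("ts,value\n" + "".join(f"00:0{i}.0,{i}\n" for i in range(5)))
parts = list(split_rows(csv.reader(sample), rows_per_part=2))
print([len(rows) for _, rows in parts])
```

The default part size leaves one row free for the header, so each output file can be opened in Excel without being truncated.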

CSV file imported from SQL Server loses data after formatting the datetime
