I want to load a large .csv (3.4M rows, 206k users) from the open-source InstaCart dataset: https://www.instacart.com/datasets/grocery-shopping-2017
Basically, I can't load orders.csv into a pandas DataFrame. I'd like to learn best practices for loading large files into pandas/Python.
Answer 0 (score: 3)
The best option is to read the data in chunks instead of loading the whole file into memory.
Fortunately, the read_csv method accepts a chunksize argument:

for chunk in pd.read_csv('file.csv', chunksize=somesize):
    process(chunk)

Note: by specifying chunksize to read_csv or read_table, the return value will be an iterable object, so you can loop over the file without ever holding all of it in memory.

See also: the pandas documentation on iterating through files chunk by chunk.
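As a self-contained sketch of the pattern above (the tiny in-memory CSV stands in for the multi-million-row orders.csv and is made up for illustration), each iteration yields an ordinary DataFrame you can process and discard:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the large orders.csv (illustrative data only).
raw = "order_id,user_id\n" + "\n".join(f"{i},{i % 5}" for i in range(10))

total_rows = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total_rows += len(chunk)  # replace with your real per-chunk processing

print(total_rows)  # → 10
```

Because only one chunk is resident at a time, peak memory is bounded by the chunk size rather than the file size.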
Answer 1 (score: 0)
dask is very useful when you have a large DataFrame that may not fit in memory. The main page I linked to has examples of how to create a dask DataFrame, which has the same API as pandas but can be distributed.
Answer 2 (score: 0)
Depending on your machine, you may be able to read everything into memory by specifying the data types while reading the csv file. When pandas reads a csv, the default data types it infers may not be the best. With dtype
you can specify the data types yourself, which reduces the size of the DataFrame read into memory.
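A small sketch of the dtype idea (the column names mirror orders.csv, but the sample values here are made up): narrower integer types can cut memory use compared with the inferred int64 defaults.

```python
import io
import numpy as np
import pandas as pd

raw = "order_id,user_id,order_number\n1,100,1\n2,100,2\n3,200,1\n"

# Default inference: every integer column becomes int64.
df_default = pd.read_csv(io.StringIO(raw))

# Explicit narrower dtypes for the same data.
df_small = pd.read_csv(
    io.StringIO(raw),
    dtype={"order_id": np.int32, "user_id": np.int32, "order_number": np.int8},
)

saved = df_default.memory_usage(deep=True).sum() - df_small.memory_usage(deep=True).sum()
print(saved > 0)  # → True
```

Pick each dtype from the column's actual value range (e.g. an order_number that never exceeds a few hundred fits in int8 or int16); a dtype that is too narrow will overflow or raise on read.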