I want to load a large .csv (3.4M rows, 206k users) from the open-source InstaCart dataset: https://www.instacart.com/datasets/grocery-shopping-2017
Basically, I can't load orders.csv into a pandas DataFrame. I'd like to learn best practices for loading large files into pandas/Python.
Answer 0 (score: 3)
The best option is to read the data in chunks instead of loading the whole file into memory.
Fortunately, the read_csv method accepts a chunksize argument:

for chunk in pd.read_csv('file.csv', chunksize=somesize):
    process(chunk)

Note: by specifying chunksize to read_csv or read_table, the return value will be an iterable object, so you can loop over the file without ever holding all of it in memory.

See also: the pandas documentation on iterating through files chunk by chunk.
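As a self-contained sketch of the pattern above (the tiny in-memory CSV stands in for the multi-million-row orders.csv and is made up for illustration), each iteration yields an ordinary DataFrame you can process and discard:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the large orders.csv (illustrative data only).
raw = "order_id,user_id\n" + "\n".join(f"{i},{i % 5}" for i in range(10))

total_rows = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total_rows += len(chunk)  # replace with your real per-chunk processing

print(total_rows)  # → 10
```

Because only one chunk is resident at a time, peak memory is bounded by the chunk size rather than the file size.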
Answer 1 (score: 0)
dask is very useful when you have a large DataFrame that may not fit in memory. The main page I linked to has examples of how to create a dask DataFrame, which has the same API as pandas but can be distributed.
Answer 2 (score: 0)
Depending on your machine, you may be able to read everything into memory by specifying the data types while reading the csv file. When pandas reads a csv, the default data types it infers may not be the best. With dtype
you can specify the data types yourself, which reduces the size of the DataFrame read into memory.
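A small sketch of the dtype idea (the column names mirror orders.csv, but the sample values here are made up): narrower integer types can cut memory use compared with the inferred int64 defaults.

```python
import io
import numpy as np
import pandas as pd

raw = "order_id,user_id,order_number\n1,100,1\n2,100,2\n3,200,1\n"

# Default inference: every integer column becomes int64.
df_default = pd.read_csv(io.StringIO(raw))

# Explicit narrower dtypes for the same data.
df_small = pd.read_csv(
    io.StringIO(raw),
    dtype={"order_id": np.int32, "user_id": np.int32, "order_number": np.int8},
)

saved = df_default.memory_usage(deep=True).sum() - df_small.memory_usage(deep=True).sum()
print(saved > 0)  # → True
```

Pick each dtype from the column's actual value range (e.g. an order_number that never exceeds a few hundred fits in int8 or int16); a dtype that is too narrow will overflow or raise on read.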