Splitting a large Pandas DataFrame with minimal memory footprint

Date: 2016-06-26 14:53:28

Tags: python pandas dataframe

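I have a large DataFrame that I want to split into a test set and a training set for model building. However, I do not want to duplicate the DataFrame, because I am hitting a memory limit.

Is there an operation similar to pop, but for a large segment, that removes part of the DataFrame and at the same time lets me assign it to a new DataFrame? Something like this: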

# Assume I have initialized a DataFrame (called "all") which contains my large dataset, 
# with a boolean column called "test" which indicates whether a record should be used for
# testing.
print(len(all))
# 10000000 
test = all.pop_large_segment(all['test']) # not a real command, just a placeholder
print(len(all))
# 8000000
print(len(test))
# 2000000

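3 answers:

Answer 0 (score: 3)

If you have room to add one more column, you can add one holding a random value and then filter on it to get your test set. Here I draw random integers that are either 0 or 1, which gives a roughly 50/50 split, but you can use any distribution if you want a different proportion.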

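import numpy as np
import pandas as pd

df = pd.DataFrame({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
# 0 or 1 for each row; filter on this column to pick the test rows
df['split'] = np.random.randint(0, 2, size=len(df))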

Of course, this requires that you have room to add an entirely new column. If your data is very long, you may not.
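To make the filtering step concrete, here is a minimal sketch. The 20% test share, the seeded generator, and the names test_df and train_df are illustrative assumptions, not part of the answer. Note that selecting with a boolean mask returns copies, so both parts exist alongside the original until the original is deleted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4, 5, 4, 3, 2, 1],
                   'two': [6, 7, 8, 9, 10, 9, 8, 7, 6]})

# Assumption: mark roughly 20% of the rows as test rows.
rng = np.random.default_rng(0)
df['split'] = rng.random(len(df)) < 0.2

# Boolean-mask selection; each result is a new DataFrame (a copy).
test_df = df[df['split']]
train_df = df[~df['split']]
```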

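Another option works if, for example, your data is in csv format and you know the number of rows. Do something similar to the above with random integers, but pass that list into the skiprows argument of Pandas read_csv():

import numpy as np
import pandas as pd

num_rows = 100000
all_rows = range(num_rows)

# Row numbers to skip when loading the training half
some = np.random.choice(all_rows, replace=False, size=num_rows // 2)
some.sort()
trainer_df = pd.read_csv(path, skiprows=some)  # path: location of your csv file

# Skip the complementary rows to load the other half
some_set = set(some)  # set membership keeps the comprehension fast
rest = [i for i in all_rows if i not in some_set]
df = pd.read_csv(path, skiprows=rest)

The front end of this is a bit clunky, especially the loop in the list comprehension, and creating those lists in memory is unfortunate, but it should still use less memory than making a whole copy of half the data.

To make it even more memory friendly, you could load the trainer subset, train the model, then overwrite the training dataframe with the rest of the data and apply the model. You will be stuck carrying some around, but you never have to load both halves of the data at the same time.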
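A self-contained variant of the skiprows idea might look like the sketch below. It assumes the csv file has a header line (so line 0 is never skipped), writes a small throwaway file so it can be run as is, and uses np.setdiff1d instead of a list comprehension to build the complementary list. The file name, column names, and seed are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo file; in practice `path` points at your own csv.
num_rows = 1000
path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"id": np.arange(num_rows),
              "x": np.random.rand(num_rows)}).to_csv(path, index=False)

# File line 0 is the header, so the data rows are lines 1..num_rows.
data_lines = np.arange(1, num_rows + 1)

rng = np.random.default_rng(42)
skip_for_train = np.sort(rng.choice(data_lines, size=num_rows // 2, replace=False))

# Load one half by skipping the sampled lines.
train_df = pd.read_csv(path, skiprows=skip_for_train.tolist())

# Load the other half by skipping the complementary lines.
skip_for_rest = np.setdiff1d(data_lines, skip_for_train)
rest_df = pd.read_csv(path, skiprows=skip_for_rest.tolist())
```

Each call reads the whole file but only materializes half of the rows, so the two halves never need to be in memory together if train_df is released before rest_df is loaded.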

Answer 1 (score: 1)

I would do something similar to @jeff-l, i.e. keep your data frame on file. When you read it in as a csv, use the chunksize keyword, so that only one piece of the file is in memory at a time.

[The script that accompanied this answer is missing from the source.]
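A minimal sketch of reading with chunksize, under some stated assumptions: the csv has a boolean column named test (as in the question), and each chunk is routed to one of two output files whose names are invented here. This is an illustration of the keyword, not the answerer's script.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo file with a boolean "test" column.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.csv")
n = 1000
rng = np.random.default_rng(0)
pd.DataFrame({"x": rng.random(n),
              "test": rng.random(n) < 0.2}).to_csv(path, index=False)

train_path = os.path.join(tmp_dir, "train.csv")
test_path = os.path.join(tmp_dir, "test.csv")

# With chunksize, read_csv yields DataFrames of at most 100 rows each,
# so only one chunk is held in memory at a time.
for i, chunk in enumerate(pd.read_csv(path, chunksize=100)):
    is_test = chunk["test"]
    write_header = (i == 0)  # write the header only once per output file
    chunk[~is_test].to_csv(train_path, mode="a", header=write_header, index=False)
    chunk[is_test].to_csv(test_path, mode="a", header=write_header, index=False)
```

Either output file can then be loaded on its own when it is needed.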

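Answer 2 (score: 1)

Since the other answers focus on reading from a file, I guess there is also something you can do if, for whatever reason, your DataFrame is not read from a file.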

Maybe you can look at the code of the DataFrame.drop method and modify it so that it modifies your DataFrame in place (which the drop method already does) and also returns the rows that were removed:

import pandas as pd
import pandas.core.common as com

# Adapted from the pandas source of DataFrame.drop. It relies on private
# pandas internals (_get_axis_number, _update_inplace,
# com._index_labels_to_array, ...) that may differ between pandas versions.
# The original used .ix, which was removed in pandas 1.0, so .loc is used here.
class DF(pd.DataFrame):
    def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
        axis = self._get_axis_number(axis)
        axis_name = self._get_axis_name(axis)
        axis, axis_ = self._get_axis(axis), axis

        if axis.is_unique:
            if level is not None:
                if not isinstance(axis, pd.MultiIndex):
                    raise AssertionError('axis must be a MultiIndex')
                new_axis = axis.drop(labels, level=level, errors=errors)
            else:
                new_axis = axis.drop(labels, errors=errors)
            dropped = self.reindex(**{axis_name: new_axis})
            try:
                dropped.axes[axis_].set_names(axis.names, inplace=True)
            except AttributeError:
                pass
            result = dropped
        else:
            labels = com._index_labels_to_array(labels)
            if level is not None:
                if not isinstance(axis, pd.MultiIndex):
                    raise AssertionError('axis must be a MultiIndex')
                indexer = ~axis.get_level_values(level).isin(labels)
            else:
                indexer = ~axis.isin(labels)
            slicer = [slice(None)] * self.ndim
            slicer[self._get_axis_number(axis_name)] = indexer
            result = self.loc[tuple(slicer)]

        if inplace:
            # Keep the removed rows, shrink self in place, return what was removed
            dropped = self.loc[labels]
            self._update_inplace(result)
            return dropped
        else:
            # Return both parts: the remaining rows and the removed rows
            return result, self.loc[labels]

It would work like this:

df = DF({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})

dropped = df.drop(range(5), inplace=True)
# or:
# partA, partB = df.drop(range(5))

This example is probably not really memory efficient, but maybe you can find something better by using this kind of object-oriented solution.
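As an alternative that avoids overriding pandas internals, a pop-like helper can be sketched with public calls only. The function name pop_rows is hypothetical, and the sketch assumes the DataFrame has a unique index. It is not guaranteed to lower peak memory, because the removed rows are copied out before drop runs and drop may allocate internally.

```python
import pandas as pd


def pop_rows(df, mask):
    """Remove the rows where `mask` is True from `df` in place and return them.

    Assumes `df` has a unique index, since rows are dropped by index label.
    """
    popped = df.loc[mask]                      # copy of the selected rows
    df.drop(index=popped.index, inplace=True)  # shrink the original in place
    return popped


# Small stand-in for the question's DataFrame with a boolean "test" column.
all_df = pd.DataFrame({'value': range(10), 'test': [True, False] * 5})
test_df = pop_rows(all_df, all_df['test'])
print(len(all_df), len(test_df))
```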