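I have a large DataFrame that I want to split into a test set and a training set for model building. However, I don't want to duplicate the DataFrame, because I am hitting a memory limit.
Is there an operation similar to pop, but for a large segment, that removes part of the DataFrame and at the same time lets me assign it to a new DataFrame? Something like this: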
# Assume I have initialized a DataFrame (called "all") which contains my large dataset,
# with a boolean column called "test" which indicates whether a record should be used for
# testing.
print(len(all))
# 10000000
test = all.pop_large_segment(all[test]) # not a real command, just a placeholder
print(len(all))
# 8000000
print(len(test))
# 2000000
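Answer 0 (score: 3)
If you have room to add one more column, you can fill it with a random value and then filter on it to get your test set. Here I draw 0 or 1 uniformly, which gives roughly a 50/50 split, but you can use any distribution if you want a different proportion.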
import numpy as np
import pandas as pd
df = pd.DataFrame({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
df['split'] = np.random.randint(0, 2, size=len(df))
Of course, this requires that you have room to add an entirely new column; if your data is very long, you may not.
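As a rough sketch of the filtering step, the split column can hold a uniform float and be compared against a threshold. The 0.2 test fraction, the seed and the names test_df and train_df are my own assumptions for illustration, not part of the answer above. Note that boolean filtering still materializes each subset as a new object, so this shows the mechanics rather than a guaranteed memory saving.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4, 5, 4, 3, 2, 1],
    "two": [6, 7, 8, 9, 10, 9, 8, 7, 6],
})

rng = np.random.default_rng(42)  # seed chosen arbitrarily for repeatability
test_fraction = 0.2              # hypothetical ratio, adjust as needed

# One float per row, uniform on [0, 1); rows below the threshold go to the test set.
df["split"] = rng.random(len(df))
test_mask = df["split"] < test_fraction

test_df = df[test_mask]
train_df = df[~test_mask]
```

With a threshold like this the split is only approximately 20/80 on small data; the proportion tightens as the number of rows grows.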
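Another option works if, for example, your data is in CSV format and you know the number of rows. Do something similar to the above with random integers, but pass the resulting list to the skiprows argument of pandas read_csv():

import numpy as np
import pandas as pd

num_rows = 100000
# Line 0 of the file is assumed to be the header, so the data rows are lines 1..num_rows.
all_rows = np.arange(1, num_rows + 1)
some = np.sort(np.random.choice(all_rows, replace=False, size=num_rows // 2))
# path is the location of your CSV file.
trainer_df = pd.read_csv(path, skiprows=some)
# Complement of some; np.setdiff1d returns it already sorted.
rest = np.setdiff1d(all_rows, some)
df = pd.read_csv(path, skiprows=rest)

It's a bit clunky up front, especially building the complementary list, and creating these lists in memory is unfortunate, but it should still be much lighter on memory than making a full copy of half the data.

To make it even more memory-friendly, you could load the trainer subset, train the model, then overwrite the training DataFrame with the rest of the data and apply the model. You'll be stuck carrying some and rest around, but you never have to load both halves of the data at the same time.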
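If carrying two row lists around is a concern, read_csv also accepts a callable for skiprows, so a single set of test line numbers can serve both passes. The sketch below is my own variation, not the answerer's code: the in-memory csv_text stands in for a real file, and it assumes the file has one header line.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large CSV on disk: one header line plus 10 data rows.
num_rows = 10
csv_text = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(num_rows))

rng = np.random.default_rng(0)
# File line numbers of the rows reserved for testing (line 0 is the header).
test_lines = set(
    rng.choice(np.arange(1, num_rows + 1), size=num_rows // 5, replace=False).tolist()
)

# First pass: skip the test lines, leaving the training rows.
train_df = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i in test_lines)

# Second pass: keep the header and skip every data line that is not a test line.
test_df = pd.read_csv(
    io.StringIO(csv_text), skiprows=lambda i: i > 0 and i not in test_lines
)
```

In practice the second read would happen after the model is trained and train_df has been released, so only one part is in memory at a time.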
Answer 1 (score: 1)
I would do something similar to @jeff-l, i.e. keep your DataFrame saved in a file. When you read it back as CSV, use the chunksize keyword of read_csv(), which returns the file in pieces of that many rows instead of loading it all at once.
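A minimal sketch of what reading with chunksize could look like is given here; it is not the answerer's script. It assumes the file contains the boolean test column described in the question; csv_text, load_subset and the chunk size of 4 are hypothetical names and values chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the saved file: 10 rows, 2 of them flagged for testing.
csv_text = "value,test\n" + "\n".join(f"{i},{i % 5 == 0}" for i in range(10))


def load_subset(text, want_test, chunksize=4):
    """Read the CSV in chunks and keep only rows whose 'test' flag matches."""
    reader = pd.read_csv(io.StringIO(text), chunksize=chunksize)
    kept = [chunk[chunk["test"] == want_test] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


train_df = load_subset(csv_text, want_test=False)
# ... fit the model on train_df here, then release it before loading the rest ...
test_df = load_subset(csv_text, want_test=True)
```

Only one chunk plus the rows kept so far are held at any time during a pass, so the full file is never loaded alongside a copy of itself.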
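Answer 2 (score: 1)
Since the other answers focus more on reading from a file, I guess you could also do something if, for whatever reason, your DataFrame isn't read from a file.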
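Maybe you could look at the code of the DataFrame.drop method and modify it so that it both modifies your DataFrame in place (which the drop method already does) and returns the rows that were removed:

import pandas as pd
import pandas.core.common as com  # private pandas module used by the original drop() source


class DF(pd.DataFrame):
    # Adapted from the pandas source of DataFrame.drop. It relies on pandas
    # internals (_get_axis*, _update_inplace, the com helper), so it is tied
    # to the pandas version it was copied from.
    def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):
        axis = self._get_axis_number(axis)
        axis_name = self._get_axis_name(axis)
        axis, axis_ = self._get_axis(axis), axis

        if axis.is_unique:
            if level is not None:
                if not isinstance(axis, pd.MultiIndex):
                    raise AssertionError('axis must be a MultiIndex')
                new_axis = axis.drop(labels, level=level, errors=errors)
            else:
                new_axis = axis.drop(labels, errors=errors)
            dropped = self.reindex(**{axis_name: new_axis})
            try:
                dropped.axes[axis_].set_names(axis.names, inplace=True)
            except AttributeError:
                pass
            result = dropped
        else:
            # Private helper; its name has varied across pandas versions.
            labels = com._index_labels_to_array(labels)
            if level is not None:
                if not isinstance(axis, pd.MultiIndex):
                    raise AssertionError('axis must be a MultiIndex')
                indexer = ~axis.get_level_values(level).isin(labels)
            else:
                indexer = ~axis.isin(labels)
            slicer = [slice(None)] * self.ndim
            slicer[self._get_axis_number(axis_name)] = indexer
            result = self.loc[tuple(slicer)]

        if inplace:
            # Grab the rows being removed before the frame is updated in place.
            dropped = self.loc[labels]
            self._update_inplace(result)
            return dropped
        else:
            # Return both parts: the rows that remain and the rows that were dropped.
            return result, self.loc[labels]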
It would be used something like this:

df = DF({'one':[1,2,3,4,5,4,3,2,1], 'two':[6,7,8,9,10,9,8,7,6], 'three':[11,12,13,14,15,14,13,12,11]})
# Removes the first five rows from df in place and returns them.
dropped = df.drop(range(5), inplace=True)
# or, leaving df untouched and getting both parts back:
# partA, partB = df.drop(range(5))
This example is probably not really memory efficient, but maybe you can find something better by using some kind of object-oriented solution.
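For comparison, a similar "pop" can be sketched with public pandas calls only, without subclassing. The helper name pop_rows and the sample data are hypothetical, and the sketch assumes the DataFrame has a unique index, since rows are removed by label. pandas may still allocate temporary copies internally, so this is not guaranteed to lower peak memory.

```python
import pandas as pd


def pop_rows(df, mask):
    """Remove the rows selected by a boolean mask from df and return them.

    Assumes df has a unique index, because the rows are dropped by label.
    """
    popped = df.loc[mask]                      # copy of the selected rows
    df.drop(index=popped.index, inplace=True)  # shrink the original frame
    return popped


all_df = pd.DataFrame({
    "value": range(10),
    "test": [i % 5 == 0 for i in range(10)],
})
test_df = pop_rows(all_df, all_df["test"])
```

After the call, all_df holds only the training rows and test_df holds the rows that were flagged for testing.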