Jupyter Notebook

Question

I have read the following CSV file into an IPython Notebook:

public = pd.read_csv("categories.csv")
public

I have also imported pandas as pd, numpy as np and matplotlib.pyplot as plt. The following data types are present (the below is a summary - there are about 100 columns):

In [36]:   public.dtypes
Out[37]:   parks          object
           playgrounds    object
           sports         object
           roading        object               
           resident       int64
           children       int64

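I want to change 'parks', 'playgrounds', 'sports' and 'roading' to categories. They contain Likert scale responses, although each column has a different type of Likert response (e.g. one has "strongly agree", "agree" etc., another has "very important", "important" etc.). The remaining columns should stay as int64.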

I was able to create a separate dataframe - public1 - and change one of the columns to a category type using the following code:

public1 = {'parks': public.parks}
public1 = public1['parks'].astype('category')

However, when I tried to change several columns at once using this code, I was unsuccessful:

public1 = {'parks': public.parks,
           'playgrounds': public.parks}
public1 = public1['parks', 'playgrounds'].astype('category')

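In any case, I don't want to create a separate dataframe containing only the category columns. I would like them changed in the original dataframe.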

I tried numerous ways to achieve this, then tried the code from here: Pandas: change data type of columns ...

public[['parks', 'playgrounds', 'sports', 'roading']] = public[['parks', 'playgrounds', 'sports', 'roading']].astype('category')

and got the following error:

 NotImplementedError: > 1 ndim Categorical are not supported at this time

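Is there any way to change 'parks', 'playgrounds', 'sports' and 'roading' to categories (so the Likert scale responses can then be analysed), leaving 'resident' and 'children' (and the 94 other columns, which are string, int and float) untouched? Or is there a better way to do this? If anyone has any suggestions and/or feedback I would be most grateful... I am slowly going bald from ripping my hair out!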

Many thanks in advance.

Edited to add - I am using Python 2.7.

Answer 1

Sometimes, you just have to use a for-loop:

for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
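
To see the loop end to end, here is a minimal sketch on a small made-up frame. The data values and the name likert_cols are illustrative assumptions, not taken from the question's CSV. Each column ends up with its own set of categories, and the integer columns are left alone.

```python
import pandas as pd

# Hypothetical miniature of the survey data; the values are made up.
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'agree'],
    'playgrounds': ['disagree', 'agree', 'agree'],
    'sports': ['important', 'very important', 'important'],
    'roading': ['not important', 'important', 'important'],
    'resident': [1, 2, 3],
    'children': [0, 1, 2],
})

# Convert one column at a time, writing each result back into the frame.
likert_cols = ['parks', 'playgrounds', 'sports', 'roading']
for col in likert_cols:
    public[col] = public[col].astype('category')

# The four Likert columns are now 'category'; the others are unchanged.
print(public.dtypes)
```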

Answer 2

You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use

df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))

I don't know of a way to do this in place, so I usually end up with something like this:

df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))

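Obviously you can replace .select_dtypes with explicit column names if you don't want to select every column of a given data type (although in your example it seems you wanted all the object columns).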
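
A small sketch of both steps, on a made-up frame (the values and the names converted and object_cols are assumptions for illustration). It shows that apply hands back a new frame and leaves df alone until the result is assigned back. The text columns are created with dtype=object explicitly so the sketch does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data; text columns are given dtype=object explicitly.
df = pd.DataFrame({
    'parks': pd.Series(['agree', 'disagree', 'agree'], dtype=object),
    'sports': pd.Series(['important', 'important', 'very important'], dtype=object),
    'resident': [1, 2, 3],
})

# apply returns a new, converted frame ...
converted = df[['parks', 'sports']].apply(lambda x: x.astype('category'))
# ... while df itself still holds object columns at this point.
parks_dtype_before = str(df['parks'].dtype)

# Assign the converted columns back to change df itself.
object_cols = df.select_dtypes(include=['object']).columns
df[object_cols] = df[object_cols].apply(lambda x: x.astype('category'))

print(parks_dtype_before)
print(df.dtypes)
```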

Answer 3

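As of pandas 0.19.0, What's New describes that read_csv supports parsing Categorical columns directly. This answer applies only if you are starting from read_csv; otherwise, I think unutbu's answer is still the best. Example on 10,000 records: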

import pandas as pd
import numpy as np

# Generate random data, four category-like columns, two int columns
N=10000
categories = pd.DataFrame({
            'parks' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'playgrounds' : np.random.choice(['strongly agree','agree', 'disagree'], size=N),
            'sports' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'roading' : np.random.choice(['important', 'very important', 'not important'], size=N),
            'resident' : np.random.choice([1, 2, 3], size=N),
            'children' : np.random.choice([0, 1, 2, 3], size=N)
                       })
categories.to_csv('categories_large.csv', index=False)

< 0.19.0 (or >= 0.19.0 without specifying dtype)

pd.read_csv('categories_large.csv').dtypes # inspect default dtypes

children        int64
parks          object
playgrounds    object
resident        int64
roading        object
sports         object
dtype: object

>= 0.19.0

For mixed dtypes, parsing selected columns as Categorical can be done by passing a dictionary dtype={'colname' : 'category', ...} to read_csv.

pd.read_csv('categories_large.csv', dtype={'parks': 'category',
                                           'playgrounds': 'category',
                                           'sports': 'category',
                                           'roading': 'category'}).dtypes
children          int64
parks          category
playgrounds    category
resident          int64
roading        category
sports         category
dtype: object
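
The same idea can be tried without any file on disk. This is a minimal sketch that reads an in-memory CSV through io.StringIO; the CSV text and the names csv_text and parsed are made up for illustration. Columns not named in the dict keep their inferred dtype.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for categories_large.csv.
csv_text = """parks,playgrounds,resident
agree,disagree,1
strongly agree,agree,2
agree,agree,3
"""

# Only the columns named in the dict are parsed as categories.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'parks': 'category', 'playgrounds': 'category'},
)

print(parsed.dtypes)
```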

Performance

A slight speed-up (measured in a local Jupyter notebook), as mentioned in the release notes.

# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop

# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop

Answer 4

I found that using a for loop works well.

for col in ['col_variable_name_1', 'col_variable_name_2']:  # etc.
    dataframe_name[col] = dataframe_name[col].astype(float)

Answer 5


In my case, I had a large Dataframe with many object columns that I wanted to convert to category.

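So what I did was select the object columns, fill in the missing values (NA), and then save the result back into the original Dataframe, as follows: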

# Sketch of the steps described above; the frame name df and the
# fill value 'Missing' are assumptions
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].fillna('Missing').astype('category')
I hope this is helpful for future reference.

Answer 6

No need for loops: Pandas can now do this directly. Just pass a list of the columns you want to convert and Pandas will convert them all.

cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')
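
Here is a minimal sketch of the direct conversion on a made-up frame. As a possible extension for Likert data, it also converts to an ordered category so that responses can be compared. The data values, the scale order in agree_scale, and the names ordered and at_least_agree are assumptions for illustration, not part of the question.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical miniature of the survey data.
public = pd.DataFrame({
    'parks': ['agree', 'strongly agree', 'disagree'],
    'playgrounds': ['disagree', 'agree', 'agree'],
    'resident': [1, 2, 3],
})

# Direct conversion of several columns at once, written back in place.
cols = ['parks', 'playgrounds']
public[cols] = public[cols].astype('category')

# Optional: an ordered scale (the order here is an assumed one).
agree_scale = CategoricalDtype(
    ['disagree', 'agree', 'strongly agree'], ordered=True
)
# astype with a dict returns a new frame with per-column dtypes.
ordered = public.astype({col: agree_scale for col in cols})

# Ordered categories support comparisons against a level of the scale.
at_least_agree = ordered['parks'] >= 'agree'
print(at_least_agree.tolist())
```

Whether an ordered scale is worth the extra step depends on the analysis; plain 'category' is enough for counting and grouping.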

Answer 7

To keep things easy: no apply, no map, no loop.

    cols=data.select_dtypes(exclude='int').columns.to_list()
    data[cols]=data[cols].astype('category')
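
One thing to check before using this on a frame with mixed types: excluding 'int' keeps every other dtype, floats included. The sketch below, on made-up data (the column score and the names non_int_cols and object_cols are assumptions), shows what gets selected and a narrower alternative that targets only object columns.

```python
import pandas as pd

# Hypothetical frame with text, integer and float columns.
data = pd.DataFrame({
    'parks': pd.Series(['agree', 'disagree', 'agree'], dtype=object),
    'resident': [1, 2, 3],
    'score': [0.5, 1.5, 2.5],
})

# exclude='int' drops only the integer column, so 'score' is selected too.
non_int_cols = data.select_dtypes(exclude='int').columns.to_list()
print(non_int_cols)

# Narrower alternative: select only the object columns and convert those.
object_cols = data.select_dtypes(include='object').columns.to_list()
data[object_cols] = data[object_cols].astype('category')
print(data.dtypes)
```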

Answer 8

Using a list comprehension (avoiding loops), this converts all columns with dtypes=object to dtypes=category. To keep it generic, I have used 'df' as the dataframe.

df[[col for col in df.columns if df[col].dtypes == object]].astype('category', copy=False)

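Note that astype returns a new object, so the line above does not change df itself, with or without the "copy=False" argument (which the pandas documentation advises using with care). To store the result, or if you simply want to avoid "copy=False", assign it back with the following line.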

df[[col for col in df.columns if df[col].dtypes == object]] = df[[col for col in df.columns if df[col].dtypes == object]].astype('category')

This is my first answer on Stack, so please be kind.

Python Pandas - Changing some column types to categories
