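Reformatting my CSV file with pandas (turn one set of values into columns and match another set to the corresponding values)

Time: 2016-03-28 19:45:04

Tags: python numpy pandas

I have a dataset that I want to preprocess with pandas. Here is a sample containing two rows of the dataset: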

| text   | rank | date      | provinces.0 | provinces.1      | provinces.2      | provinces.3 | provinces.4 | provinces.5   | provinces.6               | provinces.7               | provinces.8  | provinces.9          | provinces.10 | provinceFrequency.0 | provinceFrequency.1 | provinceFrequency.2 | provinceFrequency.3 | provinceFrequency.4 | provinceFrequency.5 | provinceFrequency.6 | provinceFrequency.7 | provinceFrequency.8 | provinceFrequency.9 | provinceFrequency.10 |   |
|--------|------|-----------|-------------|------------------|------------------|-------------|-------------|---------------|---------------------------|---------------------------|--------------|----------------------|--------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------|---|
| Topic1 | 0    | 9/13/2015 | Ontario     | Quebec           | British Columbia | Alberta     | Manitoba    | Saskatchewan  | Newfoundland and Labrador |                           |              |                      |              | 192                 | 378                 | 8                   | 10                  | 1                   | 1                   | 1                   |                     |                     |                     |                      |   |
| Topic2 | 1    | 9/13/2015 | Manitoba    | British Columbia | Alberta          | Ontario     | Nova Scotia | New Brunswick | Quebec                    | Newfoundland and Labrador | Saskatchewan | Prince Edward Island | Nunavut      | 7                   | 61                  | 51                  | 112                 | 7                   | 8                   | 11                  | 2                   | 2                   | 1                   | 2                    |   |

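This dataset contains trending topics in tweets from different provinces of Canada. Each 'provinces.N' column holds a province name, and the matching 'provinceFrequency.N' column holds the number of hits for that trending topic in that province.

I want to convert it to the following format: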

| Topic  | Date      | Ontario | Quebec | Nova Scotia | New Brunswick | Manitoba | British Columbia | Prince Edward Island | Saskatchewan | Alberta | Newfoundland and Labrador |
|--------|-----------|---------|--------|-------------|---------------|----------|------------------|----------------------|--------------|---------|---------------------------|
| Topic1 | 9/13/2015 | 192     | 378    | -           | -             | 1        | 8                | -                    | 1            | 10      | 1                         |

I could do this with plain Python and a lot of code; is there a simple way to do it in pandas?

1 answer:

Answer 0: (score: 2)

This task is a bit tricky:

import pandas as pd

df = pd.read_csv(r'D:\download\Sheet1.csv')

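# `id_vars` helper list for `melt()`: the columns holding province names
# (raw string so that `\.` is a regex escape, not a Python string escape)
id_vars = df.columns[df.columns.str.contains(r'provinces\.')].tolist()

# `value_vars` helper list for `melt()`: the columns holding the counts
val_vars = df.columns[df.columns.str.contains(r'provinceFrequency\.')].tolist()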

mlt = pd.melt(df, id_vars=id_vars, value_vars=val_vars)

mlt['variable'] = mlt['variable'].str.replace(r'provinceFrequency', 'provinces')

# add column with the _correct_ province
mlt['prov'] = mlt.apply(lambda row: row[row['variable']], axis=1)

new = mlt[['prov', 'value']].reset_index()

# free memory
del mlt

# recover each row's position in the original df (used later for the join);
# this assumes df has the default 0..n-1 index
new['idx'] = new['index']%len(df)

# pivot (convert rows to columns)
pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first')

# free memory
del new

# join the original DataFrame `df` with the pivoted DataFrame `pvt` on the index
rslt = df[['text','rank','date']].join(pvt)

print(rslt)

Output:

     text  rank       date  Alberta  British Columbia  Manitoba  \
0  Topic1     0  9/13/2015     10.0               8.0       1.0
1  Topic2     1  9/13/2015     51.0              61.0       7.0

   New Brunswick  Newfoundland and Labrador  Nova Scotia  Nunavut  Ontario  \
0            NaN                        1.0          NaN      NaN    192.0
1            8.0                        2.0          7.0      2.0    112.0

   Prince Edward Island  Quebec  Saskatchewan
0                   NaN   378.0           1.0
1                   1.0    11.0           2.0
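
The same table can also be reached without `melt()` and the row-wise `apply`. The sketch below is an alternative to the code above, not a restatement of it: it rebuilds the two sample rows in memory (so no CSV file is needed), flattens the province block and the frequency block side by side with NumPy, and pivots. The names `slot`, `prov_cols`, `freq_cols` and `long` are illustrative, and the approach assumes that the column `provinces.N` always pairs with `provinceFrequency.N`.

```python
import numpy as np
import pandas as pd

# The two sample rows from the question, rebuilt in memory.
provinces = [
    ['Ontario', 'Quebec', 'British Columbia', 'Alberta', 'Manitoba',
     'Saskatchewan', 'Newfoundland and Labrador'],
    ['Manitoba', 'British Columbia', 'Alberta', 'Ontario', 'Nova Scotia',
     'New Brunswick', 'Quebec', 'Newfoundland and Labrador', 'Saskatchewan',
     'Prince Edward Island', 'Nunavut'],
]
freqs = [
    [192, 378, 8, 10, 1, 1, 1],
    [7, 61, 51, 112, 7, 8, 11, 2, 2, 1, 2],
]

# Ragged lists are padded with missing values, like the empty CSV cells.
df = pd.concat(
    [
        pd.DataFrame({'text': ['Topic1', 'Topic2'], 'rank': [0, 1],
                      'date': ['9/13/2015', '9/13/2015']}),
        pd.DataFrame(provinces).add_prefix('provinces.'),
        pd.DataFrame(freqs).add_prefix('provinceFrequency.'),
    ],
    axis=1,
)


def slot(col):
    """Numeric suffix of a column name such as 'provinces.10'."""
    return int(col.rsplit('.', 1)[1])


prov_cols = sorted((c for c in df.columns if c.startswith('provinces.')), key=slot)
freq_cols = sorted((c for c in df.columns if c.startswith('provinceFrequency.')), key=slot)

# One row per (original row, slot): flatten both blocks in the same order.
long = pd.DataFrame({
    'idx': np.repeat(df.index.to_numpy(), len(prov_cols)),
    'prov': df[prov_cols].to_numpy().ravel(),
    'value': df[freq_cols].to_numpy().ravel(),
}).dropna(subset=['prov'])

pvt = long.pivot(index='idx', columns='prov', values='value')
rslt = df[['text', 'rank', 'date']].join(pvt)
print(rslt)
```

Unlike `pivot_table(..., aggfunc='first')`, `pivot()` raises an error if a province appears twice in the same row, which may or may not be what you want.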

Step by step:

In [263]: id_vars
Out[263]:
['provinces.0',
 'provinces.1',
 'provinces.2',
 'provinces.3',
 'provinces.4',
 'provinces.5',
 'provinces.6',
 'provinces.7',
 'provinces.8',
 'provinces.9',
 'provinces.10']

In [264]: val_vars
Out[264]:
['provinceFrequency.0',
 'provinceFrequency.1',
 'provinceFrequency.2',
 'provinceFrequency.3',
 'provinceFrequency.4',
 'provinceFrequency.5',
 'provinceFrequency.6',
 'provinceFrequency.7',
 'provinceFrequency.8',
 'provinceFrequency.9',
 'provinceFrequency.10']

In [265]: mlt
Out[265]:
   provinces.0       provinces.1       provinces.2 provinces.3  provinces.4  \
0      Ontario            Quebec  British Columbia     Alberta     Manitoba
1     Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
2      Ontario            Quebec  British Columbia     Alberta     Manitoba
3     Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
4      Ontario            Quebec  British Columbia     Alberta     Manitoba
5     Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
6      Ontario            Quebec  British Columbia     Alberta     Manitoba
7     Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
8      Ontario            Quebec  British Columbia     Alberta     Manitoba
9     Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
10     Ontario            Quebec  British Columbia     Alberta     Manitoba
11    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
12     Ontario            Quebec  British Columbia     Alberta     Manitoba
13    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
14     Ontario            Quebec  British Columbia     Alberta     Manitoba
15    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
16     Ontario            Quebec  British Columbia     Alberta     Manitoba
17    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
18     Ontario            Quebec  British Columbia     Alberta     Manitoba
19    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia
20     Ontario            Quebec  British Columbia     Alberta     Manitoba
21    Manitoba  British Columbia           Alberta     Ontario  Nova Scotia

      provinces.5                provinces.6                provinces.7  \
0    Saskatchewan  Newfoundland and Labrador                        NaN
1   New Brunswick                     Quebec  Newfoundland and Labrador
2    Saskatchewan  Newfoundland and Labrador                        NaN
3   New Brunswick                     Quebec  Newfoundland and Labrador
4    Saskatchewan  Newfoundland and Labrador                        NaN
5   New Brunswick                     Quebec  Newfoundland and Labrador
6    Saskatchewan  Newfoundland and Labrador                        NaN
7   New Brunswick                     Quebec  Newfoundland and Labrador
8    Saskatchewan  Newfoundland and Labrador                        NaN
9   New Brunswick                     Quebec  Newfoundland and Labrador
10   Saskatchewan  Newfoundland and Labrador                        NaN
11  New Brunswick                     Quebec  Newfoundland and Labrador
12   Saskatchewan  Newfoundland and Labrador                        NaN
13  New Brunswick                     Quebec  Newfoundland and Labrador
14   Saskatchewan  Newfoundland and Labrador                        NaN
15  New Brunswick                     Quebec  Newfoundland and Labrador
16   Saskatchewan  Newfoundland and Labrador                        NaN
17  New Brunswick                     Quebec  Newfoundland and Labrador
18   Saskatchewan  Newfoundland and Labrador                        NaN
19  New Brunswick                     Quebec  Newfoundland and Labrador
20   Saskatchewan  Newfoundland and Labrador                        NaN
21  New Brunswick                     Quebec  Newfoundland and Labrador

     provinces.8           provinces.9 provinces.10      variable  value  \
0            NaN                   NaN          NaN   provinces.0  192.0
1   Saskatchewan  Prince Edward Island      Nunavut   provinces.0    7.0
2            NaN                   NaN          NaN   provinces.1  378.0
3   Saskatchewan  Prince Edward Island      Nunavut   provinces.1   61.0
4            NaN                   NaN          NaN   provinces.2    8.0
5   Saskatchewan  Prince Edward Island      Nunavut   provinces.2   51.0
6            NaN                   NaN          NaN   provinces.3   10.0
7   Saskatchewan  Prince Edward Island      Nunavut   provinces.3  112.0
8            NaN                   NaN          NaN   provinces.4    1.0
9   Saskatchewan  Prince Edward Island      Nunavut   provinces.4    7.0
10           NaN                   NaN          NaN   provinces.5    1.0
11  Saskatchewan  Prince Edward Island      Nunavut   provinces.5    8.0
12           NaN                   NaN          NaN   provinces.6    1.0
13  Saskatchewan  Prince Edward Island      Nunavut   provinces.6   11.0
14           NaN                   NaN          NaN   provinces.7    NaN
15  Saskatchewan  Prince Edward Island      Nunavut   provinces.7    2.0
16           NaN                   NaN          NaN   provinces.8    NaN
17  Saskatchewan  Prince Edward Island      Nunavut   provinces.8    2.0
18           NaN                   NaN          NaN   provinces.9    NaN
19  Saskatchewan  Prince Edward Island      Nunavut   provinces.9    1.0
20           NaN                   NaN          NaN  provinces.10    NaN
21  Saskatchewan  Prince Edward Island      Nunavut  provinces.10    2.0

                         prov
0                     Ontario
1                    Manitoba
2                      Quebec
3            British Columbia
4            British Columbia
5                     Alberta
6                     Alberta
7                     Ontario
8                    Manitoba
9                 Nova Scotia
10               Saskatchewan
11              New Brunswick
12  Newfoundland and Labrador
13                     Quebec
14                        NaN
15  Newfoundland and Labrador
16                        NaN
17               Saskatchewan
18                        NaN
19       Prince Edward Island
20                        NaN
21                    Nunavut
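
The `prov` column is a row-wise lookup: each row takes its value from the column whose name is stored in `variable`. On large frames `apply(..., axis=1)` can be slow, so a vectorized equivalent may be worth considering. The sketch below uses only the first two province slots of the sample; the names `col_pos` and `row_pos` are illustrative.

```python
import numpy as np
import pandas as pd

# First two province slots of the sample, already melted and with
# `variable` renamed to the matching `provinces.*` column.
mlt = pd.DataFrame({
    'provinces.0': ['Ontario', 'Manitoba', 'Ontario', 'Manitoba'],
    'provinces.1': ['Quebec', 'British Columbia', 'Quebec', 'British Columbia'],
    'variable': ['provinces.0', 'provinces.0', 'provinces.1', 'provinces.1'],
    'value': [192, 7, 378, 61],
})

# Row-wise version, as in the answer.
by_apply = mlt.apply(lambda row: row[row['variable']], axis=1)

# Vectorized version: translate each column name into a column position,
# then pick exactly one cell per row with NumPy integer indexing.
col_pos = mlt.columns.get_indexer(mlt['variable'])
row_pos = np.arange(len(mlt))
mlt['prov'] = mlt.to_numpy()[row_pos, col_pos]

print(mlt[['prov', 'value']])
```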

In [269]: mlt[['prov', 'value']].reset_index()
Out[269]:
    index                       prov  value
0       0                    Ontario  192.0
1       1                   Manitoba    7.0
2       2                     Quebec  378.0
3       3           British Columbia   61.0
4       4           British Columbia    8.0
5       5                    Alberta   51.0
6       6                    Alberta   10.0
7       7                    Ontario  112.0
8       8                   Manitoba    1.0
9       9                Nova Scotia    7.0
10     10               Saskatchewan    1.0
11     11              New Brunswick    8.0
12     12  Newfoundland and Labrador    1.0
13     13                     Quebec   11.0
14     14                        NaN    NaN
15     15  Newfoundland and Labrador    2.0
16     16                        NaN    NaN
17     17               Saskatchewan    2.0
18     18                        NaN    NaN
19     19       Prince Edward Island    1.0
20     20                        NaN    NaN
21     21                    Nunavut    2.0

In [270]: # set original index for joining in future

In [271]: new['idx'] = new['index']%len(df)

In [272]: new
Out[272]:
    index                       prov  value  idx
0       0                    Ontario  192.0    0
1       1                   Manitoba    7.0    1
2       2                     Quebec  378.0    0
3       3           British Columbia   61.0    1
4       4           British Columbia    8.0    0
5       5                    Alberta   51.0    1
6       6                    Alberta   10.0    0
7       7                    Ontario  112.0    1
8       8                   Manitoba    1.0    0
9       9                Nova Scotia    7.0    1
10     10               Saskatchewan    1.0    0
11     11              New Brunswick    8.0    1
12     12  Newfoundland and Labrador    1.0    0
13     13                     Quebec   11.0    1
14     14                        NaN    NaN    0
15     15  Newfoundland and Labrador    2.0    1
16     16                        NaN    NaN    0
17     17               Saskatchewan    2.0    1
18     18                        NaN    NaN    0
19     19       Prince Edward Island    1.0    1
20     20                        NaN    NaN    0
21     21                    Nunavut    2.0    1
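
The modulo works because `melt()` stacks the value columns one block after another, and each block holds the original rows in their original order, so position `i` of the melted frame belongs to original row `i % len(df)`. This only holds when `df` has the default 0..n-1 index. In pandas 1.1 and later, `melt()` also accepts `ignore_index=False`, which keeps the original index labels and makes the arithmetic unnecessary. A small sketch on the first two slots of the sample (the names `by_modulo` and `kept` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'provinces.0': ['Ontario', 'Manitoba'],
    'provinces.1': ['Quebec', 'British Columbia'],
    'provinceFrequency.0': [192, 7],
    'provinceFrequency.1': [378, 61],
})
id_vars = ['provinces.0', 'provinces.1']
val_vars = ['provinceFrequency.0', 'provinceFrequency.1']

# Default behaviour: fresh 0..n-1 index, recover the source row with a modulo.
by_modulo = pd.melt(df, id_vars=id_vars, value_vars=val_vars).index % len(df)

# ignore_index=False: every melted row keeps its original index label.
kept = pd.melt(df, id_vars=id_vars, value_vars=val_vars, ignore_index=False)

print(list(by_modulo))
print(list(kept.index))
```

Both lines print the same sequence of original row labels, so `kept.index` could be used directly as the `idx` column.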

In [273]: pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first')

In [274]: pvt
Out[274]:
prov  Alberta  British Columbia  Manitoba  New Brunswick  \
idx
0        10.0               8.0       1.0            NaN
1        51.0              61.0       7.0            8.0

prov  Newfoundland and Labrador  Nova Scotia  Nunavut  Ontario  \
idx
0                           1.0          NaN      NaN    192.0
1                           2.0          7.0      2.0    112.0

prov  Prince Edward Island  Quebec  Saskatchewan
idx
0                      NaN   378.0           1.0
1                      1.0    11.0           2.0

In [275]: rslt = df[['text','rank','date']].join(pvt)

In [276]: rslt
Out[276]:
     text  rank       date  Alberta  British Columbia  Manitoba  \
0  Topic1     0  9/13/2015     10.0               8.0       1.0
1  Topic2     1  9/13/2015     51.0              61.0       7.0

   New Brunswick  Newfoundland and Labrador  Nova Scotia  Nunavut  Ontario  \
0            NaN                        1.0          NaN      NaN    192.0
1            8.0                        2.0          7.0      2.0    112.0

   Prince Edward Island  Quebec  Saskatchewan
0                   NaN   378.0           1.0
1                   1.0    11.0           2.0
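
The result still differs from the layout requested in the question: the counts are floats, `rank` is present, and missing provinces show as `NaN` rather than `-`. One possible finishing step is sketched below on a hand-copied subset of `rslt` with three of the province columns. The names `wanted`, `tidy` and `shown`, and the chosen column order, are illustrative.

```python
import numpy as np
import pandas as pd

# Subset of the `rslt` frame shown above (three of the province columns).
rslt = pd.DataFrame({
    'text': ['Topic1', 'Topic2'],
    'rank': [0, 1],
    'date': ['9/13/2015', '9/13/2015'],
    'Nova Scotia': [np.nan, 7.0],
    'Ontario': [192.0, 112.0],
    'Quebec': [378.0, 11.0],
})

wanted = ['Ontario', 'Quebec', 'Nova Scotia']  # desired province order

tidy = (
    rslt.drop(columns='rank')
        .rename(columns={'text': 'Topic', 'date': 'Date'})
        .loc[:, ['Topic', 'Date'] + wanted]
)

# Nullable integers drop the trailing ".0" while keeping missing values.
tidy = tidy.astype({c: 'Int64' for c in wanted})

# For display only: "-" placeholders turn the count columns into strings.
shown = tidy.astype({c: 'string' for c in wanted}).fillna('-')
print(shown)
```

Keeping `tidy` (numeric) for further computation and `shown` (strings) only for printing avoids mixing numbers and `-` in the same column.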