我有一个数据集,我想用熊猫预处理。这是一个包含两行数据集的示例:
| text | rank | date | provinces.0 | provinces.1 | provinces.2 | provinces.3 | provinces.4 | provinces.5 | provinces.6 | provinces.7 | provinces.8 | provinces.9 | provinces.10 | provinceFrequency.0 | provinceFrequency.1 | provinceFrequency.2 | provinceFrequency.3 | provinceFrequency.4 | provinceFrequency.5 | provinceFrequency.6 | provinceFrequency.7 | provinceFrequency.8 | provinceFrequency.9 | provinceFrequency.10 | |
|--------|------|-----------|-------------|------------------|------------------|-------------|-------------|---------------|---------------------------|---------------------------|--------------|----------------------|--------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|----------------------|---|
| Topic1 | 0 | 9/13/2015 | Ontario | Quebec | British Columbia | Alberta | Manitoba | Saskatchewan | Newfoundland and Labrador | | | | | 192 | 378 | 8 | 10 | 1 | 1 | 1 | | | | | |
| Topic2 | 1 | 9/13/2015 | Manitoba | British Columbia | Alberta | Ontario | Nova Scotia | New Brunswick | Quebec | Newfoundland and Labrador | Saskatchewan | Prince Edward Island | Nunavut | 7 | 61 | 51 | 112 | 7 | 8 | 11 | 2 | 2 | 1 | 2 | |
| | | | | | | | | | | | | | | | | | | | | | | | | | |
此数据集包含加拿大不同省份的推文中的趋势主题,以及“省”列。表示省名和省名频率0'表示该省份该趋势主题的点击次数。
我想将其转换为以下格式:
| Topic | Date | Ontario | Quebec | Nova Scotia | New Brunswick | Manitoba | British Columbia | Prince Edward Island | Saskatchewan | Alberta | Newfoundland and Labrador |
|--------|-----------|---------|--------|-------------|---------------|----------|------------------|----------------------|--------------|---------|---------------------------|
| Topic1 | 9/13/2015 | 192 | 378 | - | - | 1 | 8 | - | 1 | 10 | 1 |
我可以使用原生Python代码和大量代码来完成它,在pandas中有没有简单的方法呢?
答案 0 :(得分:2)
这项任务有点棘手:
import pandas as pd
df = pd.read_csv(r'D:\download\Sheet1.csv')
# `id_vars` helper list for `melt()`
id_vars = df.columns[df.columns.str.contains('provinces\.')].tolist()
# `value_vars` helper list for `melt()`
val_vars = df.columns[df.columns.str.contains('provinceFrequency\.')].tolist()
mlt = pd.melt(df, id_vars=id_vars, value_vars=val_vars)
mlt['variable'] = mlt['variable'].str.replace(r'provinceFrequency', 'provinces')
# add column with the _correct_ province
mlt['prov'] = mlt.apply(lambda row: row[row['variable']], axis=1)
new = mlt[['prov', 'value']].reset_index()
# free memory
del mlt
# set original df's index for joining in future
new['idx'] = new['index']%len(df)
# pivot (convert rows to columns)
pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first')
# free memory
del new
# join original `DF` with the pivoted DF `PVT` using index
rslt = df[['text','rank','date']].join(pvt)
print(rslt)
输出:
text rank date Alberta British Columbia Manitoba \
0 Topic1 0 9/13/2015 10.0 8.0 1.0
1 Topic2 1 9/13/2015 51.0 61.0 7.0
New Brunswick Newfoundland and Labrador Nova Scotia Nunavut Ontario \
0 NaN 1.0 NaN NaN 192.0
1 8.0 2.0 7.0 2.0 112.0
Prince Edward Island Quebec Saskatchewan
0 NaN 378.0 1.0
1 1.0 11.0 2.0
一步一步:
In [263]: id_vars
Out[263]:
['provinces.0',
'provinces.1',
'provinces.2',
'provinces.3',
'provinces.4',
'provinces.5',
'provinces.6',
'provinces.7',
'provinces.8',
'provinces.9',
'provinces.10']
In [264]: val_vars
Out[264]:
['provinceFrequency.0',
'provinceFrequency.1',
'provinceFrequency.2',
'provinceFrequency.3',
'provinceFrequency.4',
'provinceFrequency.5',
'provinceFrequency.6',
'provinceFrequency.7',
'provinceFrequency.8',
'provinceFrequency.9',
'provinceFrequency.10']
In [265]: mlt
Out[265]:
provinces.0 provinces.1 provinces.2 provinces.3 provinces.4 \
0 Ontario Quebec British Columbia Alberta Manitoba
1 Manitoba British Columbia Alberta Ontario Nova Scotia
2 Ontario Quebec British Columbia Alberta Manitoba
3 Manitoba British Columbia Alberta Ontario Nova Scotia
4 Ontario Quebec British Columbia Alberta Manitoba
5 Manitoba British Columbia Alberta Ontario Nova Scotia
6 Ontario Quebec British Columbia Alberta Manitoba
7 Manitoba British Columbia Alberta Ontario Nova Scotia
8 Ontario Quebec British Columbia Alberta Manitoba
9 Manitoba British Columbia Alberta Ontario Nova Scotia
10 Ontario Quebec British Columbia Alberta Manitoba
11 Manitoba British Columbia Alberta Ontario Nova Scotia
12 Ontario Quebec British Columbia Alberta Manitoba
13 Manitoba British Columbia Alberta Ontario Nova Scotia
14 Ontario Quebec British Columbia Alberta Manitoba
15 Manitoba British Columbia Alberta Ontario Nova Scotia
16 Ontario Quebec British Columbia Alberta Manitoba
17 Manitoba British Columbia Alberta Ontario Nova Scotia
18 Ontario Quebec British Columbia Alberta Manitoba
19 Manitoba British Columbia Alberta Ontario Nova Scotia
20 Ontario Quebec British Columbia Alberta Manitoba
21 Manitoba British Columbia Alberta Ontario Nova Scotia
provinces.5 provinces.6 provinces.7 \
0 Saskatchewan Newfoundland and Labrador NaN
1 New Brunswick Quebec Newfoundland and Labrador
2 Saskatchewan Newfoundland and Labrador NaN
3 New Brunswick Quebec Newfoundland and Labrador
4 Saskatchewan Newfoundland and Labrador NaN
5 New Brunswick Quebec Newfoundland and Labrador
6 Saskatchewan Newfoundland and Labrador NaN
7 New Brunswick Quebec Newfoundland and Labrador
8 Saskatchewan Newfoundland and Labrador NaN
9 New Brunswick Quebec Newfoundland and Labrador
10 Saskatchewan Newfoundland and Labrador NaN
11 New Brunswick Quebec Newfoundland and Labrador
12 Saskatchewan Newfoundland and Labrador NaN
13 New Brunswick Quebec Newfoundland and Labrador
14 Saskatchewan Newfoundland and Labrador NaN
15 New Brunswick Quebec Newfoundland and Labrador
16 Saskatchewan Newfoundland and Labrador NaN
17 New Brunswick Quebec Newfoundland and Labrador
18 Saskatchewan Newfoundland and Labrador NaN
19 New Brunswick Quebec Newfoundland and Labrador
20 Saskatchewan Newfoundland and Labrador NaN
21 New Brunswick Quebec Newfoundland and Labrador
provinces.8 provinces.9 provinces.10 variable value \
0 NaN NaN NaN provinces.0 192.0
1 Saskatchewan Prince Edward Island Nunavut provinces.0 7.0
2 NaN NaN NaN provinces.1 378.0
3 Saskatchewan Prince Edward Island Nunavut provinces.1 61.0
4 NaN NaN NaN provinces.2 8.0
5 Saskatchewan Prince Edward Island Nunavut provinces.2 51.0
6 NaN NaN NaN provinces.3 10.0
7 Saskatchewan Prince Edward Island Nunavut provinces.3 112.0
8 NaN NaN NaN provinces.4 1.0
9 Saskatchewan Prince Edward Island Nunavut provinces.4 7.0
10 NaN NaN NaN provinces.5 1.0
11 Saskatchewan Prince Edward Island Nunavut provinces.5 8.0
12 NaN NaN NaN provinces.6 1.0
13 Saskatchewan Prince Edward Island Nunavut provinces.6 11.0
14 NaN NaN NaN provinces.7 NaN
15 Saskatchewan Prince Edward Island Nunavut provinces.7 2.0
16 NaN NaN NaN provinces.8 NaN
17 Saskatchewan Prince Edward Island Nunavut provinces.8 2.0
18 NaN NaN NaN provinces.9 NaN
19 Saskatchewan Prince Edward Island Nunavut provinces.9 1.0
20 NaN NaN NaN provinces.10 NaN
21 Saskatchewan Prince Edward Island Nunavut provinces.10 2.0
prov
0 Ontario
1 Manitoba
2 Quebec
3 British Columbia
4 British Columbia
5 Alberta
6 Alberta
7 Ontario
8 Manitoba
9 Nova Scotia
10 Saskatchewan
11 New Brunswick
12 Newfoundland and Labrador
13 Quebec
14 NaN
15 Newfoundland and Labrador
16 NaN
17 Saskatchewan
18 NaN
19 Prince Edward Island
20 NaN
21 Nunavut
In [269]: mlt[['prov', 'value']].reset_index()
Out[269]:
index prov value
0 0 Ontario 192.0
1 1 Manitoba 7.0
2 2 Quebec 378.0
3 3 British Columbia 61.0
4 4 British Columbia 8.0
5 5 Alberta 51.0
6 6 Alberta 10.0
7 7 Ontario 112.0
8 8 Manitoba 1.0
9 9 Nova Scotia 7.0
10 10 Saskatchewan 1.0
11 11 New Brunswick 8.0
12 12 Newfoundland and Labrador 1.0
13 13 Quebec 11.0
14 14 NaN NaN
15 15 Newfoundland and Labrador 2.0
16 16 NaN NaN
17 17 Saskatchewan 2.0
18 18 NaN NaN
19 19 Prince Edward Island 1.0
20 20 NaN NaN
21 21 Nunavut 2.0
In [270]: # set original index for joining in future
In [271]: new['idx'] = new['index']%len(df)
In [272]: new
Out[272]:
index prov value idx
0 0 Ontario 192.0 0
1 1 Manitoba 7.0 1
2 2 Quebec 378.0 0
3 3 British Columbia 61.0 1
4 4 British Columbia 8.0 0
5 5 Alberta 51.0 1
6 6 Alberta 10.0 0
7 7 Ontario 112.0 1
8 8 Manitoba 1.0 0
9 9 Nova Scotia 7.0 1
10 10 Saskatchewan 1.0 0
11 11 New Brunswick 8.0 1
12 12 Newfoundland and Labrador 1.0 0
13 13 Quebec 11.0 1
14 14 NaN NaN 0
15 15 Newfoundland and Labrador 2.0 1
16 16 NaN NaN 0
17 17 Saskatchewan 2.0 1
18 18 NaN NaN 0
19 19 Prince Edward Island 1.0 1
20 20 NaN NaN 0
21 21 Nunavut 2.0 1
In [273]: pvt = pd.pivot_table(new, index='idx', columns='prov', values='value', aggfunc='first')
In [274]: pvt
Out[274]:
prov Alberta British Columbia Manitoba New Brunswick \
idx
0 10.0 8.0 1.0 NaN
1 51.0 61.0 7.0 8.0
prov Newfoundland and Labrador Nova Scotia Nunavut Ontario \
idx
0 1.0 NaN NaN 192.0
1 2.0 7.0 2.0 112.0
prov Prince Edward Island Quebec Saskatchewan
idx
0 NaN 378.0 1.0
1 1.0 11.0 2.0
In [275]: rslt = df[['text','rank','date']].join(pvt)
In [276]: rslt
Out[276]:
text rank date Alberta British Columbia Manitoba \
0 Topic1 0 9/13/2015 10.0 8.0 1.0
1 Topic2 1 9/13/2015 51.0 61.0 7.0
New Brunswick Newfoundland and Labrador Nova Scotia Nunavut Ontario \
0 NaN 1.0 NaN NaN 192.0
1 8.0 2.0 7.0 2.0 112.0
Prince Edward Island Quebec Saskatchewan
0 NaN 378.0 1.0
1 1.0 11.0 2.0