I have the following data frame, and I want to insert a string based on the value in the sic2 column.
conm sic2
115466 ALLEGION PLC 34.0
115471 AGILITY HEALTH INC 80.0
115473 NORDIC AMERICAN OFFSHORE 44.0
115474 AAD 54.0
115477 DORIAN LPG LTD 44.0
115484 NOMAD FOODS LTD 20.0
115486 ATHENE HOLDING LTD 63.0
115490 MIDATECH PHARMA PLC 28.0
115495 MOTIF BIO PLC 28.0
The sic2 number ranges that correspond to each string are as follows.
1-9 Agriculture, Forestry and Fishing
10-14 Mining
15-17 Construction
18-19 not used
20-39 Manufacturing
40-49 Transportation, Communications, Electric, Gas and Sanitary service
50-51 Wholesale Trade
52-59 Retail Trade
60-67 Finance, Insurance and Real Estate
70-89 Services
91-97 Public Administration
99-99 Nonclassifiable
0 -1 Agricultural Production-Crops
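How can I make the pandas.DataFrame look like the one below, applied across the whole large dataset?
I tried several conditional approaches, but they all failed.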
conm sic2 industry
115466 ALLEGION PLC 34.0 Manufacturing
115471 AGILITY HEALTH INC 80.0 Services
115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, Electric, Gas and Sanitary service
115474 AAD 54.0 Retail Trade
Answer 0 (score: 2)
If you convert the SIC ranges into a dictionary, looking up the industry as needed is straightforward:
Code:
sic = [x.strip().split(' ', 1) for x in """
1-9 Agriculture, Forestry and Fishing
10-14 Mining
15-17 Construction
18-19 not used
20-39 Manufacturing
40-49 Transportation, Communications, ...
50-51 Wholesale Trade
52-59 Retail Trade
60-67 Finance, Insurance and Real Estate
70-89 Services
91-97 Public Administration
99-99 Nonclassifiable
""".split('\n')[1:-1]]
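# Expand each inclusive "lo-hi" range into one code -> industry entry.
# range() excludes its stop value, so use hi + 1 to keep codes such as 9, 39 and 99.
sic_dict = {}
for v, z in sic:
    lo, hi = (int(y) for y in v.split('-'))
    sic_dict.update((x, z) for x in range(lo, hi + 1))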
Test code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
number conm sic2
115466 ALLEGION PLC 34.0
115471 AGILITY HEALTH INC 80.0
115473 NORDIC AMERICAN OFFSHORE 44.0
115474 AAD 54.0
115477 DORIAN LPG LTD 44.0
115484 NOMAD FOODS LTD 20.0
115486 ATHENE HOLDING LTD 63.0
115490 MIDATECH PHARMA PLC 28.0
115495 MOTIF BIO PLC 28.0"""), header=1)
df['industry'] = df.sic2.apply(lambda x: sic_dict[int(x)])
print(df)
Result:
number conm sic2 industry
0 115466 ALLEGION PLC 34.0 Manufacturing
1 115471 AGILITY HEALTH INC 80.0 Services
2 115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, ...
3 115474 AAD 54.0 Retail Trade
4 115477 DORIAN LPG LTD 44.0 Transportation, Communications, ...
5 115484 NOMAD FOODS LTD 20.0 Manufacturing
6 115486 ATHENE HOLDING LTD 63.0 Finance, Insurance and Real Estate
7 115490 MIDATECH PHARMA PLC 28.0 Manufacturing
8 115495 MOTIF BIO PLC 28.0 Manufacturing
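As a variation on the dictionary lookup above, `Series.map` can do the same job without a lambda, and it yields NaN for codes that have no entry instead of raising `KeyError`. The sketch below is only illustrative: the `ranges` table is an abbreviated, hypothetical subset of the SIC list, and the `UNKNOWN CO` row with code 68 is invented to show the unmapped case.

```python
import pandas as pd

# Hypothetical, abbreviated mapping: inclusive (start, end) range -> industry.
ranges = {
    (20, 39): "Manufacturing",
    (40, 49): "Transportation, Communications, ...",
    (52, 59): "Retail Trade",
    (70, 89): "Services",
}

# Expand each inclusive range into one dictionary entry per code.
sic_dict = {code: name
            for (lo, hi), name in ranges.items()
            for code in range(lo, hi + 1)}

# Sample rows; "UNKNOWN CO" is an invented row whose code falls in a gap.
df = pd.DataFrame({"conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "UNKNOWN CO"],
                   "sic2": [34.0, 80.0, 68.0]})

# Cast to int so the values match the integer keys, then look each one up.
# Codes without an entry become NaN rather than raising KeyError.
df["industry"] = df["sic2"].astype(int).map(sic_dict)
print(df)
```

Note that `astype(int)` assumes sic2 has no missing values; rows with NaN codes would need to be handled first.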
Answer 1 (score: 0)
#Save your mapping table to a data frame
df2 = pd.DataFrame({'id_end': {0: 9, 1: 14, 2: 17, 3: 19, 4: 39, 5: 49, 6: 51, 7: 59, 8: 67, 9: 89, 10: 97, 11: 99, 12: 1},
'id_start': {0: 1, 1: 10, 2: 15, 3: 18, 4: 20, 5: 40, 6: 50, 7: 52, 8: 60, 9: 70, 10: 91, 11: 99, 12: 0},
'industry': {0: 'Agriculture, Forestry and Fishing', 1: 'Mining', 2: 'Construction', 3: 'not used', 4: 'Manufacturing',
5: 'Transportation, Communications, Electric, Gas and Sanitary service',
6: 'Wholesale Trade', 7: 'Retail Trade', 8: 'Finance, Insurance and Real Estate', 9: 'Services',
10: 'Public Administration', 11: 'Nonclassifiable', 12: 'Agricultural Production Crops'}})
df2 = df2.sort_values(by='id_end')
Out[354]:
id_end id_start industry
12 1 0 Agricultural Production Crops
0 9 1 Agriculture, Forestry and Fishing
1 14 10 Mining
2 17 15 Construction
3 19 18 not used
4 39 20 Manufacturing
5 49 40 Transportation, Communications, Electric, Gas ...
6 51 50 Wholesale Trade
7 59 52 Retail Trade
8 67 60 Finance, Insurance and Real Estate
9 89 70 Services
10 97 91 Public Administration
11 99 99 Nonclassifiable
#Map sic2 number to industry names
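# np.int has been removed from recent NumPy releases; the built-in int works everywhere.
# For each code, take the first range (sorted by id_end) whose id_end is >= the code.
df['industry'] = df['sic2'].astype(int).apply(lambda x: df2.loc[df2.id_end >= x, 'industry'].iloc[0])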
Out[352]:
conm sic2 industry
115466 ALLEGION PLC 34.0 Manufacturing
115471 AGILITY HEALTH INC 80.0 Services
115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, Electric, Gas ...
115474 AAD 54.0 Retail Trade
115477 DORIAN LPG LTD 44.0 Transportation, Communications, Electric, Gas ...
115484 NOMAD FOODS LTD 20.0 Manufacturing
115486 ATHENE HOLDING LTD 63.0 Finance, Insurance and Real Estate
115490 MIDATECH PHARMA PLC 28.0 Manufacturing
115495 MOTIF BIO PLC 28.0 Manufacturing
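For a large dataset, the row-by-row `apply` above can be replaced with a vectorized lookup. The following is a minimal sketch of the same "first range whose id_end is >= the code" idea using `np.searchsorted`, with an extra check against `id_start` so that codes falling in a gap or beyond the last range come back empty instead of being assigned to the next range. It assumes the ranges are sorted and non-overlapping, so the overlapping `0-1` row is left out. The names `lookup`, `pos`, `safe` and `valid` and the last two sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Mapping table sorted by id_end; ranges are inclusive and must not overlap.
lookup = pd.DataFrame({
    "id_start": [1, 10, 15, 18, 20, 40, 50, 52, 60, 70, 91, 99],
    "id_end":   [9, 14, 17, 19, 39, 49, 51, 59, 67, 89, 97, 99],
    "industry": [
        "Agriculture, Forestry and Fishing", "Mining", "Construction",
        "not used", "Manufacturing",
        "Transportation, Communications, Electric, Gas and Sanitary service",
        "Wholesale Trade", "Retail Trade",
        "Finance, Insurance and Real Estate", "Services",
        "Public Administration", "Nonclassifiable",
    ],
})

# The last two rows are invented to show a gap (68) and an out-of-range code (120).
df = pd.DataFrame({
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "NORDIC AMERICAN OFFSHORE",
             "GAP EXAMPLE", "OUT OF RANGE EXAMPLE"],
    "sic2": [34.0, 80.0, 44.0, 68.0, 120.0],
})

codes = df["sic2"].to_numpy()
starts = lookup["id_start"].to_numpy()
ends = lookup["id_end"].to_numpy()
names = lookup["industry"].to_numpy()

# Index of the first range whose id_end is >= the code.
pos = np.searchsorted(ends, codes, side="left")
# Clip so indexing stays in bounds for codes above the last id_end.
safe = np.minimum(pos, len(lookup) - 1)
# A match is valid only if the code also lies at or above that range's id_start.
valid = (pos < len(lookup)) & (starts[safe] <= codes)

df["industry"] = np.where(valid, names[safe], None)
print(df)
```

Because the search runs over the sorted `id_end` array once for the whole column, this avoids filtering the mapping table separately for every row.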