When I import a CSV file, it contains a single column that mixes states and cities, for example:
ALABAMA
NaN
Birmingham
Montgomery
Huntsville
NaN
CALIFORNIA
NaN
Los Angeles
San Diego
Fresno
NaN
My question is: how can I convert this into two hierarchical columns, so that it looks more like the following:
ALABAMA     Birmingham
            Montgomery
            Huntsville
CALIFORNIA  Los Angeles
            San Diego
            Fresno
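I tried creating an empty Series and filling it with the value from each row of the city column, so that I could add that Series as an extra column, but I couldn't get it to work.
My code: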
i = 0
numcol = []
for STATE in city_state_df['citystate']:
    if STATE == '':
        numcol.append(STATE_df['citystate'][i])
        i += 1
    elif STATE != '':
        numcol.append(STATE_df['citystate'][i])
        i += 1
numcol
Answer 0 (score: 1)
Read the data into a pandas DataFrame:
import pandas as pd
df = pd.read_csv('my_file.csv')
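Here I assume the column is named place.
Use groupby to group each state row (all uppercase letters) together with the rows that follow it, up to the next state. Then take the first place in each group, which is the state name, and assign it to a new column in the DataFrame: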
df['state'] = df.groupby(df.place.str.isupper().cumsum()).place.transform('first')
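To see why the cumulative sum works as a grouping key, here is a small sketch on a hand-built Series. The sample values come from the question; the `.eq(True)` step is an addition of mine (not part of the answer's one-liner) so that the NaN rows count as False instead of carrying NaN into the key.

```python
import numpy as np
import pandas as pd

place = pd.Series(['ALABAMA', np.nan, 'Birmingham', 'Montgomery',
                   'CALIFORNIA', np.nan, 'Fresno'])

# True only on all-uppercase rows; .eq(True) maps the NaN results to False
is_state = place.str.isupper().eq(True)

# Each True bumps the running total, so every row carries its state's ordinal
group_id = is_state.cumsum()
print(group_id.tolist())  # [1, 1, 1, 1, 2, 2, 2]
```

Every row between one state name and the next shares the same number, which is what lets groupby treat them as one block.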
Then drop the rows where place is null or place == state:
df[pd.notnull(df.place) & (df.place != df.state)]
Output:
          place       state
2    Birmingham     ALABAMA
3    Montgomery     ALABAMA
4    Huntsville     ALABAMA
8   Los Angeles  CALIFORNIA
9     San Diego  CALIFORNIA
10       Fresno  CALIFORNIA
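Since the question asks for a hierarchical layout, the steps above could be wrapped up roughly as follows. This is only a sketch: the helper name `split_state_city` is hypothetical, the sample frame is rebuilt by hand from the question instead of being read from the CSV, and it assumes state names are the only all-uppercase entries.

```python
import numpy as np
import pandas as pd

def split_state_city(df, col='place'):
    """Return an empty-bodied frame indexed by (state, city) pairs.

    Hypothetical helper; assumes states are the only all-uppercase rows.
    """
    is_state = df[col].str.isupper().eq(True)        # NaN rows count as False
    state = df[col].groupby(is_state.cumsum()).transform('first')
    keep = df[col].notna() & ~is_state               # city rows only
    out = pd.DataFrame({'state': state[keep], 'city': df.loc[keep, col]})
    return out.set_index(['state', 'city'])          # two-level MultiIndex

df = pd.DataFrame({'place': [
    'ALABAMA', np.nan, 'Birmingham', 'Montgomery', 'Huntsville', np.nan,
    'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', 'Fresno', np.nan,
]})
result = split_state_city(df)
print(result.index.tolist())
```

With the two columns moved into a MultiIndex, pandas prints each state once and lists its cities underneath, which is close to the layout shown in the question.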
Answer 1 (score: 0)
Another (possibly less Pythonic) solution could look like this:
import numpy as np
import pandas as pd
city_state_df = pd.DataFrame({'citystate': ['ALABAMA', np.nan, 'Birmingham', 'Huntsville', np.nan, 'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', np.nan]})
citystate
0 ALABAMA
1 NaN
2 Birmingham
3 Huntsville
4 NaN
5 CALIFORNIA
6 NaN
7 Los Angeles
8 San Diego
9 NaN
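Copy the column, then blank out the rows that are not uppercase in the first column and the uppercase rows in the second. Use ffill on the first column to carry each state name down, and drop the rows where the city is null. Finally, rename the columns: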
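# Work on a copy of the mixed column; NaN becomes '' so .isupper() can be called on every row
city_state_df['city'] = city_state_df['citystate']
city_state_df = city_state_df.fillna('')
# Keep only the state names in the first column and carry each one down
city_state_df['citystate'] = city_state_df['citystate'].apply(lambda x: x if x.isupper() else np.nan).ffill()
# Blank out the state names in the second column
city_state_df['city'] = city_state_df['city'].apply(lambda x: np.nan if x.isupper() else x)
# Turn the '' placeholders back into NaN and drop every row without a city
city_state_df = city_state_df.replace('', np.nan).dropna(subset=['city'])
city_state_df.columns = ['state', 'city']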
Output:
state city
2 ALABAMA Birmingham
3 ALABAMA Huntsville
7 CALIFORNIA Los Angeles
8 CALIFORNIA San Diego
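The same idea can probably be written without apply, using Series.where and Series.mask. The sketch below is a variant of mine, not the answer's code; it rebuilds the sample data by hand and again assumes that only state names are all uppercase.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ALABAMA', np.nan, 'Birmingham', 'Huntsville', np.nan,
               'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', np.nan],
              name='citystate')

# True on state rows only; NaN rows compare as False
is_state = s.str.isupper().eq(True)

tidy = pd.DataFrame({
    'state': s.where(is_state).ffill(),  # keep state names, carry them down
    'city': s.mask(is_state),            # replace state names with NaN
}).dropna(subset=['city'])               # drop state rows and blank rows
print(tidy)
```

This keeps the original row labels (2, 3, 7, 8), so the result lines up with the output shown above.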