Converting a single column from a CSV into hierarchical columns

Time: 2018-05-16 19:24:08

Tags: python pandas

When I import a CSV file, it contains a single column made up of states and cities, for example:

ALABAMA
NaN
Birmingham
Montgomery
Huntsville
NaN
CALIFORNIA
NaN
Los Angeles
San Diego
Fresno
NaN

My question is: how can I convert this into two hierarchical columns, so that it looks more like the following:

ALABAMA    Birmingham
           Montgomery
           Huntsville
CALIFORNIA Los Angeles
           San Diego
           Fresno

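I tried creating an empty series and filling it with a value for each row of the city column, so that the series could be added as an extra column, but I couldn't get it to work.

My code: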

i = 0
numcol = []
for STATE in city_state_df['citystate']:
    if STATE == '':        
        numcol.append(STATE_df['citystate'][i])
        i += 1
    elif STATE != '': 
        numcol.append(STATE_df['citystate'][i])
        i += 1
numcol

2 answers:

Answer 0: (score: 1)

Read the data into a pandas DataFrame:

import pandas as pd
df = pd.read_csv('my_file.csv')

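Here I assume the column is named place.

Use groupby to group each row that starts a state (all uppercase letters) together with the rows up to the next state, pick the first place of each group (the state itself), and assign it to a new column in the DataFrame: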

df['state'] = df.groupby(df.place.str.isupper().cumsum()).place.transform('first')
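
To see why the cumulative sum works as a grouping key, here is a small sketch on an in-memory copy of part of the sample data. The frame below and the names is_state and key are illustrative and not part of the answer; filling NaN with an empty string before calling isupper is an added precaution so that the mask contains only plain booleans.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the CSV column; 'place' is the assumed column name.
df = pd.DataFrame({'place': ['ALABAMA', np.nan, 'Birmingham', 'Montgomery',
                             'CALIFORNIA', np.nan, 'Los Angeles']})

# True only on rows whose text is entirely uppercase, i.e. the state rows.
is_state = df['place'].fillna('').str.isupper()

# Every True bumps the running total, so each row carries the number of its state block.
key = is_state.cumsum()

print(key.tolist())  # [1, 1, 1, 1, 2, 2, 2]
```

Note that this relies on state names being the only all-uppercase values in the column; a city written entirely in capitals would be treated as the start of a new block.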

Then drop the rows where place is null or place == state:

df[pd.notnull(df.place) & (df.place != df.state)]

Output:

          place       state
2    Birmingham     ALABAMA
3    Montgomery     ALABAMA
4    Huntsville     ALABAMA
8   Los Angeles  CALIFORNIA
9     San Diego  CALIFORNIA
10       Fresno  CALIFORNIA
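
Since the question asks for a hierarchical layout, one possible follow-up is to move both columns into a two-level index with set_index. This is a minimal sketch, assuming the same place column and using an in-memory frame in place of my_file.csv; the names is_state, cities and hier are illustrative, and the fillna('') step is an added precaution rather than part of the answer above.

```python
import numpy as np
import pandas as pd

# In-memory stand-in for the CSV from the question (assumed column name: 'place').
df = pd.DataFrame({'place': ['ALABAMA', np.nan, 'Birmingham', 'Montgomery', 'Huntsville', np.nan,
                             'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', 'Fresno', np.nan]})

# Label each row with the state that opens its block.
is_state = df['place'].fillna('').str.isupper()
df['state'] = df.groupby(is_state.cumsum())['place'].transform('first')

# Keep only the city rows, then use (state, place) as a two-level index.
cities = df[df['place'].notna() & (df['place'] != df['state'])]
hier = cities.set_index(['state', 'place'])

print(hier.index.tolist())
```

With the index in place, hier.loc['ALABAMA'] selects the rows for a single state.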

Answer 1: (score: 0)

Another (perhaps less Pythonic) solution could look like this:

city_state_df = pd.DataFrame({'citystate': ['ALABAMA', np.nan, 'Birmingham', 'Huntsville', np.nan, 'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', np.nan]})

     citystate
0      ALABAMA
1          NaN
2   Birmingham
3   Huntsville
4          NaN
5   CALIFORNIA
6          NaN
7  Los Angeles
8    San Diego
9          NaN

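Copy the column, then blank out the values that are not uppercase in the first column and the values that are uppercase in the second. Use ffill on the first column and drop the rows where the city is null. Finally, rename the columns.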

city_state_df['city'] = city_state_df['citystate']
# Temporarily turn NaN into '' so that x.isupper() can be called on every value
city_state_df = city_state_df.replace(np.nan, '', regex=True)
# First column: keep only the state names and carry them down
city_state_df['citystate'] = city_state_df['citystate'].apply(lambda x: x if x.isupper() else np.nan).ffill()
# Second column: blank out the state names
city_state_df['city'] = city_state_df['city'].apply(lambda x: np.nan if x.isupper() else x)
city_state_df = city_state_df.replace('', np.nan, regex=True).dropna(subset=['city'])
city_state_df.columns = ['state', 'city']

Output:

        state         city
2     ALABAMA   Birmingham
3     ALABAMA   Huntsville
7  CALIFORNIA  Los Angeles
8  CALIFORNIA    San Diego
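
The same idea can probably be written more compactly with Series.where, which avoids the round trip through empty strings. This is only a sketch under the same assumption (all-uppercase values are states), not something the answer's author proposed; the names col, is_state and result are illustrative.

```python
import numpy as np
import pandas as pd

city_state_df = pd.DataFrame({'citystate': ['ALABAMA', np.nan, 'Birmingham', 'Huntsville', np.nan,
                                            'CALIFORNIA', np.nan, 'Los Angeles', 'San Diego', np.nan]})

col = city_state_df['citystate']
# Boolean mask of the state rows; NaN is treated as "not a state".
is_state = col.fillna('').str.isupper()

result = pd.DataFrame({
    'state': col.where(is_state).ffill(),  # keep state names and carry them down
    'city': col.where(~is_state),          # keep everything else (cities and NaN)
}).dropna(subset=['city'])

print(result)
```

On the sample frame this keeps the rows with index 2, 3, 7 and 8, matching the output above.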