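String split while looping through a DataFrame

Time: 2014-07-20 18:06:47

Tags: python string pandas split dataframe

I am trying to use Python to loop through a DataFrame column whose values are formatted like this: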

Town 1, AL, USA
Town 2, AL, USA
Town 3, AK, USA
Town 4, CA, USA
Town 5, DE, USA
Town 6, MI, USA

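I have been trying to use the split() method, both on the original DataFrame (which also includes crime description and URL columns) and on the column by itself, as a DataFrame and as a Series object. None of these objects has a split() method available.

The desired output is another column holding the STATE abbreviation. As I understand it, I am looking for the equivalent of df.split(', ') and then taking the second element (index [1]) of each result to append to the Series or DataFrame. (Please correct me if I am wrong.)

How would I go about this?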

5 answers:

Answer 0 (score: 7)

You can use the vectorized string methods, e.g. df["col"].str.split(", ").str[1]:

>>> df
               col
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA
>>> df["col"].str.split(", ")
0    [Town 1, AL, USA]
1    [Town 2, AL, USA]
2    [Town 3, AK, USA]
3    [Town 4, CA, USA]
4    [Town 5, DE, USA]
5    [Town 6, MI, USA]
Name: col, dtype: object
>>> df["col"].str.split(", ").str[1]
0    AL
1    AL
2    AK
3    CA
4    DE
5    MI
Name: col, dtype: object
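To get the STATE column the question asks for, the result can be assigned back to the DataFrame. This is a minimal sketch, assuming the column is named col as in the session above; the new column name state is an illustrative choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({"col": ["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA"]})

# Split each string on ", " and keep the second piece (index 1) of every list.
df["state"] = df["col"].str.split(", ").str[1]

print(df["state"].tolist())
```

The separator here includes the space after the comma, so the extracted values carry no leading blank.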

Answer 1 (score: 3)

Use .apply() to run a function on each element of the column:

import pandas as pd

data=[
    'Town 1, AL, USA',
    'Town 2, AL, USA',
    'Town 3, AK, USA',
    'Town 4, CA, USA',
    'Town 5, DE, USA',
    'Town 6, MI, USA',
]

df = pd.DataFrame( data )

print(df)

# split on ', ' (comma plus space) so the state has no leading blank
df['state'] = df[0].apply(lambda x: x.split(', ')[1])

print(df)

Result:

                 0
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

                 0 state
0  Town 1, AL, USA    AL
1  Town 2, AL, USA    AL
2  Town 3, AK, USA    AK
3  Town 4, CA, USA    CA
4  Town 5, DE, USA    DE
5  Town 6, MI, USA    MI

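Edit

BTW: I searched the internet for "pandas split column to new columns", and it turns out you can even split the column into 3 new columns this way:

def split_more(x):
    # one Series element per piece; these become the new columns
    return pd.Series(x.split(', '))

df[['town', 'state', 'country']] = df[0].apply(split_more)

print(df)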

Result:

                 0    town state country
0  Town 1, AL, USA  Town 1    AL     USA
1  Town 2, AL, USA  Town 2    AL     USA
2  Town 3, AK, USA  Town 3    AK     USA
3  Town 4, CA, USA  Town 4    CA     USA
4  Town 5, DE, USA  Town 5    DE     USA
5  Town 6, MI, USA  Town 6    MI     USA
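The same three-column split can probably be done without apply in recent pandas versions, using the expand=True option of str.split. This is a sketch that assumes the single unnamed column 0 from the example above and reuses the column names town, state and country.

```python
import pandas as pd

df = pd.DataFrame(["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA"])

# expand=True returns a DataFrame with one column per piece
# instead of a Series of lists.
parts = df[0].str.split(", ", expand=True)
parts.columns = ["town", "state", "country"]

# Attach the new columns next to the original one.
df = df.join(parts)
print(df)
```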

Answer 2 (score: 2)

Series have string methods that are accessible through the str attribute. For example, you can use df['addr'].str.extract:

In [34]: df = pd.read_table('data', sep='-', header=None, names=['addr'])

In [35]: df
Out[35]: 
              addr
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

In [36]: df[['Town', 'State', 'Country']] = df['addr'].str.extract(r'([^,]+),\s*([^,]+),\s*([^,]+)')

In [38]: del df['addr']

yields

In [39]: df
Out[39]: 
     Town State Country
0  Town 1    AL     USA
1  Town 2    AL     USA
2  Town 3    AK     USA
3  Town 4    CA     USA
4  Town 5    DE     USA
5  Town 6    MI     USA
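A variation on the extract approach is to name the regex groups, in which case pandas uses the group names as column names. This is a sketch on a small hand-built frame rather than the file read above; the pattern is an assumption about the input shape (three comma-separated fields).

```python
import pandas as pd

df = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 3, AK, USA"]})

# Named groups become the column names of the result;
# \s* swallows the blank after each comma so values carry no leading space.
pattern = r"(?P<Town>[^,]+),\s*(?P<State>[^,]+),\s*(?P<Country>[^,]+)"
parts = df["addr"].str.extract(pattern)
print(parts)
```

Rows that do not match the pattern come back as missing values rather than raising an error, which may or may not be what you want.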

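Answer 3 (score: 0)

Based on comparing the different approaches with %timeit, I found that a list comprehension is usually the winner when working with strings in a column.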

In [1]: %paste 
import pandas as pd

data=[
    'Town 1, AL, USA',
    'Town 2, AL, USA',
    'Town 3, AK, USA',
    'Town 4, CA, USA',
    'Town 5, DE, USA',
    'Town 6, MI, USA',
]

df = pd.DataFrame(data)
df

## -- End pasted text --
Out[1]: 
                 0
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

%timeit tests:

In [2]: %timeit df['state'] = [x.split(',')[1] for x in df[0]]
1000 loops, best of 3: 350 µs per loop

In [3]: %timeit df['state'] = df[0].apply(lambda x: x.split(',')[1])
1000 loops, best of 3: 671 µs per loop

In [4]: %timeit df['state'] = df[0].str.split(", ").str[1]
100 loops, best of 3: 1.1 ms per loop
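Timings on six rows mostly measure fixed overhead, so it may be worth repeating the comparison on a larger column. The sketch below uses the standard timeit module outside IPython. The row count of 10,000 is an arbitrary assumption, and the numbers will vary with machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical larger input: 10,000 rows in the same "Town, ST, USA" shape.
df = pd.DataFrame(["Town %d, AL, USA" % i for i in range(10_000)])

def with_list_comprehension():
    return [x.split(", ")[1] for x in df[0]]

def with_apply():
    return df[0].apply(lambda x: x.split(", ")[1])

def with_str_methods():
    return df[0].str.split(", ").str[1]

# All three approaches should agree before their speed is compared.
assert with_list_comprehension() == with_apply().tolist() == with_str_methods().tolist()

for func in (with_list_comprehension, with_apply, with_str_methods):
    seconds = timeit.timeit(func, number=10)
    print("%s: %.4f s for 10 runs" % (func.__name__, seconds))
```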

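Answer 4 (score: 0)

The function split_str_columns_df loops over the given columns to split all string columns in one go.

It also creates the new columns from the split and drops the old ones.

You choose the separator: " ", "," or ...

Just put it into this line of the function definition shown below: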

new = df[col].str.split(" ", n = 1, expand = True) 

Or, if you want ", " as the separator and a split into 3 columns (n = 2), the function has to be adjusted slightly to add the third column:

new = df[col].str.split(", ", n = 2, expand = True) 
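One possible adjustment for the three-column case is sketched below. The function name split_str_columns_df_n, its parameters sep and n_parts, and the numeric suffixes _0, _1, ... are hypothetical and do not come from this answer. Unlike the original function, it works on a copy of the input.

```python
import pandas as pd

def split_str_columns_df_n(dataframe, str_columns, sep=", ", n_parts=3):
    """Split each listed string column into n_parts new columns and drop the original."""
    df = dataframe.copy()  # work on a copy so the caller's frame is untouched
    for col in str_columns:
        # n = n_parts - 1 splits at most that many times, giving up to n_parts pieces.
        new = df[col].str.split(sep, n=n_parts - 1, expand=True)
        # Pad with empty columns if no row produced all n_parts pieces.
        new = new.reindex(columns=range(n_parts))
        for i in range(n_parts):
            df["%s_%d" % (col, i)] = new[i]
        df = df.drop(columns=[col])
    return df

towns = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 3, AK, USA"]})
result = split_str_columns_df_n(towns, ["addr"])
print(result)
```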

Sample data (the full sample data is at the end of this post):

data_df.head(3)

    Rating          Score    Ocupation
0   RATINGSTUFE F   NaN      Animator Senior
1   RATINGSTUFE B   4.0      Animator
2   NaN             7.0      Art administrator

Calling the function: split_str_columns_df(data_df, columns)

The columns I want to split are 'Rating' and 'Ocupation':

columns=['Rating','Ocupation']
dff=split_str_columns_df(data_df,columns)

Output:

   Score     Rating_a Rating_b Ocupation_a    Ocupation_b
0    NaN  RATINGSTUFE        F    Animator         Senior
1    4.0  RATINGSTUFE        B    Animator           None
2    7.0          NaN      NaN         Art  administrator

The function definition I used is:

def split_str_columns_df(dataframe, str_columns):
    '''Split each string column on " " into 2 new columns and remove the original.

    If the column's name is 'Name', the 2 new columns will be 'Name_a' and 'Name_b'.
    '''
    # Note: this is the same object as the input, so the caller's DataFrame is modified in place.
    df = dataframe
    for col in str_columns:
        new_col1 = col + '_a'
        new_col2 = col + '_b'

        # Split once on the first blank; expand=True returns a DataFrame with columns 0 and 1
        new = df[col].str.split(" ", n=1, expand=True)
        # Part before the first blank
        df[new_col1] = new[0]
        # Remainder after the first blank (None when there is no blank)
        df[new_col2] = new[1]

        # Drop the original column
        df.drop(columns=[col], inplace=True)
    return df

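Note!

  1. When a NaN value is split, both new columns get NaN (columns Rating_a and Rating_b).

  2. If a row contains only one word, the second column gets None after the split (column Ocupation_b).

  3. Be aware that the original columns Rating and Ocupation have been dropped; we now have Rating_a and Rating_b, as well as Ocupation_a and Ocupation_b.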
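The first two notes can be checked on a tiny Series. This is a sketch with made-up values in the same shape as the sample data.

```python
import numpy as np
import pandas as pd

s = pd.Series(["RATINGSTUFE F", np.nan, "Animator"])

# At most one split on a blank, spread over two columns.
new = s.str.split(" ", n=1, expand=True)
print(new)

# Row 1 was NaN, so both pieces are missing;
# row 2 had only one word, so only the second piece is missing.
print(new.isna())
```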

Generating the sample data:

import numpy as np
import pandas as pd

data_df=pd.DataFrame(['RATINGSTUFE F', 'RATINGSTUFE B',np.nan, 'RATINGSTUFE L',
   'RATINGSTUFE G', np.nan, 'RATINGSTUFE M', 'RATINGSTUFE L',
   'RATINGSTUFE F', 'RATINGSTUFE M'], columns=['Rating'])

data_df['Score']=[np.nan,4,7,4,9,4,3,1,2,5]
data_df['Ocupation']=['Animator Senior', 'Animator', 'Art administrator', 'Animator Junior', 'Dancer', 'Colorist Junior', 'Ceramics artist', 'Chief creative officer','Colorist', 'Dancer']