I am trying to loop over a DataFrame column in Python whose values are formatted like this:
Town 1, AL, USA
Town 2, AL, USA
Town 3, AK, USA
Town 4, CA, USA
Town 5, DE, USA
Town 6, MI, USA
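I have been trying to use the split() method, both on the original DataFrame (which also includes crime-description and URL columns) and on the column by itself, as a DataFrame and as a Series object. None of these objects has a split() method.
The desired output is another column holding the STATE abbreviation, so as I understand it I am looking for the equivalent of df.split(', '), then taking the second element (index [1]) and appending it to the Series or DataFrame. (Please correct me if I have this wrong.)
How would I do this?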
Answer 0 (score: 7)
You can use the vectorized string methods, e.g. df["col"].str.split(", ").str[1]:
>>> df
col
0 Town 1, AL, USA
1 Town 2, AL, USA
2 Town 3, AK, USA
3 Town 4, CA, USA
4 Town 5, DE, USA
5 Town 6, MI, USA
>>> df["col"].str.split(", ")
0 [Town 1, AL, USA]
1 [Town 2, AL, USA]
2 [Town 3, AK, USA]
3 [Town 4, CA, USA]
4 [Town 5, DE, USA]
5 [Town 6, MI, USA]
Name: col, dtype: object
>>> df["col"].str.split(", ").str[1]
0 AL
1 AL
2 AK
3 CA
4 DE
5 MI
Name: col, dtype: object
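Since the question asks for a new column, here is a minimal sketch of assigning that result back to the frame. The column name "state" and the three sample rows are only illustrative assumptions.

```python
import pandas as pd

# Small sample frame; "col" matches the column name used in the session above
df = pd.DataFrame({"col": ["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA"]})

# Split each string on ", " and keep the second piece (index 1) as a new column
df["state"] = df["col"].str.split(", ").str[1]

print(df)
```

The .str accessor applies the operation to every element, so no explicit loop is needed.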
Answer 1 (score: 3)
Use .apply() on the column:
import pandas as pd
data=[
'Town 1, AL, USA',
'Town 2, AL, USA',
'Town 3, AK, USA',
'Town 4, CA, USA',
'Town 5, DE, USA',
'Town 6, MI, USA',
]
df = pd.DataFrame( data )
print(df)
# Split on ", " (comma plus space) so the state has no leading space
df['state'] = df[0].apply(lambda x: x.split(', ')[1])
print(df)
Result:
0
0 Town 1, AL, USA
1 Town 2, AL, USA
2 Town 3, AK, USA
3 Town 4, CA, USA
4 Town 5, DE, USA
5 Town 6, MI, USA
0 state
0 Town 1, AL, USA AL
1 Town 2, AL, USA AL
2 Town 3, AK, USA AK
3 Town 4, CA, USA CA
4 Town 5, DE, USA DE
5 Town 6, MI, USA MI
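One caveat with .apply(): a plain lambda calling x.split() raises an AttributeError on a missing value, because NaN is a float. A minimal sketch of a more defensive version follows. The helper name state_or_nan and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data with a missing value and a row that has no state field
df = pd.DataFrame(["Town 1, AL, USA", np.nan, "Town 3"])

def state_or_nan(value):
    """Return the second comma-separated field, or NaN if it is not available."""
    if not isinstance(value, str):
        return np.nan  # NaN and other non-strings have no .split()
    parts = value.split(", ")
    return parts[1] if len(parts) > 1 else np.nan

df["state"] = df[0].apply(state_or_nan)
print(df)
```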
EDIT
BTW: I searched the internet for "pandas split column to new columns", and you can even split the value into 3 new columns this way:
def split_more(x):
    # Return the pieces as a Series so each one becomes its own column
    return pd.Series(x.split(', '))
df[['town', 'state', 'country']] = df[0].apply(split_more)
print(df)
Result:
                 0    town state country
0  Town 1, AL, USA  Town 1    AL     USA
1  Town 2, AL, USA  Town 2    AL     USA
2  Town 3, AK, USA  Town 3    AK     USA
3  Town 4, CA, USA  Town 4    CA     USA
4  Town 5, DE, USA  Town 5    DE     USA
5  Town 6, MI, USA  Town 6    MI     USA
Answer 2 (score: 2)
Series have string methods, accessible via the str attribute. For example, you could use df['addr'].str.extract:
In [34]: df = pd.read_table('data', sep='-', header=None, names=['addr'])
In [35]: df
Out[35]:
addr
0 Town 1, AL, USA
1 Town 2, AL, USA
2 Town 3, AK, USA
3 Town 4, CA, USA
4 Town 5, DE, USA
5 Town 6, MI, USA
In [36]: df[['Town', 'State', 'Country']] = df['addr'].str.extract(r'([^,]+),\s*([^,]+),\s*([^,]+)')
In [38]: del df['addr']
which yields:
In [39]: df
Out[39]:
Town State Country
0 Town 1 AL USA
1 Town 2 AL USA
2 Town 3 AK USA
3 Town 4 CA USA
4 Town 5 DE USA
5 Town 6 MI USA
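A variation on the same idea, as a rough sketch: with named groups in the regular expression, str.extract labels the resulting columns itself. The \s* after each comma is my own addition to swallow the space that follows it, and the sample rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA"]})

# Each named group (?P<name>...) becomes a column with that name
pattern = r"(?P<Town>[^,]+),\s*(?P<State>[^,]+),\s*(?P<Country>[^,]+)"
parts = df["addr"].str.extract(pattern)

print(parts)
```

Rows that do not match the pattern come back as NaN in every extracted column.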
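Answer 3 (score: 0)
Having compared the different approaches with %timeit, I found that list comprehensions are usually the winner when working with strings in a column.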
In [1]: %paste
import pandas as pd
data=[
'Town 1, AL, USA',
'Town 2, AL, USA',
'Town 3, AK, USA',
'Town 4, CA, USA',
'Town 5, DE, USA',
'Town 6, MI, USA',
]
df = pd.DataFrame(data)
df
## -- End pasted text --
Out[1]:
0
0 Town 1, AL, USA
1 Town 2, AL, USA
2 Town 3, AK, USA
3 Town 4, CA, USA
4 Town 5, DE, USA
5 Town 6, MI, USA
%timeit tests:
In [2]: %timeit df['state'] = [x.split(',')[1] for x in df[0]]
1000 loops, best of 3: 350 µs per loop
In [3]: %timeit df['state'] = df[0].apply(lambda x: x.split(',')[1])
1000 loops, best of 3: 671 µs per loop
In [4]: %timeit df['state'] = df[0].str.split(", ").str[1]
100 loops, best of 3: 1.1 ms per loop
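The timings above come from a 6-row frame, where fixed overhead dominates. If you want to repeat the comparison outside IPython on more rows, something like the following sketch with the standard timeit module should work. The row count, the function names and the repeat count are arbitrary choices of mine, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Larger synthetic column so per-call overhead matters less
rows = ["Town %d, AL, USA" % i for i in range(2000)]
df = pd.DataFrame({"addr": rows})

def with_list_comprehension():
    return [x.split(", ")[1] for x in df["addr"]]

def with_apply():
    return df["addr"].apply(lambda x: x.split(", ")[1])

def with_str_accessor():
    return df["addr"].str.split(", ").str[1]

repeats = 5
for func in (with_list_comprehension, with_apply, with_str_accessor):
    seconds = timeit.timeit(func, number=repeats)
    print(f"{func.__name__}: {seconds / repeats * 1e3:.2f} ms per call")
```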
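Answer 4 (score: 0)
The function split_str_columns_df loops over the given columns and splits all the string columns in one pass. The separator can be " ", "," or anything else; just set it in this line of the function definition (given in full further down):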
new = df[col].str.split(" ", n = 1, expand = True)
Or, if you want to split on ", " into 3 columns (n = 2), the function needs some adjustments to incorporate the third column:
new = df[col].str.split(", ", n = 2, expand = True)
The sample data, data_df.head(3):
          Rating  Score          Ocupation
0  RATINGSTUFE F    NaN    Animator Senior
1  RATINGSTUFE B    4.0           Animator
2            NaN    7.0  Art administrator
I call split_str_columns_df(data_df, columns), where the columns I want to split are 'Rating' and 'Ocupation':
columns=['Rating','Ocupation']
dff=split_str_columns_df(data_df,columns)
Output:
   Score     Rating_a Rating_b Ocupation_a    Ocupation_b
0    NaN  RATINGSTUFE        F    Animator         Senior
1    4.0  RATINGSTUFE        B    Animator           None
2    7.0          NaN      NaN         Art  administrator
The function definition I used for split_str_columns_df is:
def split_str_columns_df(dataframe, str_columns):
    '''Split each string column in str_columns on the first " ".

    Creates 2 new columns and removes the original. If the column's name is
    'Name', the 2 new columns will be 'Name_a' and 'Name_b'.'''
    # Work on a copy so the caller's data frame is not modified
    df = dataframe.copy()
    for col in str_columns:
        new_col1 = col + '_a'
        new_col2 = col + '_b'
        # Split on the first separator only (n=1), one result column per piece
        new = df[col].str.split(" ", n=1, expand=True)
        # Text before the separator
        df[new_col1] = new[0]
        # Text after the separator (None if the value has no separator)
        df[new_col2] = new[1]
        # Drop the original column
        df.drop(columns=[col], inplace=True)
    return df
Note:
When a NaN value is split, both new columns get NaN (columns Rating_a and Rating_b).
If a row contains only one word, the second column gets None after the split (column Ocupation_b).
Note also that the original columns Rating and Ocupation have been removed; the result has Rating_a and Rating_b, as well as Ocupation_a and Ocupation_b.
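For the 3-column case mentioned earlier, one possible adjustment is sketched below. It is not the function from this answer: split_str_column, its sep and parts parameters, and the numeric suffixes (_0, _1, _2 instead of _a, _b) are all assumptions of mine for illustration.

```python
import numpy as np
import pandas as pd

def split_str_column(dataframe, col, sep=", ", parts=3):
    """Split one string column into `parts` new columns named col_0, col_1, ..."""
    df = dataframe.copy()
    # n = parts - 1 splits at most that many times, giving at most `parts` pieces
    new = df[col].str.split(sep, n=parts - 1, expand=True)
    # Make sure every expected column exists, even if no row has enough pieces
    new = new.reindex(columns=list(range(parts)))
    new.columns = [f"{col}_{i}" for i in range(parts)]
    # Replace the original column with the new ones
    return pd.concat([df.drop(columns=[col]), new], axis=1)

towns = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 2, AK", np.nan]})
result = split_str_column(towns, "addr")
print(result)
```

As with the two-column version, a NaN input gives missing values in all new columns, and a row with too few pieces gets missing values in the trailing ones.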
Generating the sample data:
import numpy as np
import pandas as pd
data_df = pd.DataFrame(['RATINGSTUFE F', 'RATINGSTUFE B', np.nan, 'RATINGSTUFE L',
                        'RATINGSTUFE G', np.nan, 'RATINGSTUFE M', 'RATINGSTUFE L',
                        'RATINGSTUFE F', 'RATINGSTUFE M'], columns=['Rating'])
data_df['Score'] = [np.nan, 4, 7, 4, 9, 4, 3, 1, 2, 5]
data_df['Ocupation'] = ['Animator Senior', 'Animator', 'Art administrator', 'Animator Junior',
                        'Dancer', 'Colorist Junior', 'Ceramics artist',
                        'Chief creative officer', 'Colorist', 'Dancer']