I have data in a pandas DataFrame that looks like this:
Address1 listboro:"Manhattan" listprice:1000000 listzip:"10001"
Address2 listprice:950000 listzip:"11205" listboro:"Brooklyn"
I would like to create a new DataFrame that looks like this:
Address listboro listprice listzip
Address1 Manhattan 1000000 10001
Address2 Brooklyn 950000 11205
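There are two problems with the original DataFrame:
I would like to use the startswith method described here and the extraction method described here, but the fact that the data sits in inconsistent columns is getting in my way.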
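Answer 0 (score: 1)
I don't know of a simple way to rebuild the DataFrame without first sorting the values in each row of the pandas DataFrame. The approach: sort each row in numpy, build the sorted rows into a new DataFrame, and use Series.str.extract to pull out the data fields: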
# Example DataFrame
0 1 2 3
0 Address1 listboro:"Manhattan" listprice:1000000 listzip:"10001"
1 Address2 listprice:950000 listzip:"11205" listboro:"Brooklyn"
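# Copy values to a numpy array, sort each row, and re-build the DataFrame.
# The lexical sort lines the cells up because 'Address...' < 'listboro' <
# 'listprice' < 'listzip'; it assumes every row contains all three fields.
a = df.to_numpy(copy=True)
a.sort(axis=1)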
df = pd.DataFrame(a)
df
0 1 2 3
0 Address1 listboro:"Manhattan" listprice:1000000 listzip:"10001"
1 Address2 listboro:"Brooklyn" listprice:950000 listzip:"11205"
# Assign names to columns
df.columns = ['Address', 'listboro', 'listprice', 'listzip']
# Extract data fields
# Raw strings avoid invalid-escape warnings; expand=False returns a Series
df['listboro'] = df['listboro'].str.extract(r'"(.*)"', expand=False)
df['listprice'] = df['listprice'].str.extract(r':(.*)', expand=False).astype(int)
# Do not convert extracted ZIP codes from str to int, because
# some ZIP codes start with 0
df['listzip'] = df['listzip'].str.extract(r'"(.*)"', expand=False)
df
Address listboro listprice listzip
0 Address1 Manhattan 1000000 10001
1 Address2 Brooklyn 950000 11205
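The sorting trick only works when every row contains the same set of fields. As an alternative to the answer's method, here is a minimal sketch that parses each cell by its own key, so column order does not matter. The `raw` frame is a hypothetical reconstruction of the question's data, and `parse_row` is an illustrative helper name, not part of pandas:

```python
import pandas as pd

# Hypothetical reconstruction of the question's data: the key:value cells
# appear in a different order in each row.
raw = pd.DataFrame([
    ['Address1', 'listboro:"Manhattan"', 'listprice:1000000', 'listzip:"10001"'],
    ['Address2', 'listprice:950000', 'listzip:"11205"', 'listboro:"Brooklyn"'],
])

def parse_row(row):
    """Turn one row of 'key:value' cells into a dict keyed by field name."""
    record = {'Address': row.iloc[0]}
    for cell in row.iloc[1:]:
        # Split on the first colon only; strip the surrounding quotes, if any
        key, _, value = cell.partition(':')
        record[key] = value.strip('"')
    return record

tidy = pd.DataFrame([parse_row(row) for _, row in raw.iterrows()])
tidy['listprice'] = tidy['listprice'].astype(int)  # ZIP codes stay as strings
tidy = tidy[['Address', 'listboro', 'listprice', 'listzip']]
print(tidy)
```

Because each value is stored under the key found in its own cell, a row with a missing field would simply get NaN in that column instead of shifting the other values. The `astype(int)` call would then need adjusting.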