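I need guidance on how to take the following CSV input and use pandas to transform it into a new, updated CSV.
The goal is to clean up a CSV that has repeated single columns and repeated grouped columns, with dynamic support for any number of repeats. Not everyone has visited the same number of places!
Input CSV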
id name x y z x.1 y.1 z.1 state country state.1 country.1
0 1 a 1 2 3 0 9 9 NY USA PORTO PORTUGAL
1 1 b 4 5 6 9 9 0 NJ USA MADRID SPAIN
2 2 a 7 8 9 0 9 0 CT USA PARIS FRANCE
3 2 b 10 11 12 9 0 9 WY USA VENACE ITALY
New/Updated CSV
id name x y z visited_places
0 1 a [1,0] [2,9] [3,9] [{state: NY, country: USA}, {state: PORTO, country: PORTUGAL}]
1 1 b [4,9] [5,9] [6,0] [{state: NJ, country: USA}, {state: MADRID, country: SPAIN}]
2 2 a [7,0] [8,9] [9,0] [{state: CT, country: USA}, {state: PARIS, country: FRANCE}]
3 2 b [10,9] [11,0] [12,9] [{state: WY, country: USA}, {state: VENACE, country: ITALY}]
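I haven't found many examples that group multiple repeated columns in the right order (pair state and country into a single visited_places column, merge them, then group them into an array, which I will later convert from JSON to a struct).
I have tried combinations of lreshape, melt, and apply(lambda x: ','.join(x)), but I could not get the final result I want.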
# Tried combining columns by name; however, this won't cover state.1, country.1, state.2, country.2, and so on.
df['visited_places'] = df['state'].str.cat(df['country'], sep=', ')
# Tried combining with lreshape/melt; however, these functions don't keep state and country paired and in order (e.g. NJ, USA). The values end up jumbled.
df = pd.lreshape(df, {'visited_places': df.columns[df.columns.str.match(r'^state\.?\d?')].append(df.columns[df.columns.str.match(r'^country\.?\d?')])})
# Because of the above, I haven't reached the step where the rows are compressed to only 4 rows, with all the visited_places in an array as shown in the "New/Updated CSV" section.
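Expected
After melting, grouping, and aggregating, columns such as x, x.1, x.2 should yield an array of ints, and state, country, state.1, country.1, state.2, country.2 (the paired columns) should result in a single column holding an array.
Actual
I can't get to the point where the multiple single columns are melted, the paired columns are merged into one column, and the results are placed into arrays.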
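For reference, here is a minimal sketch of one way the expected result might be built without melt or lreshape, by collecting the suffixed columns row by row. The helper names `columns_for` and `collapse` are hypothetical (not pandas functions), and the sketch assumes the repeats follow the `name`, `name.1`, `name.2` pattern that pandas produces for duplicate headers, and that every `state.N` has a matching `country.N`.

```python
import re
import pandas as pd

# Sample frame mirroring the input CSV shown above.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "name": ["a", "b", "a", "b"],
    "x": [1, 4, 7, 10], "y": [2, 5, 8, 11], "z": [3, 6, 9, 12],
    "x.1": [0, 9, 0, 9], "y.1": [9, 9, 9, 0], "z.1": [9, 0, 0, 9],
    "state": ["NY", "NJ", "CT", "WY"], "country": ["USA"] * 4,
    "state.1": ["PORTO", "MADRID", "PARIS", "VENACE"],
    "country.1": ["PORTUGAL", "SPAIN", "FRANCE", "ITALY"],
})

def columns_for(frame, base):
    """Hypothetical helper: return base, base.1, base.2, ... ordered by suffix."""
    pattern = re.compile(rf"^{re.escape(base)}(?:\.(\d+))?$")
    found = []
    for col in frame.columns:
        m = pattern.match(col)
        if m:
            found.append((int(m.group(1) or 0), col))
    return [col for _, col in sorted(found)]

def collapse(frame, id_cols=("id", "name"), single=("x", "y", "z"),
             paired=("state", "country"), out_col="visited_places"):
    """Hypothetical helper: fold repeated columns into list-valued columns."""
    out = frame[list(id_cols)].copy()

    # Each repeated single column becomes one list per row; missing values are skipped.
    for base in single:
        cols = columns_for(frame, base)
        out[base] = [
            [v for v in row if pd.notna(v)]
            for row in frame[cols].itertuples(index=False, name=None)
        ]

    # Pair up (state, country), (state.1, country.1), ... by position.
    pair_groups = list(zip(*(columns_for(frame, base) for base in paired)))
    places = []
    for _, row in frame.iterrows():
        visits = []
        for group in pair_groups:
            record = {base: row[col] for base, col in zip(paired, group)}
            # Drop a pair only when every field in it is missing.
            if not all(pd.isna(v) for v in record.values()):
                visits.append(record)
        places.append(visits)
    out[out_col] = places
    return out

result = collapse(df)
print(result.to_string())
```

Because the columns are discovered by pattern, adding state.2/country.2 or x.2 to the input should not require code changes. If visited_places has to survive a round trip through CSV as JSON, applying `json.dumps` to that column before `to_csv` is one option.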