I have a pandas DataFrame created from the following objects:
df = pandas.DataFrame({"imdbPage": emptyWebPageSet,
"title": emptySetTitle,
"genre1": lst1,
"genre2": lst2,
"genre3": lst3,
"genre4": lst4,
"info":infoSet,
"Runtime(mins)":movieTime,
"releaseData":releaseDateSet,
"imdbRating":ratingSet,
"numberOfVotes":votesList,
"numberOfEpisodes":noOfEpisodesSet,
"TotalRunTime(mins)":totalRunTimeSet
})
df = pandas.get_dummies(data=df, columns=['genre1', 'genre2', 'genre3', 'genre4'])
The column headers in the output look like this:
output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
"numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation",
"genre1_Biography", "genre1_Comedy".... etc]
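What I want to do is remove every "genre1_", "genre2_", etc. prefix from the column names. The problem is that I don't know the exact column names or how many columns there are, only that they start with "genre1_", "genre2_", "genre3_" or "genre4_".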
Answer 0 (score: 1)
Use str.replace:
import pandas as pd
output = ["imdbPage", "title", "info", "Runtime(mins)", "releaseData", "imdbRating", "numberOfVotes",
"numberOfEpisodes", "genre1_Action", "genre1_Adventure", "genre1_Animation", "genre1_Biography",
"genre1_Comedy"]
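# regex=True is required on recent pandas versions, where the pattern is otherwise treated as a literal string
print(pd.Series(data=output).str.replace(r'^genre\d+_', '', regex=True))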
Output
0 imdbPage
1 title
2 info
3 Runtime(mins)
4 releaseData
5 imdbRating
6 numberOfVotes
7 numberOfEpisodes
8 Action
9 Adventure
10 Animation
11 Biography
12 Comedy
dtype: object
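The same replacement can be applied to the DataFrame's column labels directly instead of to a separate list. Below is a minimal sketch; the toy DataFrame is a made-up stand-in for the one in the question (only two genre columns, invented values), so treat it as an illustration rather than the asker's exact data.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre1": ["Action", "Comedy", "Action"],
    "genre2": ["Drama", "Drama", "Comedy"],
})
df = pd.get_dummies(data=df, columns=["genre1", "genre2"])

# Strip a leading "genre<digits>_" prefix from every column label.
# Labels that do not match the pattern (e.g. "title") are left unchanged.
df.columns = df.columns.str.replace(r"^genre\d+_", "", regex=True)

print(list(df.columns))
```

Be aware that after stripping, two columns can end up with the same name (here both genre1_Comedy and genre2_Comedy become "Comedy").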
Answer 1 (score: 0)
You can try the following (reference: Here):
import re

newcols = {}
for col in df.columns:
    # re.sub leaves names without a "genreN_" prefix unchanged;
    # re.match(...).group(2) would raise AttributeError on them
    newcols[col] = re.sub(r"^genre\d+_", "", col)
df.rename(columns=newcols, inplace=True)
print(df)
Or, more concisely:
df.rename(columns=lambda x: re.sub(r"^genre\d+_", "", x), inplace=True)
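One thing to watch for: once the prefixes are gone, the same genre coming from different source columns (say genre1 and genre2) produces duplicate column names. If that is not what you want, one possible approach is to skip the prefix when creating the dummies and then merge same-named columns. This is only a sketch under assumptions: the toy data is invented, and merging with max (a genre counts as present if any slot has it) is my choice, not something stated in the question.

```python
import pandas as pd

# Hypothetical toy data; column names follow the question, values are made up.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre1": ["Action", "Comedy", "Action"],
    "genre2": ["Drama", "Drama", "Comedy"],
})

# prefix="" and prefix_sep="" make get_dummies name each column after the bare value.
dummies = pd.get_dummies(df[["genre1", "genre2"]], prefix="", prefix_sep="").astype(int)

# Collapse columns that share a label: transpose, group rows by label, take the max.
merged = dummies.T.groupby(level=0).max().T

result = pd.concat([df[["title"]], merged], axis=1)
print(result)
```

This gives one indicator column per genre, regardless of which genreN column it originally came from.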