Question

Consider the following pandas DataFrame:

In [114]:

df['movie_title'].head()


Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I want to extract the movie titles with a regular expression, so let's use the following regex: \b([^\d\W]+)\b. I tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

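How can I extract specific features from text in a pandas DataFrame? More specifically, how can I extract the movie titles into a completely new DataFrame? For example, the desired output would be: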

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

Answer 1

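You can try str.extract with strip, but str.split is the better choice because movie names can also contain digits. A further option is str.replace with a regex that removes the parenthesized content, followed by strip to drop the leading and trailing whitespace: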

# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)

# note: this pattern also drops digits that are part of a movie name
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove the parenthesized part; regex=True is required in pandas 2.0+
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat

Answer 2

You should use () as shown below to define a capture group, so that only that specific part of the match is captured.

new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']

pandas.core.strings.StringMethods.extract

StringMethods.extract(pat, flags=0, **kwargs)

Find groups in each string using the passed regular expression.
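
As a minimal sketch of why the attempt in the question returns NaN (the sample rows are assumed, copied from the question's output): in an ordinary Python string literal, '\b' is the backspace character rather than a regex word boundary, so the pattern never matches. Writing the pattern as a raw string fixes that, although the question's pattern then captures only the first word; a capture group that stops at the opening parenthesis keeps the whole title.

```python
import pandas as pd

# Sample rows assumed from the question's output
df = pd.DataFrame({'movie_title': ['Toy Story (1995)',
                                   'GoldenEye (1995)',
                                   'Four Rooms (1995)']})

# Without the r prefix, '\b' is a backspace character, so nothing matches.
# (\\d and \\W are doubled here only to keep the same pattern text
# without triggering invalid-escape warnings.)
broken = df['movie_title'].str.extract('\b([^\\d\\W]+)\b', expand=False)

# Raw string: '\b' is now a word boundary, but only the first word is captured
first_word = df['movie_title'].str.extract(r'\b([^\d\W]+)\b', expand=False)

# Capture everything before " (" to get the full title
titles = df['movie_title'].str.extract(r'(.+?) \(', expand=False)

print(broken.isna().all())
print(first_word.tolist())
print(titles.tolist())
```

With expand=False and a single capture group, str.extract returns a Series, which can be assigned directly to a new column.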

Answer 3

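I want to extract the text after the "@" symbol and before the "." (period) symbol. I tried the following, and it more or less works because it finds the "@" symbol, but I don't want that symbol in the result: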

df['col'].astype(str).str.extract('(@.+.+)
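
One possible way to get this, sketched with hypothetical sample data (only the column name 'col' comes from the snippet above): place the "@" outside the capture group so it is matched but not returned, and stop the group at the first period.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'col' is from the post
df = pd.DataFrame({'col': ['alice@example.com',
                           'bob@mail.example.org',
                           'no-at-sign']})

# '@' is matched outside the group; [^.]+ captures up to the first '.'
df['after_at'] = df['col'].astype(str).str.extract(r'@([^.]+)\.', expand=False)

print(df)
```

Rows that contain no "@" followed by a period yield NaN in the new column.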

Answer 4

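Use a regular expression to find the year stored between parentheses. We include the parentheses in the pattern so that we don't conflict with movies that have a year in their title.

movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)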

Remove the parentheses:

movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

Remove the year from the 'title' column:

movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

Apply the strip function to remove any trailing whitespace characters left behind:

movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
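
The steps above could also be collapsed into a single pass. The following is a sketch under assumptions, not the answer's own method: the sample rows and the group names title and year are made up for illustration, and the pattern assumes the year is the last parenthesized item in the string.

```python
import pandas as pd

# Hypothetical sample rows, including a title that itself starts with a year
movies_df = pd.DataFrame({'title': ['Toy Story (1995)',
                                    '2001: A Space Odyssey (1968)',
                                    'Copycat (1995)']})

# Named groups become column names in the DataFrame returned by str.extract
parts = movies_df['title'].str.extract(
    r'^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$')

movies_df['year'] = parts['year']
# Keep the original title for rows where the pattern did not match
movies_df['title'] = parts['title'].fillna(movies_df['title']).str.strip()

print(movies_df)
```

Because the year must sit inside parentheses at the end of the string, a leading year such as the one in "2001: A Space Odyssey" stays in the title.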

How do I use regular expressions to extract specific content from a pandas DataFrame?

4 answers: