Question

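I'm trying to set a column in a dataframe to a substring of another column, based on a regex condition. One column holds a title that sometimes includes a year, e.g. "Temp (2019)" or just "Temp". I need to extract the year from the title (if there is one) and then remove the year from the original string. So instead of one column containing "Temp (2019)", I would have two columns: one with "Temp" and one with "2019". If the title has no year, the year column should be 0.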

regex = r"\(\d{4}\)$"
tempYear = df['title'].str[-5:-1]
df['year'] = np.where(re.search(regex, df['title']) != None, df['title'].str[-5:-1], "0")

Now, when I run this, I get this error:

Exception has occurred: TypeError
expected string or bytes-like object
  File "[path]", line 63, in <module>
    df['year'] = np.where(re.search(regex, df['title']) != None, df['title'].str[-5:-1], "0")

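I think this is because of the first condition (the value used when the test is true), since it is a list (I think) rather than a single word. In other words, the if statement has multiple types. I'm not sure how to extract the year only when the title actually contains one.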

A title that has a year will always be in the format "[word] ([year])", ending with the year in parentheses. I could easily do

df['year'] = df['title'].str[-5:-1]

but that causes problems when there is no year.

Answer 1

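In pandas, the .str accessor provides regex handling, whereas the standard library re module cannot handle a pandas Series or a numpy array. re.search expects a single string, which is why passing df['title'] to it raises the TypeError.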
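To illustrate the difference, here is a minimal sketch. The sample titles are made up for the example and are not the asker's data.

```python
import re
import pandas as pd

titles = pd.Series(["Temp (2019)", "Temp"])  # hypothetical sample data
regex = r"\(\d{4}\)$"

# re.search works on one string at a time, so a whole Series raises TypeError.
try:
    re.search(regex, titles)
    raised = False
except TypeError:
    raised = True

# The .str accessor applies the pattern to every element and returns a boolean Series.
mask = titles.str.contains(regex)
print(raised, mask.tolist())
```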

So you can easily get what you want with pandas functions:

df['year'] = np.where(df.title.str.contains(regex), df['title'].str[-5:-1], "0")
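As a self-contained sketch of the line above, the following runs the same np.where call on a small made-up dataframe. The last step, which strips the year from the title with str.replace, is not part of the answer; it is one assumed way to cover the second half of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
df = pd.DataFrame({"title": ["Temp (2019)", "Temp", "Wind (2020)"]})

regex = r"\(\d{4}\)$"

# True where the title ends in a parenthesised four-digit year.
has_year = df["title"].str.contains(regex)

# Same np.where call as in the answer: take the year where present, else "0".
df["year"] = np.where(has_year, df["title"].str[-5:-1], "0")

# Assumed extra step: remove the trailing "(yyyy)" and the whitespace before it.
df["title"] = df["title"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)

print(df)
```

Note that the year column holds strings here ("2019", "0"), matching the "0" fallback in the answer; converting with astype(int) afterwards is an option if numbers are needed.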
