Question

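The regular expression I'm using is \d+-\d+, but I'm not sure how to separate out the Roman numerals or how to create new columns from them.

I have this dataset: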

Date_Title                        Date                       Copies
05-21 I. Don Quixote              1605                       252
21-20 IV. Macbeth                 1629                       987
10-12 ML. To Kill a Mockingbird   1960                       478
12 V. Invisible Man               1897                       136

Basically, I want to split Date_Title so that when I print a row I get:

('05-21 I', 'I', 'Don Quixote', 1605, 252)

or

('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)

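The first element is the number together with the Roman numeral, the second is the Roman numeral alone, the third is the name, and the fourth and fifth are the same as in the dataset.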

Answer 1

You can use

import pandas as pd
df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136

See the regex demo. Details:

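^ - start of the string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
- \d+(?:-\d+)? - one or more digits followed by an optional sequence of - and one or more digits
- \s* - zero or more whitespace characters
- (M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): for an explanation, see How do you match only valid roman numerals with a regular expression?
\. - a dot
\s* - zero or more whitespace characters
(.*) - Group 3 ("Name"): zero or more characters other than line breaks, as many as possible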
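
To see the three groups in isolation, here is a minimal sketch that applies the same pattern to single strings with Python's built-in re module. The names ROMAN and pattern are illustrative, and the pattern is only spread over several lines with re.VERBOSE for readability.

```python
import re

# Roman-numeral sub-pattern, copied from the answer's regex.
ROMAN = r"M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"

# Same overall pattern as in the answer, laid out with re.VERBOSE.
pattern = re.compile(
    r"""
    ^(                 # group 1: number part plus Roman numeral
        \d+(?:-\d+)?   # digits, optionally followed by a hyphen and digits
        \s*            # optional whitespace
        (""" + ROMAN + r""")  # group 2: the Roman numeral alone
    )
    \.\s*              # literal dot, then optional whitespace
    (.*)               # group 3: the rest of the string (the name)
    """,
    re.VERBOSE,
)

match = pattern.match("10-12 ML. To Kill a Mockingbird")
print(match.groups())
# ('10-12 ML', 'ML', 'To Kill a Mockingbird')
```

Rows without a hyphenated prefix, such as "12 V. Invisible Man", still match because the (?:-\d+)? part is optional.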

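Note that df.pop('Date_Title') removes the Date_Title column from df and returns it, so it serves as the input to the extract method. Because the new columns are appended at the end, df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is needed if you want to keep the original column order, with the new columns in front of Date and Copies.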
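
As a possible variation (not part of the original answer), the groups can be given names so that str.extract labels the resulting columns itself. This is a sketch under the assumption that the column names NumRoman, Roman and Name are wanted; the variable names roman, rx, parts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Title": ["05-21 I. Don Quixote", "21-20 IV. Macbeth",
                   "10-12 ML. To Kill a Mockingbird", "12 V. Invisible Man"],
    "Date": [1605, 1629, 1960, 1897],
    "Copies": [252, 987, 478, 136],
})

# Roman-numeral sub-pattern from the answer.
roman = r"M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})"

# Named groups: str.extract uses the group names as column names.
rx = (r"^(?P<NumRoman>\d+(?:-\d+)?\s*(?P<Roman>" + roman + r"))"
      r"\.\s*(?P<Name>.*)")

parts = df["Date_Title"].str.extract(rx)

# Put the extracted columns in front of the untouched ones.
result = pd.concat([parts, df[["Date", "Copies"]]], axis=1)
print(result)
```

This leaves the original df unchanged, which can be convenient when the raw Date_Title column is still needed afterwards.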

Answer 2

I'm pretty sure there is a more optimal solution, but this is a quick way to solve the problem:

df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))

Or:

# Split on whitespace; the column names Num, Roman and Name are illustrative
parts = df['Date_Title'].str.split()
df['Num'] = parts.str[0]
df['Roman'] = parts.str[1]
df['Name'] = parts.str[2:].str.join(' ')
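
A compact variant of the whitespace-splitting idea, offered as a sketch rather than as the answerer's method: str.split with n=2 and expand=True returns the three pieces as separate columns in one call. The column names Num, Roman and Name are assumptions, and the trailing dot is stripped from the Roman numeral afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Title": ["05-21 I. Don Quixote", "21-20 IV. Macbeth",
                   "10-12 ML. To Kill a Mockingbird", "12 V. Invisible Man"],
    "Date": [1605, 1629, 1960, 1897],
    "Copies": [252, 987, 478, 136],
})

# Split on whitespace at most twice, so the title stays in one piece.
parts = df["Date_Title"].str.split(n=2, expand=True)
parts.columns = ["Num", "Roman", "Name"]  # illustrative names

# The second piece still ends with the dot ("IV."), so strip it.
parts["Roman"] = parts["Roman"].str.rstrip(".")
print(parts)
```

Unlike the regex answer, this does not check that the second token is a valid Roman numeral; it only relies on the position of the whitespace.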

Answer 3

Focusing on string splitting:

string = "21-20 IV. Macbeth"
i = string.index(".")  # Index of the first dot
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:]  # Macbeth
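
The same steps can be wrapped in a small function and run over every title. This is a minimal sketch: split_date_title is a hypothetical helper name, and it assumes every string contains a dot with exactly two whitespace-separated tokens before it.

```python
def split_date_title(text):
    """Hypothetical helper: split 'NN-NN ROMAN. Title' into three parts."""
    i = text.index(".")            # raises ValueError if there is no dot
    _num, roman = text[:i].split() # e.g. "21-20" and "IV"
    title = text[i + 1:].strip()   # everything after the dot, trimmed
    return text[:i], roman, title

titles = ["05-21 I. Don Quixote", "21-20 IV. Macbeth",
          "10-12 ML. To Kill a Mockingbird", "12 V. Invisible Man"]
rows = [split_date_title(t) for t in titles]
print(rows[0])
# ('05-21 I', 'I', 'Don Quixote')
```

Using strip() on the remainder instead of the fixed offset i+2 tolerates zero or several spaces after the dot.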

Answer 4

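df = df.assign(
    x=df['Date_Title'].str.split(r'\.').str[0],                        # text before the first dot
    y=df['Date_Title'].str.extract(r'(\w+(?=\.))', expand=False),      # word directly before a dot
    z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','),         # everything after the first dot
)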

Splitting a column using a regular expression