Question

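So, I have converted a PDF to a DataFrame and am almost at the final stage of getting it into the format I want. However, I am stuck on the next step. I have a column that looks like this: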

Column A
1234[321]
321[3]
123
456[456]

and I want to split it into two separate columns, B and C, so that:

Column B          Column C
1234              321
321               3
123               0
456               456

How can I achieve this? I did try something like

df.Column A.str.strip(r"\[\d+\]")

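but I could not get it to work, even after trying different variations. Any help would be greatly appreciated, as this is the last part of this task. Thanks a lot in advance!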

Answer 1

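An alternative approach could be:

import pandas as pd

# Create the two new columns by splitting on the opening bracket
df[["Column B", "Column C"]] = df["Column A"].str.split('[', expand=True)
# Get rid of the trailing bracket (regex=False treats "]" as a literal character)
df["Column C"] = df["Column C"].str.replace("]", "", regex=False)
# Replace the missing values with 0 and drop the column that is no longer needed
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)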
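
The steps above can also be packaged as a small function. The following is only a sketch, not the answer's own method: it swaps `str.split` for `str.partition`, which always returns three columns (before, separator, after) even when no row contains a bracket. The helper name `split_bracketed` is hypothetical, and the sample data is taken from the question.

```python
import pandas as pd

def split_bracketed(series):
    """Hypothetical helper: turn 'N[M]' or 'N' strings into two integer columns."""
    # partition always yields three columns: before, separator, after
    parts = series.str.partition("[")
    main = pd.to_numeric(parts[0])
    # 'after' is 'M]' when a bracket is present and '' otherwise;
    # strip the closing bracket and treat the empty string as 0
    extra = pd.to_numeric(parts[2].str.rstrip("]").replace("", "0"))
    return pd.DataFrame({"Column B": main, "Column C": extra})

df = pd.DataFrame({"Column A": ["1234[321]", "321[3]", "123", "456[456]"]})
result = split_bracketed(df["Column A"])
print(result)
```

This sketch assumes every value is a string of digits, optionally followed by one bracketed run of digits; anything else would make `pd.to_numeric` raise.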

Answer 2

You can use

import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# Fill the rows without a bracketed part with 0 (avoids chained assignment)
df['Column C'] = df['Column C'].fillna(0)
df
#    Column A Column B Column C
# 0  1234[321]     1234      321
# 1     321[3]      321        3
# 2        123      123        0
# 3   456[456]      456      456

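See the regex demo. Details:

^ - start of the string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group that matches [, then captures one or more digits into Group 2, then matches ]
$ - end of the string.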
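
To see how the two groups behave outside pandas, here is a minimal sketch using Python's built-in `re` module with the same pattern. The function name `split_value` and the choice to return `None` for non-matching input are assumptions made for illustration.

```python
import re

# Same pattern as in the answer: digits, optionally followed by [digits]
PATTERN = re.compile(r'^(\d+)(?:\[(\d+)])?$')

def split_value(value):
    """Hypothetical helper: return (main, bracketed) as ints, or None if no match."""
    match = PATTERN.match(value)
    if match is None:
        return None
    main, bracketed = match.groups()
    # Group 2 is None when the optional [..] part is absent
    return (int(main), int(bracketed) if bracketed is not None else 0)

for text in ['1234[321]', '321[3]', '123', '456[456]']:
    print(text, split_value(text))
```

Because the bracketed part is optional, Group 2 comes back as `None` for a value such as `123`, which corresponds to the `NaN` that `str.extract` produces and that is then replaced with 0.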

Split a column based on a regex | pandas
