Question

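I have a DataFrame of users' favorite movies. The "Favorite Movie" column holds both the movie titles and the name of the production company. I want to split the production company out into a new "Production Company" column, but each row uses a different delimiter between the movie titles and the company name. See the example below: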

Small Sample of the DataFrame

Does anyone know of a library or an example I could use? I have already tried pandas.Series.str.extract and pandas.Series.str.split, but they did not work well.

Answer 1

pandas.Series.str.extract works well here when combined with a regular expression.

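The delimiters are by, -, and |, each with a space on either side.
The pattern actually produces three new columns; the second one is the delimiter, which is dropped.
The result is joined back to the original DataFrame, and the first and third columns from str.extract are renamed.
There is no need to keep the original column; it is kept here only for the purposes of the example.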
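To make the three-column behavior concrete, here is a minimal sketch that applies the same pattern with Python's standard re module instead of pandas. The names PATTERN, samples, and parsed are illustrative only; the sample strings are taken from the answer's data.

```python
import re

# Same pattern as in the answer: a greedy title, a space, one delimiter,
# a space, then the studio. Three capture groups -> three values per row.
PATTERN = re.compile(r"(.*)[ ](by|-|\|)[ ](.*)")

samples = [
    "The Dark Knight, Harry Potter and the Sorcerer's Stone, Joker | Warner Bros",
    "Spider-Man 2 by Columbia Pictures",
    "Spider-Man 2, Spider-Man 3, Venom - Columbia Pictures",
]

# Each match yields (title, delimiter, studio); the middle value is the
# delimiter that the pandas version drops.
parsed = [PATTERN.match(s).groups() for s in samples]
for title, delimiter, studio in parsed:
    print(f"{title!r} | {delimiter!r} | {studio!r}")
```

Because a space is required on both sides of the delimiter, the hyphens inside "Spider-Man" are not treated as separators. Because the first group is greedy, the split happens at the last delimiter in the string.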

Example using str.extract:

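import pandas as pd

# Sample data: "~" separates the User field from the Favorite Movie field
data = '''User~Favorite Movie
Allan Michel~The Dark Knight, Harry Potter and the Sorcerer's Stone, Joker | Warner Bros
Peter Smith~Spider-Man 2 by Columbia Pictures
George Moore~Spider-Man 2, Spider-Man 3, Venom - Columbia Pictures'''
# Split into rows and fields, stripping surrounding whitespace
da = [[i.strip() for i in l.split("~")] for l in data.split("\n")]
df = pd.DataFrame(da[1:], columns=da[0])
# Extract title, delimiter, and studio; drop the delimiter (column 1) and rename the rest
df = df.join(df["Favorite Movie"].str.extract(r"(.*)[ ](by|-|\|)[ ](.*)").drop([1], axis=1)
             .rename(columns={0: "Title", 2: "Studio"}))
print(df)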
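As a possible variation on the code above (not part of the original answer), the drop and rename steps can be avoided by using named capture groups for the columns and a non-capturing group for the delimiter. This is a sketch under the assumption that only the three delimiters shown occur; the names pattern and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["Allan Michel", "Peter Smith", "George Moore"],
    "Favorite Movie": [
        "The Dark Knight, Harry Potter and the Sorcerer's Stone, Joker | Warner Bros",
        "Spider-Man 2 by Columbia Pictures",
        "Spider-Man 2, Spider-Man 3, Venom - Columbia Pictures",
    ],
})

# (?P<name>...) names the resulting column directly;
# (?:...) matches the delimiter without creating a column for it.
pattern = r"(?P<Title>.*) (?:by|-|\|) (?P<Studio>.*)"
result = df.join(df["Favorite Movie"].str.extract(pattern))
print(result[["User", "Title", "Studio"]])
```

Rows that contain none of the delimiters would get NaN in both new columns, so it may be worth checking for missing values on real data.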

Parsing data with multiple delimiters in a DataFrame
