I have a dataframe with 5 nested rows (all containing data like the following):
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94
What I want is to split this into new rows, not columns.
I have tried things like:
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True)
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).melt()
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).stack().to_frame()
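The regex splits at each new rank (e.g. 2The, 3Get, 4The). I just want the split to create new rows rather than columns. The regex needs some work, but I'm happy to do that myself.
I can melt the dataframe to create rows, but the cleanup then becomes very time-consuming (I'm happy to go down that road if there is no other way).
Stacking gets closer, but it splits into separate rows (which is naturally down to my regex). This feels closest, but I haven't found a regex that captures this [yet].
The ideal result looks like the following, but all I really need is the title and the gross: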
Rank  Title         Studio  Gross         Theatres  Date
1     IT            WB      $327,481,748  4,138     9/8/17
2     The Exorcist  WB      $232,906,145  NA        12/26/73
The following gets closer:
df["Box_Office"].str.split(r'(\$[0-9,/]*)', expand=True).stack().to_frame()
Can extract or split expand across rows instead of across columns?
Answer 0: (score: 0)
Here is what I would do:
(?P<title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<studio>WB|Par|Art|Uni)
[^$]*
(?P<gross>\$\d+(?:,\d{3})*)
(?P<theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})
In Python, this would be:
import pandas as pd, re
junk = """
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94"""
rx = re.compile(r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})''', re.VERBOSE)
def replacer(d):
    d['Title'] = d['Title'].replace('\n', ' ')
    return d
records = (replacer(m.groupdict()) for m in rx.finditer(junk))
df = pd.DataFrame(records)
# reorder the columns if necessary
df = df[['Title', 'Studio', 'Gross', 'Theatres', 'Date']]
print(df)
This yields:
                        Title  Studio         Gross  Theatres      Date
0                          It      WB  $327,481,748     4,148    9/8/17
1                The Exorcist      WB  $232,906,145     -n/a-  12/26/73
2                     Get Out     Uni  $176,040,665     3,143  12/24/17
3     The Blair Witch Project     Art  $140,539,099     2,538   7/16/99
4               The Conjuring      WB  $137,400,141     3,115   7/19/13
5         Paranormal Activity     Par  $107,918,810     2,712   9/25/09
6  Interview with the Vampire      WB  $105,264,608     2,604  11/11/94
See a demo for the expression on regex101.com.
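Since the question says that only the title and the gross are really needed, the extracted strings probably want converting to numbers at some point. The following is a minimal sketch of one way to do that, not part of the original answer; the two-row frame is a stand-in copied from the output above, and treating `-n/a-` as missing is an assumption about what you want.

```python
import pandas as pd

# Stand-in for the frame built by the regex step (two rows from the output above).
df = pd.DataFrame({
    "Title": ["It", "The Exorcist"],
    "Gross": ["$327,481,748", "$232,906,145"],
    "Theatres": ["4,148", "-n/a-"],
})

# Strip "$" and "," so the gross can be stored as an integer.
df["Gross"] = df["Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")

# "-n/a-" is not a number; errors="coerce" turns it into NaN instead of raising.
df["Theatres"] = pd.to_numeric(
    df["Theatres"].str.replace(",", "", regex=False), errors="coerce"
)

print(df)
```

With numeric columns, sorting or summing by gross works directly, e.g. `df.sort_values("Gross", ascending=False)`.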
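As for your original question: you can extract the columns and then transpose the dataframe (i.e. pivot it). But where did you get this data from? Was it scraped from some site? You might want to reconsider that step!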
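On the rows-versus-columns question specifically, `Series.str.extractall` may be the closest fit: it returns one row per regex match rather than one column per split. The sketch below is an illustration rather than a tested solution for your real data; the `Box_Office` column name comes from the question, while the single-cell frame holding three fused records is made up for the example.

```python
import re
import pandas as pd

# Hypothetical frame mimicking the question: several records fused into one cell.
df = pd.DataFrame({
    "Box_Office": [
        "1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The\n"
        "ExorcistWB$232,906,145-n/a-12/26/733Get\n"
        "OutUni.$176,040,6653,143$33,377,0602,7812/24/17"
    ]
})

# Same expression as above, passed as a string so pandas can compile it.
pattern = r"""
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})
"""

# extractall yields one row per match, indexed by (original row, match number).
rows = df["Box_Office"].str.extractall(pattern, flags=re.VERBOSE)

# Titles that wrapped across a line break still contain the newline.
rows["Title"] = rows["Title"].str.replace("\n", " ", regex=False)

# Drop the two-level index if a flat 0..n-1 index is preferred.
rows = rows.reset_index(drop=True)
print(rows)
```

Note that the fused digits leave some fields ambiguous: `2,7812/24/17` is read as the date 12/24/17 here, although 2,781 followed by 2/24/17 is an equally valid reading. That is one more reason to look at the scraping step first.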