Python | Regex split into rows, not columns

Time: 2018-03-17 06:31:21

Tags: python regex pandas text

I have a dataframe with 5 nested rows, each containing data like the following:

1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94

What I would like to do is split this into new rows, not columns.

I have tried things like this:

df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True)
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).melt()
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).stack().to_frame()

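The regex splits at each new rank (e.g. 2The, 3Get, 4The). I just want the split to create new rows rather than columns. The regex needs some work, but I'm happy to do that part myself.

I can melt the dataframe to create rows, but the cleanup afterwards becomes very time-consuming (I'm happy to go down that road if there is no other way).

Stacking gets closer, but it splits the pieces into separate rows (which is naturally down to my regex). This feels like the closest approach, but I haven't been able to find a regex that captures this [yet].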
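For context, the separate rows most likely come from the capturing group in the split pattern: when the pattern contains a group, the matched separator is kept as its own piece instead of being discarded. Below is a minimal sketch using plain `re.split` on a shortened, made-up sample string; it assumes pandas' `str.split` behaves the same way for regex patterns.

```python
import re

# Shortened, made-up sample in the same shape as the real data
sample = "1ItWB2TheExorcistWB3GetOutUni."

# With a capturing group, each matched separator is returned as its own item
with_group = re.split(r'(\d[A-Z][a-z]*)', sample)

# Without the group, the matched text is simply dropped
without_group = re.split(r'\d[A-Z][a-z]*', sample)

print(with_group)
print(without_group)
```

Either way, the rank and the first word of the title end up detached from the rest of the record, which is why the stacked result needs further cleanup.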

The ideal result would look like the following, although all I really need is the title and the gross:

Rank      Title         Studio      Gross         Theatres       Date
1         It            WB          $327,481,748  4,148          9/8/17
2         The Exorcist  WB          $232,906,145  NA             12/26/73

The following gets closer:

df["Box_Office"].str.split(r'(\$[0-9,/]*)', expand=True).stack().to_frame()

Can extract or split expand across rows rather than across columns?

1 Answer:

Answer 0 (score: 0)

Here is what I would do:

(?P<title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<studio>WB|Par|Art|Uni)
[^$]*
(?P<gross>\$\d+(?:,\d{3})*)
(?P<theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})

In Python, this would be:

import re

import pandas as pd

junk = """
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94"""

rx = re.compile(r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})''', re.VERBOSE)

def replacer(d):
    d['Title'] = d['Title'].replace('\n', ' ')
    return d

records = (replacer(m.groupdict()) for m in rx.finditer(junk))
df = pd.DataFrame(records)

# reorder the columns if necessary
df = df[['Title', 'Studio', 'Gross', 'Theatres', 'Date']]
print(df)

This yields

                        Title Studio         Gross Theatres      Date
0                          It     WB  $327,481,748    4,148    9/8/17
1                The Exorcist     WB  $232,906,145    -n/a-  12/26/73
2                     Get Out    Uni  $176,040,665    3,143  12/24/17
3     The Blair Witch Project    Art  $140,539,099    2,538   7/16/99
4               The Conjuring     WB  $137,400,141    3,115   7/19/13
5         Paranormal Activity    Par  $107,918,810    2,712   9/25/09
6  Interview with the Vampire     WB  $105,264,608    2,604  11/11/94
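Since the title and the gross are what is really needed, the extracted strings could be turned into numbers afterwards. This is a rough sketch, not part of the code above: the two-row frame is copied by hand from the output, and the name `clean` is only illustrative.

```python
import pandas as pd

# Two rows copied from the output above, for illustration only
df = pd.DataFrame({
    "Title": ["It", "The Exorcist"],
    "Gross": ["$327,481,748", "$232,906,145"],
    "Theatres": ["4,148", "-n/a-"],
})

clean = df.copy()

# Strip the dollar sign and thousands separators, then cast to integer
clean["Gross"] = clean["Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")

# '-n/a-' cannot be parsed as a number, so errors="coerce" turns it into NaN
clean["Theatres"] = pd.to_numeric(
    clean["Theatres"].str.replace(",", "", regex=False), errors="coerce"
)

print(clean)
```

Because of the missing value, `Theatres` ends up as a float column.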

See a demo for the expression on regex101.com.

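As for your original question: you could extract the columns and then transpose the dataframe (i.e. pivot it). But where did you get this data from? Was it scraped from somewhere? You might want to reconsider that step!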
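If the data has to stay in the dataframe, `Series.str.extractall` may be worth a look, since it returns one row per match rather than one column per group. This is a sketch under assumptions: the column name `Box_Office` is taken from the question, the two cells are shortened excerpts of the sample data with line breaks replaced by spaces, and the pattern is the same expression as above.

```python
import re

import pandas as pd

# Two cells built from the sample data; the first one holds two records
df = pd.DataFrame({"Box_Office": [
    "1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/17"
    "2The ExorcistWB$232,906,145-n/a-12/26/73",
    "3Get OutUni.$176,040,6653,143$33,377,0602,7812/24/17",
]})

pattern = r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})'''

# One row per match; the index is (original row, match number)
rows = df["Box_Office"].str.extractall(pattern, flags=re.VERBOSE)

print(rows[["Title", "Gross"]])
```

The named groups become the column names, and calling `rows.reset_index(drop=True)` would flatten the two-level index if the original row number is not needed.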