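For some time I have been looking for a reliable regular expression to extract company names from copyright statements (and I do not know regular expressions well).
From this question: Regex to match company names from copyright statements under several conditions
I got this regex: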
(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)
But when I tried more examples, I found that it was not enough. I want to change it so that it also handles the following cases, while still working for all the previous ones:
Example:
602-226-2389 ©2019 Endurance International Group. Copyright 1999 — 2019 © Iflexion. All rights reserved.
Example:
ISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019
Example:
© 2019 Copyright arcadia.io. 2018 © Power Tools LLC
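To make the problem concrete, here is a small sketch that runs the first regex from the question against three of the new statements. The helper name `old_extract` is made up for illustration; only the pattern itself comes from the question.

```python
import re

# The first pattern from the question, copied unchanged and split over
# two string literals only for readability.
OLD_PATTERN = re.compile(
    r"(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)"
)


def old_extract(line):
    """Return the captured company name, or None when nothing matches."""
    m = OLD_PATTERN.search(line)
    return m.group(1) if m else None


samples = [
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
    "ISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019",
    "2018 © Power Tools LLC",
]

for line in samples:
    print(repr(line), "->", repr(old_extract(line)))
```

The first statement is captured as `— 2019 © Iflexion`, because the year range uses an em dash where the pattern only allows a hyphen, and the other two statements do not match at all, because the pattern requires digits directly after the copyright marker.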
Answer 0 (score: 2)
You can use
(?i)(?:©(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*Copyright)?|Copyright(?:\s*(?:\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*©)?)(?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)?\s*(.*?(?=\s*[.|]|\W*All\s+rights\s+reserved)|.*\b)
Python code:
import re
s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved.\r\n602-226-2389 ©2019 Endurance International Group.\r\nCopyright 1999 — 2019 © Iflexion. All rights reserved.\r\nISO 9001:2008, ISO/ IEC 27001:2005 © Mobikasa 2019\r\n© 2019 Copyright arcadia.io.\r\n2018 © Power Tools LLC\r\nCopyright 2019 ComputerEase Construction Software | 1-800-544-2530\r\n© 2019 3M. 3M Health Information Systems Privacy Policy"
rx = r'''(?xi)
(?:© # Start of a group: © symbol
(?:\s* # Start of optional group: 0+ whitespaces
(?:\d{4} # Start of optional group: 4 digits
(?:\s*[-—–]\s*\d{4})? # 0+ spaces, dashes, spaces, 4 digits
)? # End of group
\s*Copyright # Spaces and Copyright
)? # End of group
| # OR
Copyright
(?:\s* # Start of optional group: 0+ whitespaces
(?:\d{4} # Start of optional group: 4 digits
(?:\s*[-—–]\s*\d{4})? # 0+ spaces, dashes, spaces, 4 digits
)?\s*© # End of group, 0+ spaces, ©
)? # End of group
) # End of group
(?:\s*\d{4}(?:\s*[-—–]\s*\d{4})?)? # Optional group, 9999 optionally followed with dash enclosed with whitespaces and then 9999
\s* # 0+ whitespaces
( # Start of a capturing group:
.*? # any 0+ chars other than linebreak chars, as few as possible, up to...
(?=\s*[.|]| # 0+ spaces and then | or ., or
\W*All\s+rights\s+reserved) # All rights reserved with any 0+ non-word chars before it
| # or
.*\b # any 0+ chars other than linebreak chars, as many as possible
)'''
for m in re.findall(rx, s):
    print(m)
See the regex demo. Output:
Apple Inc
Quid, Inc
Database Designs
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
Endurance International Group
Iflexion
Mobikasa 2019
arcadia
Power Tools LLC
ComputerEase Construction Software
3M
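One entry in the output above, `Mobikasa 2019`, still carries the year, because in that statement the year comes after the name. A possible post-processing step is sketched below. It is not part of the answer: the function name `strip_trailing_year` and the decision to drop a trailing four-digit year or year range are assumptions made for illustration.

```python
import re

# A trailing 4-digit year, optionally followed by a dash and a second year.
# The dash class mirrors the one used in the answer's regex.
YEAR_TAIL = re.compile(r"\s*\b\d{4}(?:\s*[-—–]\s*\d{4})?\s*$")


def strip_trailing_year(name):
    """Remove a year or year range from the end of an extracted name."""
    return YEAR_TAIL.sub("", name)


for name in ["Mobikasa 2019", "Power Tools LLC", "3M"]:
    print(strip_trailing_year(name))
```

Names that do not end in a four-digit year, such as `3M`, pass through unchanged. A company whose real name ends in a year would be truncated, so this step only suits data where that is unlikely.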
Answer 1 (score: 0)
I believe this regex will give you what you need. Here is the explanation:
(?i) # make the regex case insensitive
(?:Copyright\s*©?|©\s*(Copyright)?) # Look for Copyright and/or © to get us started
([\d\s—-]+)? # There might be some digits, spaces, and dashes, but not necessarily
(©|Copyright)?\s* # Copyright or © could be separated by dates, so look for them again
(.+?) # This is the sugar we're looking for
(?=All rights reserved|\||$) # If you find "All rights reserved" a | or end of string, stop capturing the text
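The commented pieces above can be tried out with a short Python sketch like the following. The pieces are joined with ordinary string concatenation rather than compiled in verbose mode, since `re.VERBOSE` would ignore the literal spaces in `All rights reserved`. The helper name `company_name` and the final `.strip()` are additions for illustration, not part of the answer.

```python
import re

# The pattern from this answer, one piece per line.
PATTERN = re.compile(
    r"(?i)"                                  # case insensitive
    r"(?:Copyright\s*©?|©\s*(Copyright)?)"   # Copyright and/or ©
    r"([\d\s—-]+)?"                          # optional digits, spaces, dashes
    r"(©|Copyright)?\s*"                     # marker may follow the dates
    r"(.+?)"                                 # group 4: the company name
    r"(?=All rights reserved|\||$)"          # stop before these
)


def company_name(line):
    """Return group 4 with surrounding whitespace removed, or None."""
    m = PATTERN.search(line)
    return m.group(4).strip() if m else None


samples = [
    "Copyright © 2019 Apple Inc. All rights reserved.",
    "All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved",
    "Copyright 1999 — 2019 © Iflexion. All rights reserved.",
    "© 2019 Ford Motor Company",
]

for line in samples:
    print(company_name(line))
```

Unlike the pattern in Answer 0, this one does not stop at a period, so names followed by `All rights reserved` keep their trailing dot (for example `Apple Inc.`), and the name is in group 4 rather than group 1.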
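Answer 2 (score: 0)
I know this is an old question, but I wanted to post a better solution. I trained a spaCy model on 5k+ copyright text samples. Here are the model and the work: Repo Link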