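I'm working on a project where we want to extract company names, cities, states, and dollar amounts from a paragraph of text. This information is usually at the start of the paragraph, and I've been using a regex to find the first dollar sign (the amount we extract) and then taking the text between each pair of commas, since we know the order in which the fields appear. For example: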
company name, city, state, amount $123,456,653
We've run into cases where there can be X number of companies, each followed by its city and state, before the dollar amount.
Example: company name 1, city, state, company name 2, city, state, amount $123,456,653
In some cases the company name is given, but the next field is not the city; it is a longer form of the company name.
Example: company name 1, company name 1 longer, city, state, amount $123,456,653
Finally, we've seen cases where a statement says how many companies are being awarded a dollar amount, followed by all of the company names.
Example (excerpt): Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);xxxxxxxxxxxxxx
Usually (70-80% of the time) the paragraph looks like this:
L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 fixed-price-incentive,xxxxxxxxxx
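I'm just wondering whether anyone has suggestions for a Python library, or a better way to extract this specific text. I've considered implementing some kind of API that takes the extracted values (after splitting on commas) and checks whether each one is a city or a state; that would tell us where we are in the list and what is likely to come next (the state).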
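The comma-splitting idea could be sketched roughly as follows. This is only an illustration under assumptions: STATES is a small made-up subset rather than a complete list, and parse_awards is a hypothetical helper name. Everything before the first dollar amount is split on commas, and a token is treated as a state only when at least a company and a city are already pending, so that "Washington, District of Columbia" still reads as city and state.

```python
import re

# Assumption: illustrative subset only; a real list would cover every state.
STATES = {"Ohio", "Washington", "Nevada", "Texas", "Florida", "New York",
          "Georgia", "Maryland", "District of Columbia"}

AMOUNT_RE = re.compile(r"\$([0-9][0-9,]*)")


def parse_awards(text):
    """Return ([(company, city, state), ...], amount) for one paragraph."""
    m = AMOUNT_RE.search(text)
    if m is None:
        return [], None
    amount = int(m.group(1).replace(",", ""))

    # Only the text before the first dollar amount is split on commas.
    tokens = [t.strip() for t in text[:m.start()].split(",")]

    records, pending = [], []
    for tok in tokens:
        # A state closes a record only if a company and a city are pending.
        if tok in STATES and len(pending) >= 2:
            city = pending[-1]
            # Any earlier tokens are joined back into one company name,
            # which covers "company name 1, company name 1 longer".
            company = ", ".join(pending[:-1])
            records.append((company, city, tok))
            pending = []
        elif tok:
            pending.append(tok)
    return records, amount
```

Tokens left over after the last state (for example "is being awarded a") are simply dropped, which may or may not be what you want.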
Here is the regex I'm currently using: r'([^$]*),.*?\$([0-9,]+)'
Answer 0 (score: 0)
You could design an expression to capture the companies listed in that paragraph, such as:
(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)
and then add or remove boundaries as needed. The other cases can be handled in a similar way.
import re
string = """
Twenty-five companies have been awarded a firm-fixed-price contract under the following Global Heavyweight Service, indefinite-delivery/indefinite-quantity, fixed-price contracts with an estimated value of $284,932,621: ABX Air Inc., Wilmington, Ohio (HTC71119DC002); Air Transport International Inc., Wilmington, Ohio (HTC71119DC003); Alaska Airlines Inc., Seattle, Washington (HTC71119DC004); Allegiant Air LLC, Las Vegas, Nevada (HTC71119DC005); American Airlines, Fort Worth, Texas (HTC71119DC006); Amerijet International Inc., Fort Lauderdale, Florida (HTC71119DC007); Atlas Air Inc., Purchase, New York (HTC71119DC008;) Delta Air Lines Inc., Atlanta, Georgia (HTC71119DC009); Federal Express Corp., Washington, District of Columbia (HTC71119DC010);
"""
expression = r'(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)\s+\(\s*([a-z0-9]{13};?)\s*\)'
# findall returns one (company, city, state, contract number) tuple per match
matches = re.findall(expression, string)
print(matches)

Output:
[(' ABX Air Inc.', ' Wilmington', 'Ohio', 'HTC71119DC002'), (' Air Transport International Inc.', ' Wilmington', 'Ohio', 'HTC71119DC003'), (' Alaska Airlines Inc.', ' Seattle', 'Washington', 'HTC71119DC004'), (' Allegiant Air LLC', ' Las Vegas', 'Nevada', 'HTC71119DC005'), (' American Airlines', ' Fort Worth', 'Texas', 'HTC71119DC006'), (' Amerijet International Inc.', ' Fort Lauderdale', 'Florida', 'HTC71119DC007'), (' Atlas Air Inc.', ' Purchase', 'New York', 'HTC71119DC008;'), (' Delta Air Lines Inc.', ' Atlanta', 'Georgia', 'HTC71119DC009'), (' Federal Express Corp.', ' Washington', 'District of Columbia', 'HTC71119DC010')]
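The captured groups keep their leading spaces, and one contract number keeps a stray semicolon ('HTC71119DC008;'). A small post-processing step could tidy that up and also pull out the dollar amount. The sketch below reuses the same expression; tidy and first_amount are hypothetical helper names, not part of any library.

```python
import re

# Same expression as above, split across lines for readability.
EXPRESSION = (
    r'(?i)([a-z0-9\s.-]*),([^\r\n,]*),\s*'
    r'(Ohio|Washington|Georgia|Nevada|Florida|Texas|New York|District of Columbia)'
    r'\s+\(\s*([a-z0-9]{13};?)\s*\)'
)


def tidy(matches):
    """Turn findall tuples into dicts with whitespace and stray ';' removed."""
    return [
        {
            "company": company.strip(),
            "city": city.strip(),
            "state": state,
            "contract": contract.rstrip(";"),
        }
        for company, city, state, contract in matches
    ]


def first_amount(text):
    """Return the first dollar amount in text as an int, or None if absent."""
    m = re.search(r"\$([0-9][0-9,]*)", text)
    return int(m.group(1).replace(",", "")) if m else None
```

Calling tidy(re.findall(EXPRESSION, string)) and first_amount(string) on the paragraph above would then give one record per company plus the contract value.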
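If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you like, you can also watch in this link how it matches against some sample inputs.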
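For the common single-company sentence from the question ("L-3 Chesapeake Sciences Corp., Millersville, Maryland, is being awarded a $43,094,331 ..."), a similar expression with named groups might work. This is a sketch under assumptions: STATES is an illustrative subset, the alternation is built from that list instead of being typed by hand, and parse_common is a hypothetical helper name.

```python
import re

# Assumption: illustrative subset; extend it with the remaining states.
STATES = ["Maryland", "Ohio", "New York", "District of Columbia", "Washington"]

# Escape each name and put longer names first in the alternation.
STATE_ALT = "|".join(re.escape(s) for s in sorted(STATES, key=len, reverse=True))

COMMON_RE = re.compile(
    r"^\s*(?P<company>.+?),\s*"            # company name (may contain commas)
    r"(?P<city>[^,]+),\s*"                 # city: one comma-free field
    r"(?P<state>" + STATE_ALT + r"),"      # state, followed by a comma
    r".*?\$(?P<amount>[0-9][0-9,]*)",      # first dollar amount after that
    re.DOTALL,
)


def parse_common(paragraph):
    """Return a dict with company, city, state, and amount, or None."""
    m = COMMON_RE.match(paragraph)
    if m is None:
        return None
    record = m.groupdict()
    record["amount"] = int(record["amount"].replace(",", ""))
    return record
```

Because the company group is lazy and the city group cannot contain a comma, a name such as "company name 1, company name 1 longer" ends up entirely in the company field. Paragraphs that return None can be passed on to the multi-company expression instead.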