Question

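I am trying to find stock tickers in a PDF file in two steps: 1) use a regular expression to search the document for everything that looks like a stock ticker, then 2) compare the results against an existing list of known tickers. I'm fine with #1, but on its own it produces some false positives.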

So how do I keep a match only when it also appears in the existing list? Here is my code:


import re

tickers = ['CA', 'V', 'MA', 'EB', 'PE', 'QCOM', 'BAC', 'A', 'AMZN']

text = '''This is sample text that mentions different companies and tickers like
V (Visa), QCOM (Qualcomm), A (Agilent), GE (General Electric), MA (Mastercard),
EB (Eventbrite), and PE (Parsley Energy Inc). The output should ignore values
that do not match regex AND do not appear in the tickers list. For example,
GXXX, ALLL, and QQWE should not match since they do not appear in the
tickers list.'''

regex = re.compile(r'\b[A-Z]{1,5}\b[.!?]?')

matches = regex.finditer(text)
for match in matches:
    print(match)

Answer 1

One approach here is to build a regex alternation from the list of tickers, then use re.findall to find all matches:

regex = "\\b(?:" + "|".join(tickers) + ")\\b"
matches = re.findall(regex, text)
print(matches)

['V', 'QCOM', 'A', 'MA', 'EB', 'PE']

In case you're wondering, this is the regex pattern being used:

\b(?:CA|V|MA|EB|PE|QCOM|BAC|A|AMZN)\b

That is, it matches any one of your tickers, with a word boundary on each side to prevent false matches on substrings of longer words.
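One caveat with joining the list directly: if a ticker ever contains a regex metacharacter, the pattern changes meaning. Below is a minimal sketch, assuming a hypothetical ticker 'BRK.B' has been added to the list (it is not in the original question), that escapes each symbol with re.escape and sorts longer symbols first. The helper name build_ticker_pattern is made up for illustration.

```python
import re

# 'BRK.B' is a hypothetical addition used only to show why escaping matters.
tickers = ['CA', 'V', 'MA', 'EB', 'PE', 'QCOM', 'BAC', 'A', 'AMZN', 'BRK.B']

def build_ticker_pattern(symbols):
    # Try longer symbols first so a longer ticker is preferred over a
    # shorter one that happens to be its prefix.
    ordered = sorted(symbols, key=len, reverse=True)
    # re.escape turns 'BRK.B' into 'BRK\.B', so the dot matches only a literal dot.
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

pattern = build_ticker_pattern(tickers)

sample = "V (Visa), QCOM (Qualcomm), BRKXB and BRK.B, GXXX"
found = pattern.findall(sample)
print(found)  # ['V', 'QCOM', 'BRK.B']
```

Without the escaping, the unescaped dot in 'BRK.B' would also match 'BRKXB'.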

Answer 2

Unless there is a compelling reason to use it, regex may be overkill here. You can simply do the following:

tickers = ['CA', 'V', 'MA', 'EB', 'PE', 'QCOM', 'BAC', 'A', 'AMZN']

text = '''This is sample text that mentions different companies and tickers like
 V (Visa), QCOM (Qualcomm), A (Agilent), GE (General Electric), MA (Mastercard),
 EB (Eventbrite), and PE (Parsley Energy Inc). The output should ignore values
 that do not match regex AND do not appear in the tickers list. For example,
 GXXX, ALLL, and QQWE should not match since they do not appear in the tickers
 list.'''

for tic1 in tickers:
    if tic1 in text.split():
        print(tic1, ' found')

Output:
V  found
MA  found
EB  found
PE  found
QCOM  found
A  found

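If the text contains extra newlines, you can replace them with spaces as shown here. Using a space rather than an empty string keeps words on adjacent lines from being joined. Note that text.split() with no arguments already splits on newlines, so this step only matters if you process the text some other way:
text = text.replace('\n', ' ')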
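A possible middle ground between the two answers, closer to the two-step approach described in the question: let a regex pick out candidates, then keep only those found in a set of known tickers. This is a sketch under stated assumptions, not a definitive implementation. The function name find_known_tickers and the shortened sample text are made up for illustration, and the candidate pattern is the one from the question without the trailing punctuation class.

```python
import re

tickers = ['CA', 'V', 'MA', 'EB', 'PE', 'QCOM', 'BAC', 'A', 'AMZN']
known = set(tickers)  # set membership tests are fast

# Hypothetical sample text; note 'MA,' and 'AMZN.' have punctuation attached.
text = ("Tickers like V (Visa), QCOM (Qualcomm), GE (General Electric) "
        "and MA, plus GXXX and AMZN.")

# Step 1: anything that looks like a ticker (1 to 5 capital letters).
candidate_re = re.compile(r'\b[A-Z]{1,5}\b')

def find_known_tickers(text, known):
    # Step 2: keep a candidate only if it is in the known list,
    # preserving first-seen order and dropping repeats.
    seen = []
    for m in candidate_re.finditer(text):
        sym = m.group()
        if sym in known and sym not in seen:
            seen.append(sym)
    return seen

result = find_known_tickers(text, known)
print(result)  # ['V', 'QCOM', 'MA', 'AMZN']
```

A plain text.split() would produce the tokens 'MA,' and 'AMZN.' here and miss both, which is why the candidates are extracted with word boundaries instead.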

How to find regex matches and then match them against an existing list

2 answers: