Regex with Beautiful Soup: extracting all the letters after ':'

Asked: 2017-02-17 02:03:32

Tags: regex python-3.x beautifulsoup

I can't seem to get a regular expression to work the way I want.

When I run this code, I get the text shown below:

for paragraph in soup.find_all('p'):
        print(paragraph.find_all(text =re.compile(":*\w*")))

The text I get is:

Continuing our series of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma. Ophthotech (NASDAQ:OPHT) -94% Galena Biopharma (NASDAQ:GALE) -93% Cempra (NASDAQ:CEMP) -91% Toaki Pharma (NASDAQ:TKAI) -89% Anthera Pharma (NASDAQ:ANTH) -86% Adeptus Health (NYSE:ADPT) -86% CytRx (NASDAQ:CYTR) -86% Novavax (NASDAQ:NVAX) -85%

From this I only want to extract the ticker symbols, so the ideal output would be:

OPHT
GALE
CEMP
TKAI

and so on.

I have tried these variations of the code:

for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text =re.compile('(:\w+)')))
for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text =re.compile("(:*\w*)")))
for paragraph in soup.find_all('p'):
    print(paragraph.find_all(text =re.compile('(:)?\w+')))

But most of the time my output looks like this:

['Continuing our ', 'series', " of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma."]
['Ophthotech (NASDAQ:', 'OPHT', ') -94%']
['Galena Biopharma (NASDAQ:', 'GALE', ') -93%']
['Cempra (NASDAQ:', 'CEMP', ') -91%']
['Toaki Pharma (NASDAQ:', 'TKAI', ') -89%']
['Anthera Pharma (NASDAQ:', 'ANTH', ') -86%']
['Adeptus Health (NYSE:', 'ADPT', ') -86%']
['CytRx (NASDAQ:', 'CYTR', ') -86%']
['Novavax (NASDAQ:', 'NVAX', ') -85%']

I'm not sure what I'm doing wrong.

Thanks.

2 answers:

Answer 0: (score: 2)

You can try this:

import re

text = """Continuing our series of surfacing 2016 stinkers, here are the 25 Russell 2000 stocks that imploded in 2016. Further down, you'll find the 25 worst stocks excluding pharma.
Ophthotech (NASDAQ:OPHT) -94%
Galena Biopharma (NASDAQ:GALE) -93%
Cempra (NASDAQ:CEMP) -91%
Toaki Pharma (NASDAQ:TKAI) -89%
Anthera Pharma (NASDAQ:ANTH) -86%
Adeptus Health (NYSE:ADPT) -86%
CytRx (NASDAQ:CYTR) -86%
Novavax (NASDAQ:NVAX) -85%"""

# It's better to compile a regex once, outside any loop.
# ':' followed by word characters (captured) and a closing parenthesis.
pattern = re.compile(r':(\w+)\)')

results = pattern.findall(text)

for ticker in results:
    print(ticker)
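
As for why the attempts in the question return whole strings: as far as I understand Beautiful Soup's behaviour, a compiled regex passed as `text=` is only used as a filter. Each string is kept or dropped depending on whether the pattern is found in it, and the capture groups are never returned. The sketch below imitates that with plain Python, using a hypothetical list `strings` in place of the text nodes of one parsed paragraph, and contrasts it with `findall`.

```python
import re

# Hypothetical stand-in for the text nodes Beautiful Soup yields for one <p>.
strings = ['Ophthotech (NASDAQ:', 'OPHT', ') -94%']

# A pattern like ':*\w*' can match the empty string, so it is found in every string.
loose = re.compile(r':*\w*')
kept = [s for s in strings if loose.search(s)]
print(kept)  # every string passes the filter and comes back unchanged

# To get only the captured group, run findall on the joined text instead.
pattern = re.compile(r':(\w+)\)')
tickers = pattern.findall(''.join(strings))
print(tickers)
```

With a real `soup` object, the joined text would presumably come from `paragraph.text`, as in the other answer.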

Answer 1: (score: 1)

This might be a good direction:

re.search(r':(\w+)\)', paragraph.text).group(1)
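
Note that `re.search` returns `None` when a paragraph contains no ticker (the introductory paragraph, for instance), and calling `.group(1)` on `None` raises `AttributeError`. A minimal sketch with a guard, where the list `paragraphs` is a hypothetical stand-in for the `paragraph.text` value of each `<p>` element:

```python
import re

pattern = re.compile(r':(\w+)\)')

# Hypothetical stand-in for paragraph.text of each <p> element.
paragraphs = [
    'Continuing our series of surfacing 2016 stinkers',
    'Ophthotech (NASDAQ:OPHT) -94%',
    'Adeptus Health (NYSE:ADPT) -86%',
]

tickers = []
for text in paragraphs:
    match = pattern.search(text)
    if match:  # skip paragraphs that contain no ticker
        tickers.append(match.group(1))

print(tickers)
```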

Try adding the r prefix before the pattern, so that it is a raw string (r'...').
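
The prefix matters because, without it, Python interprets some backslash sequences before the regex engine ever sees them. `\w` happens to survive in a normal string (newer Python versions warn about it), but `\b` becomes a backspace character. A small illustration, using a made-up `sample` string:

```python
import re

sample = 'Ophthotech (NASDAQ:OPHT) -94%'

# Raw string: the regex engine receives a backslash followed by 'b' (word boundary).
raw_hits = re.findall(r'\bOPHT\b', sample)

# Normal string: '\b' is already a single backspace character, so nothing matches.
plain_hits = re.findall('\bOPHT\b', sample)

print(len(r'\b'), len('\b'))
print(raw_hits, plain_hits)
```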