I have a list of tweets. They look like this:
data = [['trading $aa $BB stock market info'],
['$aa is $116 market is doing well $cc $ABC']]
I want to extract the ticker symbols from each tweet:
['$aa', '$BB']
['$aa', '$cc', '$ABC']
I tried this:
import re
for i in data:
    print(re.findall(r'[$]\S*', str(i)))
However, the output also includes the dollar amount $116:
['$aa', '$BB']
['$aa', '$116', '$cc', '$ABC']
Any suggestions?
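Answer 0 (score: 3)
Match a dollar sign, then a letter, then any run of non-whitespace characters. Requiring a letter right after the dollar sign is what excludes amounts such as $116: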
re.findall(r'[$][A-Za-z][\S]*', str(i))
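For reference, here is a minimal self-contained sketch of that pattern applied to the sample data from the question. The helper name extract_tickers is made up for illustration. It searches the inner strings directly instead of str(i), because str(i) appends the list's closing quote and bracket to the last token, and \S* would capture them as part of the match.

```python
import re

data = [['trading $aa $BB stock market info'],
        ['$aa is $116 market is doing well $cc $ABC']]

# Dollar sign, one ASCII letter, then any run of non-whitespace characters.
TICKER_RE = re.compile(r'\$[A-Za-z]\S*')

def extract_tickers(rows):
    """Return one list of ticker matches per row (hypothetical helper)."""
    # Each row is a list of strings; join them so the regex sees plain text
    # rather than the repr of a list.
    return [TICKER_RE.findall(' '.join(row)) for row in rows]

for tickers in extract_tickers(data):
    print(tickers)
```

Note that \S* is greedy, so trailing punctuation such as a comma directly after a ticker would still be included in the match. If that matters for your tweets, a stricter character class such as [A-Za-z]+ might be a better fit.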
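Answer 1 (score: 1)
The reticker package does this by building a custom regular expression from its configuration. It then uses that pattern to extract tickers from the text. Alternatively, the returned pattern can be used on its own.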
>>> import reticker
>>> extractor = reticker.TickerExtractor()
>>> type(extractor.pattern)
<class 're.Pattern'>
>>> extractor.extract("Comparing FNGU vs $WEBL vs SOXL- who wins? And what about $cldl vs $Skyu? IMHO, SOXL is king!\nBTW, will the $w+$Z pair still grow?")
['FNGU', 'WEBL', 'SOXL', 'CLDL', 'SKYU', 'W', 'Z']
>>> extractor.extract("Which of BTC-USD, $ETH-USD and $ada-usd is best?\nWhat about $Brk.a and $Brk.B? Compare futures MGC=F and SIL=F.")
['BTC-USD', 'ETH-USD', 'ADA-USD', 'BRK.A', 'BRK.B', 'MGC=F', 'SIL=F']