Grouping similar strings into a single group in Python

Time: 2018-04-06 00:11:31

Tags: regex python-3.x machine-learning pattern-matching cluster-analysis

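I have about 30,000 bank names in a data frame. I want to group them under a base group, because most of them are the same bank and differ only in location. However, I don't know in advance which bank names are in there.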

Below is a subset of the dataset. From this data I can identify two banks, ROYAL BANK and BARCLAYS, so I would like to end up with two groups.

ROYAL BANK (count: 13) BARCLAYS (count: 7)

ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
THE ROYAL BANK OF SCOTLAND PLC
THE ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF CANADA CAYMAN ISLANDS
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL, LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL LTD.
ROYAL BANK OF SCOTLAND, N.V.
RBC ROYAL BANK (BAHAMAS), LTD.
ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF SCOTLAND PLC
BARCLAYS BANK PLC
BARCLAYS BANK DELAWARE
BARCLAYS BANK OF GHANA, LTD.
BARCLAYS BANK DELAWARE
BARCLAYCARD GERMANY
BARCLAYS BANK PLC
BARCLAYS BANK PLC

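Other banks follow a similar pattern, and I would like a generic way to identify the unique groups (bank names) in the list and group the entries together.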

1 Answer:

Answer 0: (score: 1)

Do you want something like this?

[ ROYAL BANK ]
ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
THE ROYAL BANK OF SCOTLAND PLC
THE ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF CANADA CAYMAN ISLANDS
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL, LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL LTD.
ROYAL BANK OF SCOTLAND, N.V.
RBC ROYAL BANK (BAHAMAS), LTD.
ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF SCOTLAND PLC

[ BARCLAY ]
BARCLAYS BANK PLC
BARCLAYS BANK DELAWARE
BARCLAYS BANK OF GHANA, LTD.
BARCLAYS BANK DELAWARE
BARCLAYCARD GERMANY
BARCLAYS BANK PLC
BARCLAYS BANK PLC

Use this regular expression:

(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$

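Here the matched bank name is captured in group 1, and the detected keyword (ROYAL BANK or BARCLAY) is captured in group 2 or group 3, respectively. Use these groups in a Python script to classify the banks by name.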
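To make the role of the three groups concrete, here is a minimal sketch that runs the pattern on two names taken from the sample data. The variable names `pattern`, `sample`, and `matches` are only illustrative. `re.findall` returns one tuple per match, and a group that did not take part in the match comes back as an empty string.

```python
import re

# Same pattern as above: group 1 = full name, group 2 / group 3 = detected keyword.
pattern = re.compile(r'(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$')

# Two lines from the sample data, one per bank.
sample = "THE ROYAL BANK OF SCOTLAND PLC\nBARCLAYCARD GERMANY"

# Each match is a (name, royal_keyword, barclay_keyword) tuple.
matches = pattern.findall(sample)
for name, royal, barclay in matches:
    print(repr(name), repr(royal), repr(barclay))
```

Whichever of the last two fields is non-empty tells you which bank the name belongs to.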

The following Python script illustrates the basic idea of the name classification you are after.

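import re

# Paste the sample bank names here, one per line.
ss = """ copy & paste sample text in this area """

royalbank = []
barclay = []

# Group 1: full bank name; group 2: ROYAL BANK keyword; group 3: BARCLAY keyword.
regx = re.compile(r'(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$')

# findall returns one (name, royal_keyword, barclay_keyword) tuple per match;
# a group that did not take part in the match is an empty string.
matching = regx.findall(ss)
for m in matching:
    if m[1] != "":
        royalbank.append(m[0])
    elif m[2] != "":
        barclay.append(m[0])

print("\n[ ROYAL BANK ]")
for e in royalbank:
    print(e)
print("\n[ BARCLAY ]")
for e in barclay:
    print(e)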
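With more than two banks, one list and one capture group per bank gets unwieldy. As a rough sketch of the same idea driven by a keyword list, the function below (the name `group_by_keyword` and the variables `names`, `keywords`, and `groups` are hypothetical, not part of the answer above) builds the alternation from the keywords and collects the names in a dictionary. It still assumes you supply the keywords yourself; it does not discover bank names automatically.

```python
import re
from collections import defaultdict

def group_by_keyword(names, keywords):
    """Group each name under the leftmost keyword found in it."""
    # Escape the keywords so that characters such as '.' are matched literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    groups = defaultdict(list)
    for name in names:
        m = pattern.search(name)
        if m:  # names that contain no keyword are skipped
            groups[m.group(0)].append(name)
    return dict(groups)

# The sample data from the question.
names = [
    "ROYAL BANK OF CANADA",
    "ROYAL BANK OF CANADA",
    "THE ROYAL BANK OF SCOTLAND PLC",
    "THE ROYAL BANK OF SCOTLAND PLC",
    "ROYAL BANK OF CANADA CAYMAN ISLANDS",
    "RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.",
    "RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.",
    "THE ROYAL BANK OF SCOTLAND INTERNATIONAL, LTD.",
    "THE ROYAL BANK OF SCOTLAND INTERNATIONAL LTD.",
    "ROYAL BANK OF SCOTLAND, N.V.",
    "RBC ROYAL BANK (BAHAMAS), LTD.",
    "ROYAL BANK OF SCOTLAND PLC",
    "ROYAL BANK OF SCOTLAND PLC",
    "BARCLAYS BANK PLC",
    "BARCLAYS BANK DELAWARE",
    "BARCLAYS BANK OF GHANA, LTD.",
    "BARCLAYS BANK DELAWARE",
    "BARCLAYCARD GERMANY",
    "BARCLAYS BANK PLC",
    "BARCLAYS BANK PLC",
]

keywords = ["ROYAL BANK", "BARCLAY"]
groups = group_by_keyword(names, keywords)

for keyword, members in groups.items():
    print(f"{keyword} (count: {len(members)})")
```

On the sample data this gives 13 names under ROYAL BANK and 7 under BARCLAY, which matches the counts in the question. Adding another bank only requires appending its keyword to the list.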