How to identify text template patterns in a dataset of strings?

Time: 2019-04-14 06:24:05

Tags: python algorithm text text-classification

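I'm trying to find an efficient way to process a list of text records and identify the text templates commonly used in them, keeping only the fixed parts and abstracting away the variable parts, while also counting how many records match each identified template.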

-

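My most successful attempt so far splits each text record into an array of words, compares every pair of word arrays that have the same number of words, and writes each template it finds to a list of templates.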

As you might expect, it is not perfect, and it struggles to run on datasets of more than 50,000 records.

I'm wondering whether there is a text classification library that would be more efficient or faster; my current code is very naive...

-

Here is my first attempt in Python, using very simple logic.

samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

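# Split every record into a list of words.
samples_split = [x.split() for x in samples]
identified_templates = []

# Compare every pair of records that have the same number of words.
for words_list in samples_split:
    for words_list_ref in samples_split:
        # Skip pairs of different length and pairs of identical records.
        if len(words_list) != len(words_list_ref) or words_list == words_list_ref:
            continue
        template = ''
        for word, word_ref in zip(words_list, words_list_ref):
            # Keep words that agree; replace the others with '%'.
            if word == word_ref:
                template += ' ' + word
            else:
                template += ' %'
        identified_templates.append(template)

# Deduplicate the templates (the stored value is always 1, not a count).
templates = dict()
for template in identified_templates:
    if template not in templates:
        templates[template] = 1

# Discard templates that contain three consecutive placeholders.
templates_2 = dict()
for key, value in templates.items():
    if '% % %' not in key:
        templates_2[key] = 1

print(templates_2)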

Ideally, the code should take input like the following:

- “Your order tracking number is 123” 
- “Thank you for creating an account with us” 
- “Your order tracking number is 888”
- “Thank you for creating an account with us” 
- “Hello Jim, what is your issue?”
- “Hello Jack, what is your issue?”

and output a list of templates together with the number of records each one matches.

- “Your order tracking number is {}”, 2
- “Thank you for creating an account with us”, 2
- “Hello {}, what is your issue?”, 2
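One way to get from the input above to that output is a greedy, single-pass merge. The sketch below is only an illustration under stated assumptions, not a tuned or definitive solution. The names `tokenize`, `merge`, `find_templates` and the `max_diff` parameter are hypothetical and do not come from any library. It assumes that records belonging to the same template have the same number of tokens, and that a record joins a template when at most `max_diff` positions would have to become new placeholders. Templates are bucketed by token count, so each record is compared only against templates of its own length rather than against every other record.

```python
import re
from collections import defaultdict

PLACEHOLDER = "{}"
# Runs of word characters, or runs of everything else (spaces, punctuation).
TOKEN = re.compile(r"\w+|\W+")


def tokenize(text):
    # Separators are kept as tokens, so "".join(tokens) rebuilds the text.
    return TOKEN.findall(text)


def merge(template, tokens, max_diff):
    """Merge tokens into template; return None if they differ too much.

    Both sequences are expected to have the same length.
    """
    merged, new_slots = [], 0
    for t, w in zip(template, tokens):
        if t == w or t == PLACEHOLDER:
            merged.append(t)            # fixed part, or an existing placeholder
        else:
            merged.append(PLACEHOLDER)  # this position becomes variable
            new_slots += 1
    return merged if new_slots <= max_diff else None


def find_templates(records, max_diff=1):
    # token count -> list of [template tokens, number of matching records]
    buckets = defaultdict(list)
    for record in records:
        tokens = tokenize(record)
        bucket = buckets[len(tokens)]
        for entry in bucket:
            merged = merge(entry[0], tokens, max_diff)
            if merged is not None:
                entry[0] = merged
                entry[1] += 1
                break
        else:
            # No existing template was close enough: start a new one.
            bucket.append([tokens, 1])
    return [("".join(t), n) for bucket in buckets.values() for t, n in bucket]


records = [
    "Your order tracking number is 123",
    "Thank you for creating an account with us",
    "Your order tracking number is 888",
    "Thank you for creating an account with us",
    "Hello Jim, what is your issue?",
    "Hello Jack, what is your issue?",
]

for template, count in find_templates(records):
    print(template, count)
```

On the six example records this yields the three templates listed above, each with a count of 2. The result depends on the order of the records and on `max_diff`: because existing placeholders match anything, a template can gain further placeholders as more records arrive, so the threshold needs tuning against real data.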

1 Answer:

Answer 0 (score: 0)

You can try the code below. I hope the output matches your expectations.

import re
templates_2 = {}
samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

# Replace every run of digits with a '{}' placeholder.
identified_templates = [re.sub(r'[0-9]+', '{}', asample) for asample in samples]
unique_identified_templates = list(set(identified_templates))
# Count how many records collapse to each unique template.
for atemplate in unique_identified_templates:
    templates_2.update({atemplate: identified_templates.count(atemplate)})
for k, v in templates_2.items():
    print(k, ':', v)

Output:

The code for your gardening purchase is {} : 1
Your order {} has been confirmed. Thank you : 5
The code for your bakery purchase is {} : 2
The code for your butcher purchase is {} : 2
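
The counting step in this answer calls `list.count` once per unique template, which rescans the whole list each time. On a large dataset, a single pass with `collections.Counter` should be cheaper. The sketch below keeps the same digits-only substitution; the function name `count_numeric_templates` is made up for illustration. Like the code above, it only treats runs of digits as variables, so it would not turn a name such as "Jim" into a placeholder.

```python
import re
from collections import Counter

NUMBER = re.compile(r"[0-9]+")


def count_numeric_templates(records):
    # Replace each run of digits with '{}' and tally the results in one pass.
    return Counter(NUMBER.sub("{}", record) for record in records)


example = [
    "Your order 12345 has been confirmed. Thank you",
    "Your order 12346 has been confirmed. Thank you",
    "The code for your bakery purchase is 1234",
]

# most_common() lists templates from most to least frequent.
for template, count in count_numeric_templates(example).most_common():
    print(template, ":", count)
```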