
Time: 2017-01-11 14:25:45

Tags: python regex algorithm replace text-formatting

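I'm having trouble coming up with a good algorithm for replacing certain entities in a text. Here are the details: I have a text that I need to format as HTML, and the formatting information is in a Python list of entity dictionaries. As an example, say the original text is this (note the formatting):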

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

The text I receive (without formatting) is:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

and a list of entities like this:

entities = [{"entity_text":"Lorem Ipsum", "type": "bold", "offset": 0, "length":"11"}, {"entity_text":"dummy", "type": "italic", "offset": 22, "length":"5"},{"entity_text":"printing", "type": "text_link", "offset": 41, "length":"8", "url": "google.com"}]
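Each entity is meant to describe a slice of the plain text, so `text[offset:offset + length]` should reproduce `entity_text`. The following is a minimal sketch of that sanity check, not part of the original post: `entity_matches` is a hypothetical helper name, and converting `length` with `int()` is an assumption based on the sample storing lengths as strings. Counting from zero, "printing" starts at index 40 in the sample sentence, so an entity recorded at offset 41 fails the check.

```python
def entity_matches(text, entity):
    """Return True if the entity's offset and length select its entity_text."""
    start = int(entity["offset"])
    end = start + int(entity["length"])  # assumption: length may arrive as a string
    return text[start:end] == entity["entity_text"]


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry"
bold = {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": "11"}
link = {"entity_text": "printing", "type": "text_link", "offset": 41,
        "length": "8", "url": "google.com"}

print(entity_matches(text, bold))                    # True
print(entity_matches(text, link))                    # False: the slice starts one character late
print(entity_matches(text, dict(link, offset=40)))   # True
```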

My algorithm should translate the given unformatted text and entities into this HTML:



<b>Lorem Ipsum</b> is simply <i>dummy</i> text of the <a href="google.com">printing</a> and typesetting industry

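so that it renders like the original message. I have tried string replacement, but it messes up the offsets (the position of each entity from the start of the text). Keep in mind that a word that is formatted in one place may also appear elsewhere in the text unformatted, so I have to find exactly the occurrences that should be formatted. Can anyone help? I'm writing the code in Python, but you can describe the algorithm in any language.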
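To make the offset problem concrete, here is a small sketch (an illustration added for clarity, using a shortened sample sentence): once tags are inserted for one entity, every later offset still refers to the original text and no longer selects the right characters.

```python
text = "Lorem Ipsum is simply dummy text"

# Wrap the first entity (offset 0, length 11) in <b> tags.
wrapped = "<b>" + text[0:11] + "</b>" + text[11:]

# The second entity was recorded at offset 22, length 5, in the ORIGINAL text.
print(text[22:27])     # selects "dummy" in the original text
print(wrapped[22:27])  # no longer selects "dummy" after the insertion

# Shifting by the length of the inserted tags restores the match.
shift = len("<b>") + len("</b>")
print(wrapped[22 + shift:27 + shift])
```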

Edit: Sorry, I forgot to post the code I tried. Here it is:

def format_html(text, entities):
    for entity in entities:
        try:
            entity_text = entity['entity_text']
            position = text.find(entity_text, entity['offset'])
            if position == entity['offset']:
                before = text[:position]
                after = text[min(position+entity['length'], len(text)-1):]
                if entity['type'] == 'text_link':
                    text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                    text = before + text_link + after
                elif entity['type'] == 'code':
                    code = '<code>{}</code>'.format(entity_text)
                    text = before + code + after
                elif entity['type'] == 'bold':
                    bold_text = '<b>{}</b>'.format(entity_text)
                    text = before + bold_text + after
                elif entity['type'] == 'italic':
                    italic_text = '<i>{}</i>'.format(entity_text)
                    text = before + italic_text + after
                elif entity['type'] == 'pre':
                    pre_code = '<pre>{}</pre>'.format(entity_text)
                    text = before + pre_code + after
        except:
            pass

2 Answers:

Answer 0 (score: 1)

Do you perhaps mean something like this?

text = ""
for entry in entities:
    # Start from the entity's own text, then wrap it according to its type.
    line = entry['entity_text']
    if entry['type'] == 'bold':
        line = "<b> {} </b>".format(line)
    elif entry['type'] == 'italic':
        line = "<i> {} </i>".format(line)
    elif entry['type'] == 'text_link':
        line = '<a href="{}">{}</a>'.format(entry['url'], line)
    text += line
text  # in an interactive session this echoes the result

which evaluates to

'<b> Lorem Ipsum </b><i> dummy </i><a href="google.com">printing</a>'

Answer 1 (score: 0)

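Well, this is how I solved it. Every time I modify the text, I adjust the offsets of the later entities by the length of the extra string added to the text (the tags). It is expensive in terms of computation time, but it is the only option I've seen.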

def format_html(text, entities):
    for entity in entities:
        try:
            modified = None
            entity_text = entity['entity_text']
            # Lengths may arrive as strings, so convert before doing arithmetic.
            length = int(entity['length'])
            position = text.find(entity_text, entity['offset'])
            if position == entity['offset']:
                before = text[:position]
                after = text[position + length:]
                if entity['type'] == 'text_link':
                    text_link = '<a href="{}">{}</a>'.format(entity['url'], entity_text)
                    text = before + text_link + after
                    modified = 15 + len(entity['url'])
                elif entity['type'] == 'code':
                    code = '<code>{}</code>'.format(entity_text)
                    text = before + code + after
                    modified = 13
                elif entity['type'] == 'bold':
                    bold_text = '<b>{}</b>'.format(entity_text)
                    text = before + bold_text + after
                    modified = 7
                elif entity['type'] == 'italic':
                    italic_text = '<i>{}</i>'.format(entity_text)
                    text = before + italic_text + after
                    modified = 7
                elif entity['type'] == 'pre':
                    pre_code = '<pre>{}</pre>'.format(entity_text)
                    text = before + pre_code + after
                    modified = 11
                if modified:
                    # Shift every entity that starts after this one by the
                    # number of tag characters just inserted. Note that this
                    # mutates the dictionaries in the caller's list.
                    for other in entities:
                        if other['offset'] > entity['offset']:
                            other['offset'] += modified
        except (KeyError, TypeError, ValueError):
            # Skip malformed entities instead of aborting the conversion.
            pass
    return text
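As an alternative to re-adjusting offsets after every insertion, the output can be assembled in a single left-to-right pass over the original text, so the recorded offsets never go stale. The following is a minimal sketch of that different technique, not the answerer's method. The name `format_html_single_pass` and the `TAGS` table are hypothetical, and it assumes entities do not overlap, lengths are convertible with `int()`, and "printing" sits at offset 40 in the sample sentence. An unknown entity type raises `KeyError` here.

```python
TAGS = {
    "bold": ("<b>", "</b>"),
    "italic": ("<i>", "</i>"),
    "code": ("<code>", "</code>"),
    "pre": ("<pre>", "</pre>"),
}


def format_html_single_pass(text, entities):
    """Build the HTML left to right; assumes entities do not overlap."""
    pieces = []
    cursor = 0  # index in the original text up to which output has been emitted
    for entity in sorted(entities, key=lambda e: int(e["offset"])):
        start = int(entity["offset"])
        end = start + int(entity["length"])
        if entity["type"] == "text_link":
            open_tag, close_tag = '<a href="{}">'.format(entity["url"]), "</a>"
        else:
            open_tag, close_tag = TAGS[entity["type"]]
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(open_tag + text[start:end] + close_tag)
        cursor = end
    pieces.append(text[cursor:])  # remaining tail after the last entity
    return "".join(pieces)


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry"
entities = [
    {"entity_text": "Lorem Ipsum", "type": "bold", "offset": 0, "length": 11},
    {"entity_text": "dummy", "type": "italic", "offset": 22, "length": 5},
    {"entity_text": "printing", "type": "text_link", "offset": 40, "length": 8,
     "url": "google.com"},
]
print(format_html_single_pass(text, entities))
```

Because every slice is taken from the original text, no offset needs to be adjusted, and the work is one sort plus one pass over the entities.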