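Python: Remove all characters inside a tag, except for a list of exceptions

Time: 2014-01-14 08:25:44

Tags: python

I want to remove all characters (mostly letters) inside a tag, but keep the words that appear in a word list.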

For example, I want to change

<html>VERY RARE CAR WITH NEW TIRES WHITE</html>

to:

<html>CAR WHITE</html>

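Here the two words CAR and WHITE are the ones in the exception list.

1 answer:

Answer 0 (score: 0)

I'm not sure this is exactly what you have in mind. I'll show how to use two lists, one of exception words and one of HTML tags, to remove any unwanted text: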

# Tags that must be kept unmodified
html_tags = ['<a>','</a>','<html>','</html>']

# Exception list: the words to keep
word_list = ['WORD1','CAR','WORD2','WHITE','WORD3','WORD4']
# The string to process
string = '<html>VERY RARE CAR WITH NEW TIRES WHITE</html>'

# Result string, built by concatenating the kept words and tags
final_string = ''

# Put a '#' before every '<' and after every '>' so the text can be split at the tags
string = string.replace('<','#<')
string = string.replace('>','>#')

string_list = string.split('#')  # The tags are now separate, unmodified items (<html>, <a>, ...)

# At this point:
# string_list = ['', '<html>', 'VERY RARE CAR WITH NEW TIRES WHITE', '</html>', '']

for segment in string_list:  # Walk over every tag and text segment
    if segment in html_tags:  # A known tag is copied to final_string unchanged
        final_string += segment
    else:  # Otherwise it is text, in this case 'VERY RARE CAR WITH NEW TIRES WHITE'
        # Keep only the words found in word_list, joined by single spaces
        final_string += ' '.join(w for w in segment.split() if w in word_list)

# The result of this code is final_string == '<html>CAR WHITE</html>'
# The code is a little more elaborate than strictly needed so that it also
# works with different tags and bigger HTML code. Note that tags missing from
# html_tags are dropped, and text that itself contains '#' is split incorrectly.
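
The '#' marker trick above fails when the text itself contains a '#'. As a rough alternative sketch, the same idea can be written with `re.split` and a capturing group, which returns the tags alongside the text without needing a marker character. The function name `keep_allowed_words` is made up for illustration. This version also assumes two things the code above does not: every tag is kept (not only those in a list), and words are matched case-insensitively.

```python
import re


def keep_allowed_words(html, allowed_words):
    """Drop every word of text between tags unless it is in allowed_words."""
    # Assumption: matching is case-insensitive, so 'car' also keeps 'CAR'.
    allowed = {word.upper() for word in allowed_words}
    # The capturing group makes re.split return the tags as well as the text.
    parts = re.split(r'(<[^>]*>)', html)
    result = []
    for part in parts:
        if part.startswith('<') and part.endswith('>'):
            result.append(part)  # A tag: copy it through unchanged
        else:
            kept = [w for w in part.split() if w.upper() in allowed]
            result.append(' '.join(kept))  # Kept words, single-spaced
    return ''.join(result)


print(keep_allowed_words('<html>VERY RARE CAR WITH NEW TIRES WHITE</html>',
                         ['car', 'white']))
```

This is still plain string processing, not real HTML parsing, so comments, scripts, or attributes containing '>' would need a proper parser.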

Hope it helps!