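I want to delete all characters (mostly letters) inside the tags, but keep the words that appear in a word list.
For example,
I want to change
<html>VERY RARE CAR WITH NEW TIRES WHITE</html>
to:
<html>CAR WHITE</html>
That is, the two words CAR and WHITE are kept because they are in the exception list.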
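Answer 0 (score: 0)
I'm not sure this is exactly what you had in mind. I'll show how to use two lists, one of exception words and one of HTML tags, to remove any text you don't want: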
# Tags that must be kept unmodified
html_tags = ['<a>', '</a>', '<html>', '</html>']
# Exception words: the only words to keep
word_list = ['WORD1', 'CAR', 'WORD2', 'WHITE', 'WORD3', 'WORD4']
# Input string to filter
string = '<html>VERY RARE CAR WITH NEW TIRES WHITE</html>'
# Result string, built from the kept tags and words
final_string = ''
# Put '#' before each '<' and after each '>' so the text can be split around the tags
# (this assumes the text itself contains no '#')
string = string.replace('<', '#<')
string = string.replace('>', '>#')
string_list = string.split('#')
# Now we have:
# string_list = ['', '<html>', 'VERY RARE CAR WITH NEW TIRES WHITE', '</html>', '']
for piece in string_list:
    if piece in html_tags:  # a known tag: copy it unchanged
        final_string += piece
    else:  # otherwise it is text, here 'VERY RARE CAR WITH NEW TIRES WHITE'
        # Split on whitespace and keep only the words found in word_list
        kept = [word for word in piece.split() if word in word_list]
        final_string += ' '.join(kept)
# final_string is now '<html>CAR WHITE</html>'
# The code is a little more complex than this example needs so that it also
# works with other tags (as long as they are listed in html_tags) and larger HTML.
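As an alternative to inserting '#' markers, the same idea can be sketched with `re.split`, which avoids problems when the text itself contains '#' and does not need a fixed list of tags. This is only a minimal sketch: the function name `keep_allowed_words` is made up for illustration, it treats anything of the form `<...>` as a tag, and it does not preserve whitespace next to tags. For real-world HTML a proper parser would be more robust.

```python
import re


def keep_allowed_words(html, allowed):
    """Keep tags unchanged; outside tags, keep only words found in `allowed`."""
    allowed = set(allowed)
    # The capturing group makes re.split return the tags
    # as well as the text between them.
    parts = re.split(r'(<[^>]*>)', html)
    out = []
    for part in parts:
        if part.startswith('<') and part.endswith('>'):
            out.append(part)  # a tag: copy it as-is
        else:
            # plain text: filter word by word, rejoin with single spaces
            out.append(' '.join(w for w in part.split() if w in allowed))
    return ''.join(out)


result = keep_allowed_words(
    '<html>VERY RARE CAR WITH NEW TIRES WHITE</html>',
    ['CAR', 'WHITE'],
)
print(result)  # <html>CAR WHITE</html>
```

Matching is case-sensitive here, just like the list lookup above, so the words in the list must be written the same way as in the text.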
Hope it helps!