How can I remove duplicate words from a string in Python?

Asked: 2011-10-17 13:08:36

Tags: python string duplicates

Take the following example:

string1 = "calvin klein design dress calvin klein"

How can I remove the second occurrences of "calvin" and "klein"?

The result should be

string2 = "calvin klein design dress"

Only the second occurrence of each word should be removed, and the order of the words must not change!

14 answers:

Answer 0: (score: 29)

string1 = "calvin klein design dress calvin klein"
words = string1.split()
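print(" ".join(sorted(set(words), key=words.index)))

This sorts the set of all (unique) words in the string by each word's index in the original list of words. Because list.index returns the position of a word's first occurrence, the words come out in their original order.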
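To make the sort key concrete, here is a small walk-through, offered as a sketch rather than a different method. The names first_seen and ordered are introduced only for illustration. It also shows why the idiom can get slow on long inputs: every words.index call scans the list from the start.

```python
string1 = "calvin klein design dress calvin klein"
words = string1.split()

# list.index returns the position of the FIRST occurrence of each word,
# so every unique word maps to the place where it originally appeared.
first_seen = {w: words.index(w) for w in set(words)}

# Sorting the unique words by that position restores the original order.
ordered = sorted(first_seen, key=first_seen.get)
string2 = " ".join(ordered)
print(string2)
```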

Answer 1: (score: 16)

def unique_list(l):
    # Keep the first occurrence of each item, preserving order
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

a = "calvin klein design dress calvin klein"
a = ' '.join(unique_list(a.split()))

Answer 2: (score: 8)

In Python 2.7+, you can use collections.OrderedDict for this:

from collections import OrderedDict
s = "calvin klein design dress calvin klein"
print(' '.join(OrderedDict((w, w) for w in s.split()).keys()))
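On Python 3.7 and later, plain dicts are guaranteed to keep insertion order, so the same idea works without OrderedDict. The following is a minimal sketch under that version assumption; the variable name deduped is only illustrative.

```python
s = "calvin klein design dress calvin klein"

# dict.fromkeys keeps one key per distinct word, in first-seen order
# (insertion order is guaranteed for dicts from Python 3.7 on).
deduped = " ".join(dict.fromkeys(s.split()))
print(deduped)
```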

Answer 3: (score: 7)

itertools recipes

Cut and paste:
from itertools import filterfalse  # named ifilterfalse in Python 2

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

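I really wish they would go ahead and turn these recipes into a module soon. I would much rather write from itertools_recipes import unique_everseen than cut and paste every time I need it.

Use it like this: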

def unique_words(string, ignore_case=False):
    key = None
    if ignore_case:
        key = str.lower
    return " ".join(unique_everseen(string.split(), key=key))

string2 = unique_words(string1)
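To illustrate what the ignore_case flag changes, here is a self-contained sketch. It uses a simplified unique_everseen (without the filterfalse fast path) so the block runs on its own; the sample string mixed is made up for the demonstration.

```python
def unique_everseen(iterable, key=None):
    # Simplified variant of the recipe: yield each element whose key
    # has not been seen before, preserving the original order.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

def unique_words(string, ignore_case=False):
    key = str.lower if ignore_case else None
    return " ".join(unique_everseen(string.split(), key=key))

mixed = "Calvin klein design dress calvin Klein"
case_sensitive = unique_words(mixed)                       # nothing is removed
case_insensitive = unique_words(mixed, ignore_case=True)   # first spelling wins
print(case_sensitive)
print(case_insensitive)
```

With ignore_case=True the first spelling encountered is the one that is kept.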

Answer 4: (score: 5)

string = 'calvin klein design dress calvin klein'

def uniquify(string):
    output = []
    seen = set()
    for word in string.split():
        if word not in seen:
            output.append(word)
            seen.add(word)
    return ' '.join(output)

print(uniquify(string))

Answer 5: (score: 2)

You can use a set to keep track of the words that have already been processed.

words = set()
result = ''
for word in string1.split():
    if word not in words:
        result = result + word + ' '
        words.add(word)
print(result.rstrip())  # rstrip() drops the trailing space left by the loop

Answer 6: (score: 0)

Several answers come very close, but none ended up quite where I did:

def uniques( your_string ):    
    seen = set()
    return ' '.join( seen.add(i) or i for i in your_string.split() if i not in seen )

Of course, if you want it cleaner or faster, we can refactor a bit:

def uniques( your_string ):    
    words = your_string.split()

    seen = set()
    seen_add = seen.add

    def add(x):
        seen_add(x)  
        return x

    return ' '.join( add(i) for i in words if i not in seen )

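I think the second version is about as performant as you can get for such a small amount of code. (With more code, all the work could be done in a single scan of the input string, but for most workloads this should be sufficient.)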
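One possible reading of the "single scan" remark is to avoid building the full word list with split() and instead walk the string once with a lazy regex iterator. The sketch below is only that interpretation; the function name uniques_single_scan is hypothetical.

```python
import re

def uniques_single_scan(your_string):
    seen = set()
    kept = []
    # re.finditer yields whitespace-delimited words lazily, so the input
    # is walked once and no intermediate list of all words is created.
    for match in re.finditer(r"\S+", your_string):
        word = match.group()
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)

single_scan_result = uniques_single_scan("calvin klein design dress calvin klein")
print(single_scan_result)
```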

Answer 7: (score: 0)

11 and 2 work perfectly:

    s = "the sky is blue very blue"
    s = s.lower()
    slist = s.split()
    print(" ".join(sorted(set(slist), key=slist.index)))

Answer 8: (score: 0)

Question: remove duplicates in a string

<!doctype html>
<html ⚡="" lang="en">
<head>
  <meta charset="utf-8">
  <title>Commerce</title>
  <link rel="canonical" href="https://www.ampstart.com/templates/e-commerce/landing.amp">
  <meta name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1">
  <script async src="https://cdn.ampproject.org/v0.js"></script>
  <script custom-element="amp-bind" src="https://cdn.ampproject.org/v0/amp-bind-0.1.js" async></script>
  <style amp-boilerplate="">body{-webkit-animation:-amp-start 8s steps(1,end) 0s 1 normal both;-moz-animation:-amp-start 8s steps(1,end) 0s 1 normal both;-ms-animation:-amp-start 8s steps(1,end) 0s 1 normal both;animation:-amp-start 8s steps(1,end) 0s 1 normal both}@-webkit-keyframes -amp-start{from{visibility:hidden}to{visibility:visible}}@-moz-keyframes -amp-start{from{visibility:hidden}to{visibility:visible}}@-ms-keyframes -amp-start{from{visibility:hidden}to{visibility:visible}}@-o-keyframes -amp-start{from{visibility:hidden}to{visibility:visible}}@keyframes -amp-start{from{visibility:hidden}to{visibility:visible}}</style><noscript><style amp-boilerplate="">body{-webkit-animation:none;-moz-animation:none;-ms-animation:none;animation:none}</style></noscript>
<style amp-custom="">
    div, input {font-size:120%;margin-top:.5rem}
.ampstart-input {max-width: 100%;width: 100%;font-size: 1rem;line-height: 1.5rem}
.ampstart-input [disabled], .ampstart-input [disabled]+label {opacity: .5}
.ampstart-input [disabled]:focus {outline: 0}
.ampstart-input>input, .ampstart-input>select, .ampstart-input>textarea {width: 100%;margin-top: 1rem;line-height: 1.5rem;border: 0;border-radius: 0;border-bottom: 1px solid #4a4a4a;background: none;color: #000;outline: 0}
.ampstart-input>label {color: #000;pointer-events: none;text-align: left;font-size: 1.125rem;line-height: 1rem;opacity: 1;-webkit-animation: .2s;animation: .2s;-webkit-animation-timing-function: cubic-bezier(.4, 0, .2, 1);animation-timing-function: cubic-bezier(.4, 0, .2, 1);-webkit-animation-fill-mode: forwards;animation-fill-mode: forwards}
.ampstart-input>input:focus, .ampstart-input>select:focus, .ampstart-input>textarea:focus {outline: 0}
.ampstart-input>input:focus::-webkit-input-placeholder, .ampstart-input>select:focus::-webkit-input-placeholder, .ampstart-input>textarea:focus::-webkit-input-placeholder {color:transparent}
.ampstart-input>input:focus::-moz-placeholder, .ampstart-input>select:focus::-moz-placeholder, .ampstart-input>textarea:focus::-moz-placeholder {color:transparent}
.ampstart-input>input:focus:-ms-input-placeholder, .ampstart-input>select:focus:-ms-input-placeholder, .ampstart-input>textarea:focus:-ms-input-placeholder {color:transparent}
 </style>
</head>
<body>
<form method=post target="_top" action-xhr="https://example.com/thankyou.amp.html" custom-validation-reporting="show-all-on-submit" >
<h3>Billing Information</h3>
<div>
<label for="firstname" aria-hidden="true">First name</label>
  <input 
         type="text" 
         value="" 
         name="firstname" 
         id="firstname" 
         placeholder="Billing First Name" 
         autocomplete="given-name" 
         required 
         on="input-debounced:AMP.setState({dfn: event.value})"
    />
</div>
<div>
    <label for="lastname" aria-hidden="true">Last name</label> 
  <input 
         type="text" 
         value="" 
         name="lastname" 
         id="lastname" 
         placeholder="Billing Last name" 
         autocomplete="family-name" 
         required on="input-debounced:AMP.setState({dln: event.value})"
    />
</div>
 <div class="relative mt1 p0 mb3 bold center">
    <input type="checkbox" value="1" 
    name="billNEdest" 
    id="billNEdest" 
    class="borderlt" 
    on="change:AMP.setState({seb:event.checked})" 
    />
    <label for="billNEdest">Check to Ship to a Different Address</label>
</div>
<div hidden [hidden]="seb == true ? false : true ">
<h3>Destination Information</h3>
<div>
    <label for="destfirstname" aria-hidden="true">First name</label> 
  <input 
         type="text" 
         value="Destiny" 
         name="destfirstname" 
         id="destfirstname" 
         placeholder="Destination First name" 
         autocomplete="given-name" 
         required
         [value]="thisdfn != null ? thisdfn : dfn != null ? dfn : ''"    
         on="input-debounced:AMP.setState({thisdfn: event.value})"
    />
</div>
<div>
    <label for="destlastname" aria-hidden="true">Last name</label>
  <input 
         type="text" 
         value="" 
         name="destlastname" 
         id="destlastname" 
         placeholder="Destination Last name" 
         autocomplete="family-name" 
         required 
         [value]="thisdln != null ? thisdln : dln!=null ? dln : ''" 
         on="input-debounced:AMP.setState({thisdln: event.value})" 
    />
</div>
  </div>
<input type="submit" value="Submit" class="ampstart-btn">
</form>
</body></html>

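Answer 9: (score: 0)

You can use the following code to remove duplicate or repeated words from a text file or a string:

import nltk
from collections import Counter
from nltk.tokenize import word_tokenize

# all_words (an iterable of lines) and lemmatize_sentence() are not shown in
# this answer; they are assumed to be defined elsewhere.
new_data = []
for lines in all_words:
    line = ''.join(lines.lower())
    new_data1 = ' '.join(lemmatize_sentence(line))
    new_data2 = word_tokenize(new_data1)
    new_data3 = nltk.pos_tag(new_data2)

    # below code is for removal of repeated words
    # each (word, tag) tuple is fused into a single string first
    for i in range(0, len(new_data3)):
        new_data3[i] = "".join(new_data3[i])
    UniqW = Counter(new_data3)
    new_data5 = " ".join(UniqW.keys())
    print(new_data5)

    new_data.append(new_data5)

print(new_data)

P.S. - Adjust the indentation as required. Hope this helps!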
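The Counter-based step can be tried on its own without nltk. The sketch below assumes Python 3.7+, where Counter (a dict subclass) keeps keys in first-seen order; the names counts, deduped and repeated are illustrative only.

```python
from collections import Counter

words = "calvin klein design dress calvin klein".split()

# Counter records how often each word occurs; on Python 3.7+ its keys
# stay in the order in which the words were first encountered.
counts = Counter(words)
deduped = " ".join(counts)

# As a side benefit, the counts show which words were repeated.
repeated = [w for w, n in counts.items() if n > 1]
print(deduped)
print(repeated)
```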

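Answer 10: (score: 0)

You can do this simply by taking the set associated with the string. A set is a mathematical object that, by definition, contains no duplicate elements, so you only need to join the words in the set back into a string. Note that a set is unordered, so the original word order is not guaranteed to survive: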

def remove_duplicate_words(string):
    return ' '.join(set(string.split()))

Answer 11: (score: 0)

Without using the split function (this will help in interviews)

def unique_words2(a):
    words = []
    spaces = ' '
    length = len(a)
    i = 0
    while i < length:
        if a[i] not in spaces:
            word_start = i
            while i < length and a[i] not in spaces:
                i += 1
            words.append(a[word_start:i])
        i += 1
    words_stack = []
    for val in words:  #
        if val not in words_stack:  # We can replace these three lines with this one -> [words_stack.append(val) for val in words if val not in words_stack]
            words_stack.append(val)  #
    print(' '.join(words_stack))  # or return, your choice


unique_words2('calvin klein design dress calvin klein') 
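A variation on the same no-split idea, offered as a sketch: it treats any whitespace character (tabs, newlines) as a separator via str.isspace, uses a set for faster membership checks, and returns the result instead of printing it. The function name unique_words_no_split is hypothetical.

```python
def unique_words_no_split(text):
    words = []      # unique words in first-seen order
    seen = set()    # fast membership test
    current = []    # characters of the word being read
    # The appended space acts as a sentinel that flushes the last word.
    for ch in text + " ":
        if ch.isspace():
            if current:
                word = "".join(current)
                if word not in seen:
                    seen.add(word)
                    words.append(word)
                current = []
        else:
            current.append(ch)
    return " ".join(words)

no_split_result = unique_words_no_split("calvin klein design dress calvin klein")
print(no_split_result)
```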

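Answer 12: (score: 0)

Use a numpy function. It is best to give the import an alias (such as np):

import numpy as np

Then you can remove duplicates from an array like this:

no_duplicates_array = np.unique(your_array)

For your case, if you want a string as the result, you can join the unique words back together with ' '.join(...). Note that np.unique returns its values sorted, so the original word order is not preserved.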

Answer 13: (score: -1)

string2 = ' '.join(set(string1.split()))

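Explanation

.split() - a method that splits a string into a list (called without arguments, it splits on whitespace)
set() - an unordered collection type that excludes duplicates
'separator'.join(list) - joins the list of strings passed as the argument into one string, with the separator between them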
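The three steps above can be checked with a short sketch. Because set iteration order is arbitrary, the checks below only verify what this approach actually guarantees (the same words, each exactly once) and deliberately say nothing about word order, which the question requires to be kept. The names same_words and no_repeats are illustrative.

```python
string1 = "calvin klein design dress calvin klein"
string2 = ' '.join(set(string1.split()))

# Guaranteed: the result contains exactly the distinct words of the input...
same_words = set(string2.split()) == set(string1.split())
# ...and each of them appears only once.
no_repeats = len(string2.split()) == len(set(string2.split()))

# NOT guaranteed: that the words appear in their original order.
print(string2, same_words, no_repeats)
```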