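python: find first string in string

Time: 2016-03-04 17:30:55

Tags: python string match

Given a string and a list of substrings, I want the first position at which any of the substrings occurs in the string. If no substring occurs, return 0. I want to ignore case.

Is there something more pythonic than: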

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']
res = len(given)
for t in targets:
    i = given.lower().find(t)
    if i > -1 and i < res:
        res = i

if res == len(given):
    result = 0
else:
    result = res

The code works, but seems inefficient.

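5 answers:

Answer 0 (score: 2)

I would not return 0, since 0 can be a valid starting index. Use -1, None, or some other value that cannot be an index. You can simply use try/except and return the index: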

def get_ind(s, targ):
    s = s.lower()
    for t in targ:
        try:
            return s.index(t.lower())
        except ValueError:
            pass
    return None  # -1, False ...

If you want to ignore the case of the input string as well, set s = s.lower() before the loop.
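To illustrate why 0 is an ambiguous "not found" value, here is a minimal sketch. The helper name first_index_or_zero is hypothetical and exists only to mimic the convention used in the question:

```python
def first_index_or_zero(s, targ):
    """Mimic the question's convention: return 0 when nothing matches."""
    s = s.lower()
    hits = [s.find(t.lower()) for t in targ]
    hits = [i for i in hits if i > -1]  # drop the misses
    return min(hits) if hits else 0

# A real match at index 0 and a miss cannot be told apart:
print(first_index_or_zero("foobar", ["foo"]))  # match at index 0
print(first_index_or_zero("quux", ["foo"]))    # no match, also 0
```

Returning None or -1 for the miss removes that ambiguity.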

You could also do something like:

def get_ind_next(s, targ):
    s = s.lower()
    return next((s.index(t) for t in map(str.lower, targ) if t in s), None)

But that does, at worst, two lookups for each substring, as opposed to one with try/except. It does at least also short-circuit on the first match.

If you actually want the minimum over all targets, change to:

def get_ind(s, targ):
    s = s.lower()
    mn = float("inf")
    for t in targ:
        try:
            i = s.index(t.lower())
            if i < mn:
                mn = i
        except ValueError:
            pass
    return mn  # still float("inf") if nothing matched

def get_ind_next(s, targ):
    s = s.lower()
    return min((s.index(t) for t in map(str.lower, targ) if t in s), default=None)

default=None only works in Python >= 3.4, so if you are using Python 2 you will have to change the logic slightly.
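For interpreters where min() has no default argument, one possible adjustment is to catch the ValueError that min() raises on an empty sequence. This is only a sketch, and the name get_ind_next_compat is hypothetical:

```python
def get_ind_next_compat(s, targ):
    """Smallest index of any target in s (case-insensitive), or None."""
    s = s.lower()
    try:
        # `t in s` guards s.index, so the only ValueError left is min()
        # being handed an empty sequence.
        return min(s.index(t) for t in map(str.lower, targ) if t in s)
    except ValueError:
        return None

print(get_ind_next_compat("Iamfoothegreat", ["bar", "grea", "other", "foo"]))  # 3
print(get_ind_next_compat("nothing here", ["foo", "bar"]))                     # None
```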

Timings python3:

In [29]: s = "hello world" * 5000
In [30]: s += "grea" + s
In [25]: %%timeit
   ....: targ = [re.escape(x) for x in targets]
   ....: pattern = r"%(pattern)s" % {'pattern' : "|".join(targ)}
   ....: firstMatch = next(re.finditer(pattern, s, re.IGNORECASE),None)
   ....: if firstMatch:
   ....:     pass
   ....: 
100 loops, best of 3: 5.11 ms per loop
In [18]: timeit get_ind_next(s, targets)
1000 loops, best of 3: 691 µs per loop

In [19]: timeit get_ind(s, targets)
1000 loops, best of 3: 627 µs per loop

In [20]: timeit min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 1.03 ms per loop

In [21]: s = 'Iamfoothegreat'
In [22]: targets = ['bar', 'grea', 'other','foo']
In [23]: get_ind_next(s, targets) == get_ind(s, targets) == min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[24]: True

Python2:

In [13]: s = "hello world" * 5000
In [14]: s += "grea" + s

In [15]: targets = ['foo', 'bar', 'grea', 'other']
In [16]: timeit get_ind(s, targets)
1000 loops, best of 3: 322 µs per loop

In [17]: timeit min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 710 µs per loop

In [18]: get_ind(s, targets) == min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[18]: True

You can also combine the first with min:

def get_ind(s, targ):
    s,mn = s.lower(), None
    for t in targ:
        try:
            mn = s.index(t.lower())
            yield mn
        except ValueError:
            pass
    yield mn

Taking min() over the yielded values does the same job; it is just a bit nicer and maybe slightly faster.
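A short usage sketch for the generator version, assuming it is meant to be consumed with min(). The generator is repeated so the snippet runs on its own, and the wrapper name first_pos is hypothetical:

```python
def get_ind(s, targ):
    s, mn = s.lower(), None
    for t in targ:
        try:
            mn = s.index(t.lower())
            yield mn
        except ValueError:
            pass
    yield mn  # None if nothing matched, otherwise a repeat of the last index

def first_pos(s, targ):
    # min() sees either only integers or a single None, never a mix
    return min(get_ind(s, targ))

print(first_pos("Iamfoothegreat", ["bar", "grea", "other", "foo"]))  # 3
print(first_pos("nothing here", ["foo", "bar"]))                     # None
```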

Answer 1 (score: 2)

Use regex

Another approach is to use a regex, since I expected Python's regex implementation to be very fast. My regex version is:

import re

given = 'IamFoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

targets = [re.escape(x) for x in targets]  # escape regex metacharacters
pattern = "|".join(targets)                # match any one of the targets
firstMatch = next(re.finditer(pattern, given, re.IGNORECASE), None)
if firstMatch:
    print(firstMatch.start())
    print(firstMatch.group())

Output is

3
Foo

If nothing is found, nothing is printed: firstMatch is None in that case, so checking for a miss is straightforward.
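The same idea can be wrapped in a reusable function. This is a minimal sketch under a few assumptions: the name first_match is hypothetical, it uses re.search (which returns the leftmost match) instead of re.finditer, and it returns None on a miss:

```python
import re

def first_match(given, targets):
    """Return (start, matched_text) for the leftmost match of any target, or None."""
    if not targets:
        return None  # an empty alternation would match the empty string at 0
    pattern = re.compile("|".join(re.escape(t) for t in targets), re.IGNORECASE)
    m = pattern.search(given)
    return (m.start(), m.group()) if m else None

print(first_match("IamFoothegreat", ["foo", "bar", "grea", "other"]))  # (3, 'Foo')
print(first_match("nothing here", ["foo", "bar"]))                     # None
```

The matched text is returned as it appears in the input, so its case can differ from the target's.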

A more conventional approach (not really Pythonic)

It gives you the matched string, too:

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

dct = {'pos': -1, 'string': None}
given = given.lower()  # lowercase once, outside the loop

for t in targets:
    i = given.find(t)
    # keep the smallest index seen so far (-1 means nothing found yet)
    if i > -1 and (i < dct['pos'] or dct['pos'] == -1):
        dct['pos'] = i
        dct['string'] = t

print(dct)

Output is:

{'pos': 3, 'string': 'foo'}

If element is not found:

{'pos': -1, 'string': None}

Performance comparison of both

With this string and these targets:

given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']

1000 loops with timeit:

regex approach: 4.08629107475 sec for 1000
normal approach: 1.80048894882 sec for 1000

10 loops, now with a much bigger target list (targets * 1000):

normal approach: 4.06895017624 sec for 10
regex approach: 34.8153910637 sec for 10
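A rough way to reproduce this kind of comparison with the standard timeit module is sketched below. The wrapper names regex_approach and normal_approach are hypothetical, the loop count is reduced to keep the run short, and absolute numbers will differ by machine and Python version:

```python
import re
import timeit

def regex_approach(given, targets):
    pattern = "|".join(re.escape(t) for t in targets)
    m = re.search(pattern, given, re.IGNORECASE)
    return m.start() if m else -1

def normal_approach(given, targets):
    given = given.lower()
    best = -1
    for t in targets:
        i = given.find(t)
        if i > -1 and (best == -1 or i < best):
            best = i
    return best

given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']

for fn in (regex_approach, normal_approach):
    # time 20 calls of each approach on the same input
    seconds = timeit.timeit(lambda: fn(given, targets), number=20)
    print("%s: %.4f sec for 20 loops" % (fn.__name__, seconds))
```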

Answer 2 (score: 1)

You can use the following:

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])

Demo 1

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

3

Demo 2

given = 'this is a different string'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

0

I also think the following solution is quite readable:

given = 'the string'
targets = ('foo', 'bar', 'grea', 'other')

given = given.lower()

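for i in range(len(given)):
    # str.startswith accepts a tuple of prefixes and a start offset
    if given.startswith(targets, i):
        print(i)
        break
else:
    # the loop finished without a break: no target occurs anywhere
    print(-1)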
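The startswith scan above can also be packaged as a function. This is a sketch only: the name first_index_startswith is hypothetical, and it lowercases the targets as well, which the loop above leaves to the caller:

```python
def first_index_startswith(given, targets):
    """Scan positions left to right; return the first where any target starts, else -1."""
    given = given.lower()
    prefixes = tuple(t.lower() for t in targets)  # startswith requires a tuple
    for i in range(len(given)):
        if given.startswith(prefixes, i):
            return i
    return -1

print(first_index_startswith("Iamfoothegreat", ["foo", "bar", "grea", "other"]))  # 3
print(first_index_startswith("the string", ["foo", "bar", "grea", "other"]))      # -1
```

Because the scan stops at the first matching position, it never inspects the rest of the string once a match is found.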

Answer 3 (score: 1)

Your code is fairly good, but you can make it a little more efficient by moving the .lower conversion out of the loop: there's no need to repeat it for each target substring. The code can be condensed a little using list comprehensions, although that doesn't necessarily make it faster. I use a nested list comp to avoid calling given.find(t) twice for each t.

I've wrapped my code in a function for easier testing.

def min_match(given, targets):
    given = given.lower()
    a = [i for i in [given.find(t) for t in targets] if i > -1]
    return min(a) if a else None

targets = ['foo', 'bar', 'grea', 'othe']

data = (
    'Iamfoothegreat', 
    'IAMFOOTHEGREAT', 
    'Iamfothgrease',
    'Iamfothgret',
)

for given in data:
    print(given, min_match(given, targets))    

output

Iamfoothegreat 3
IAMFOOTHEGREAT 3
Iamfothgrease 7
Iamfothgret None

Answer 4 (score: 0)

Try this:

def getFirst(given,targets):
    try:
        return min([i for x in targets for i in [given.find(x)] if not i == -1])
    except ValueError:
        return 0