Identifying dictionary words in a string

Time: 2016-03-04 16:57:00

Tags: python regex string search

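I am writing a program that evaluates the strength of a password. One function in my program takes the entered password and compares it against a large list of words and passwords.

This code is a binary search that checks whether the entered password is in the password list.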

import io

Found = False
with io.open('PasswordList.txt', encoding='latin-1') as myfile:
    # Strip line endings once; the file must already be sorted.
    data = [line.rstrip() for line in myfile]
low = 0
high = len(data) - 1
while low <= high and not Found:
    mid = (low + high) // 2
    if data[mid] == Password:
        Found = True
    elif Password < data[mid]:
        high = mid - 1
    else:
        low = mid + 1
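
The standard library's bisect module performs the same search, which avoids maintaining low and high by hand. A minimal sketch, assuming the file is sorted in Python's default string order once line endings are stripped; load_sorted_lines and contains are hypothetical helper names, not part of the original program:

```python
import bisect
import io


def load_sorted_lines(path, encoding='latin-1'):
    """Read a word list, one entry per line, without trailing whitespace."""
    with io.open(path, encoding=encoding) as myfile:
        return [line.rstrip() for line in myfile]


def contains(sorted_words, target):
    """Binary search: True if target is present in the sorted list."""
    i = bisect.bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target
```

If the list fits in memory, building set(words) once gives the same membership test without requiring a sorted file.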

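This code removes all digits from the password, converts the remaining letters to lowercase, and checks the result against the list again. "Password123" becomes "password", and "password" is in the list.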

SimplePassword = ''.join([i for i in Password if not i.isdigit()])
SimplePassword = SimplePassword.lower()

PartiallyFound = False
if not Found:
    with io.open('final.txt', encoding='latin-1') as myfile:
        # Strip line endings once; the file must already be sorted.
        data = [line.rstrip() for line in myfile]
    low = 0
    high = len(data) - 1
    while low <= high and not PartiallyFound:
        mid = (low + high) // 2
        if data[mid] == SimplePassword:
            PartiallyFound = True
        elif SimplePassword < data[mid]:
            high = mid - 1
        else:
            low = mid + 1

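I would like to take this further by writing some code that can recognise names or words inside a string. For example, "john" is in the list and "smith" is in the list, but an entered password of "JohnSmith123" would fly under the radar.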

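How can I split it into separate words? One approach I am considering is to append the letters between capital letters to an array and then check each element of that array individually.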
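
The capital-letter idea can be sketched with a single regular expression. This is only an illustration: split_candidates is a hypothetical name, and the sketch assumes ASCII letters, with digits and symbols treated as separators.

```python
import re


def split_candidates(password):
    """Split a password into lowercase alphabetic chunks.

    A new chunk starts at each capital letter; digits and symbols
    act as separators and are dropped.
    """
    chunks = re.findall(r'[A-Z][a-z]*|[a-z]+', password)
    return [chunk.lower() for chunk in chunks]


print(split_candidates('JohnSmith123'))  # ['john', 'smith']
```

Each chunk can then be looked up in the word list. The limitation is that an all-lowercase password such as "johnsmith" comes back as a single chunk.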

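But there must be a better way. Is there some way to check whether the entered password can be constructed from variations of words inside a large wordlist?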
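
Checking whether a string can be built entirely from list words is the word-segmentation problem, which does not depend on capitalisation. A minimal dynamic-programming sketch, assuming the wordlist has been loaded into a set of lowercase words; segment and letters_only are hypothetical names:

```python
def segment(text, words):
    """Return one way to split text into words from the set, or None.

    best[i] holds a segmentation of text[:i], or None if none exists.
    """
    if not words:
        return None
    max_len = max(len(w) for w in words)
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest word in the list.
        for start in range(max(0, end - max_len), end):
            if best[start] is not None and text[start:end] in words:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


def letters_only(password):
    """Lowercase the password and drop everything but letters."""
    return ''.join(ch for ch in password.lower() if ch.isalpha())


words = {'john', 'smith', 'pass', 'word', 'password'}
print(segment(letters_only('JohnSmith123'), words))  # ['john', 'smith']
```

The work is bounded by the password length times the longest word length, so the size of the wordlist only affects the cost of building the set.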

2 answers:

Answer 0: (score: 2)

You could try

badness = 0
for word in wordlist:
    if word in passwordString and len(word) > badness:
        badness = len(word)

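This way, the string "password" would be hit by every list word it contains, for example:

  • password

but only the longest match, "password", would actually be used.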
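
The loop above can be wrapped in a function that also returns the matching word. A sketch under a few assumptions that are not in the answer: the wordlist is lowercase, the comparison is case-insensitive, and words shorter than min_len are ignored so that one- or two-letter entries do not count. The name longest_word_in is hypothetical.

```python
def longest_word_in(password, wordlist, min_len=3):
    """Return the longest list word contained in the password, or ''."""
    haystack = password.lower()
    longest = ''
    for word in wordlist:
        # Check lengths first; the substring search is the costly part.
        if len(word) >= min_len and len(word) > len(longest) and word in haystack:
            longest = word
    return longest


print(longest_word_in('JohnSmith123', ['john', 'smith', 'a', 'it']))  # smith
```

This runs one substring search per list word, so the cost per password grows linearly with the size of the wordlist.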

Answer 1: (score: 1)

"from variations of words inside a large wordlist"

There is a tool you can use to construct a regex trie from your
word list.
You paste all the variations into a text box, and it generates
a full-blown regex trie.

This is probably the fastest lookup there is.
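
For comparison, a trie can also be built directly in Python without the tool. The sketch below is not the tool's output and makes no claim about matching its speed; it uses a nested-dict trie rather than a generated regex, and build_trie, find_words and END are hypothetical names.

```python
END = ''  # end-of-word marker; never collides with a single character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_words(text, trie):
    """Return (start, word) for every list word occurring in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no list word continues with this character
            if END in node:
                hits.append((start, text[start:end + 1]))
    return hits


trie = build_trie(['john', 'smith', 'it', 'mit'])
print(find_words('johnsmith', trie))
# [(0, 'john'), (4, 'smith'), (5, 'mit'), (6, 'it')]
```

Each starting position is followed only as far as some list word still matches, so the cost depends on the password length and the longest word rather than on the number of words in the list.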

The tool is available in the trial version.

App runs on Windows only.

Location from main menu is Tools->Ternary Tree

Benchmark


Regex1:
Completed iterations:   1  /  1     ( x 1000 )
Matches found per iteration:   174939
Elapsed Time:    600.30 s,   600296.36 ms,   600296365 µs

Target Sample: All 174,939 words that the regex represents (in random order)

Sample Analysis:

    174,939  words matched / iteration
  x   1,000  iterations
------------------------------
 174,939,000 total words matched
  /      600 total seconds
------------------------------
     291,565 words matched / second         <<<
  /    1,000 milliseconds / second
------------------------------
         292 words matched / millisecond    <<<