Question

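I have a large list of domain names (around six thousand), and I would like to see which words trend the highest, as a rough overview of our portfolio.

The problem I have is that the list is formatted as domain names, for example: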

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

(+5996 more)

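Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words and then run a word count.

For my sanity I would prefer to script this.

I know (very) little Python 2.7, but I am open to any recommendations on how to approach this, and code examples would really help. I have been told that a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement that in Python.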

Answer 1

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!

def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

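This iterator function first yields the string it was called with, if that string is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to every result of calling itself on the second part (of which there may be none, as in ["example", "cart", ...]).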
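As a small illustration of the behaviour just described, here is the same generator run against a tiny word set. The toy_words set is made up for this sketch and is not part of the answer; a real run would use a full dictionary.

```python
def substrings_in_set(s, words):
    # Same generator as above, repeated so this snippet runs on its own.
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

# Hypothetical mini-dictionary, chosen only for this illustration.
toy_words = set(["example", "car", "cart", "trading",
                 "pens", "ions", "pensions"])

# "cart" is a known word, but "rading" is not, so that branch yields nothing.
car_splits = list(substrings_in_set("examplecartrading", toy_words))

# "pensions" matches both as a whole word and as "pens" + "ions".
pension_splits = list(substrings_in_set("examplepensions", toy_words))
```

Here car_splits holds the single split example / car / trading, while pension_splits holds two splits, which is why the counting code has to guard against double counting.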

Then we build the English dictionary:

# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
with open("/usr/share/dict/words") as f:
    words = set(x.strip().lower() for x in f)

# The above english dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Using a set to hold the English dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
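A minimal sketch of those set operations, using a handful of made-up entries rather than a real word list:

```python
# Toy stand-in for the dictionary set; the entries are illustrative only.
words = set(["a", "b", "ex", "example", "deals"])

# A string is iterable per character, so set("b") == set(["b"]).
words -= set("b")                      # drop a single-letter entry
words -= set(("ex", "rs"))             # removing an absent item ("rs") is harmless
words |= set(("4", "slartibartfast"))  # add extra tokens we want recognized

# Membership tests on a set are hash lookups (constant time on average).
has_example = "example" in words
has_ex = "ex" in words
```

After these steps "example" is still recognized and "ex" is not, which is the effect the answer relies on.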

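Using the all function together with a generator expression improves efficiency, since all returns at the first False. Note that the function as shown above does not call all.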
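A small sketch of the short-circuit behaviour mentioned here: all stops consuming a generator at the first False value. The is_word helper and the toy word set are invented for this illustration.

```python
checked = []

def is_word(part, words):
    # Hypothetical helper: records each lookup so we can see where all() stops.
    checked.append(part)
    return part in words

toy_words = set(["example", "deals"])
candidate = ["example", "xq", "deals"]

# The generator is consumed lazily, so "deals" is never looked up.
ok = all(is_word(p, toy_words) for p in candidate)
```

With a list comprehension instead of a generator expression, all three parts would be checked before all saw any of the results.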

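Some substrings may be valid words both as a whole and when split, such as "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the code example above. In others, like "pensions" / "pens" + "ions", it may be unavoidable. When this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the words found for each domain name in a set (sets ignore duplicates) and counting them only once all of them have been found.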
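To make the double-counting problem concrete, the sketch below compares a naive count with the set-based count for one domain. The two splits are written out by hand here and are the ones "examplepensions" would be expected to produce.

```python
# Two splits of "examplepensions", hard-coded for this illustration.
splits = [["example", "pensions"], ["example", "pens", "ions"]]

# Naive: count every word of every split; "example" ends up counted twice.
naive = {}
for split in splits:
    for word in split:
        naive[word] = naive.get(word, 0) + 1

# Set-based: collect the words per domain first, then count each one once.
found = set()
for split in splits:
    found |= set(split)
deduped = {}
for word in found:
    deduped[word] = deduped.get(word, 0) + 1
```

The naive dictionary reports "example" twice for a single domain, while the deduplicated one reports every word exactly once.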

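EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses caused by capitalization. Also added a list to keep track of domain names for which no combination of words matched.

NECROMANCY EDIT: Changed the substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!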

Answer 2

with open('/usr/share/dict/words') as f:
  # A set makes the "in words" checks below fast (a list is scanned linearly).
  words = set(w.strip() for w in f)

def guess_split(word):
  result = []
  for n in xrange(len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result


from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f.readlines():
    for word in line.strip().split('.'):
      if len(word) > 3:
        # junks the com , org, stuff
        for x in guess_split(word):
          word_counts[x] += 1

for spam in word_counts.items():
  print '{word}: {count}'.format(word=spam[0],count=spam[1])

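This is a brute-force method that only tries to split each domain into 2 English words. If a domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever about it. Fortunately, I guess you'll only need 3 or 4 splits at most.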
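One possible way to extend the idea to more than two parts is sketched below. This is not the answer's code: guess_splits, its max_parts parameter, and the toy word set are assumptions made for the illustration. Unlike guess_split above, it also accepts a name that is a single dictionary word.

```python
def guess_splits(word, words, max_parts=4):
    # Return the first split of `word` into at most `max_parts` dictionary
    # words, or [] if there is none. Hypothetical extension of guess_split.
    if word in words:
        return [word]
    if max_parts <= 1:
        return []
    for n in range(1, len(word)):
        if word[:n] in words:
            # Only recurse when the prefix is a word, which prunes most branches.
            rest = guess_splits(word[n:], words, max_parts - 1)
            if rest:
                return [word[:n]] + rest
    return []

# Made-up mini-dictionary for the example.
toy_words = set(["example", "car", "trading", "summer", "offers"])

three = guess_splits("examplecartrading", toy_words)
capped = guess_splits("examplecartrading", toy_words, max_parts=2)
```

With the default limit the name splits into three words. With max_parts=2 no split is found, which mirrors what the two-word version does with this domain.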

Output:

deals: 1
example: 2
pensions: 1

Answer 3

Assuming you only have a few thousand standard domains, you should be able to do all of this in memory.

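from collections import Counter

domainfile = "domains.txt"  # placeholder: one domain name per line
dictionaryfile = "/usr/share/dict/words"  # placeholder: one word per line

def all_sub_strings(s):
    # Yield every contiguous substring of s: "abc" -> a, ab, abc, b, bc, c.
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            yield s[i:j]

with open(dictionaryfile) as f:
    dictionary = set(line.strip().lower() for line in f)
found = []
with open(domainfile) as f:
    for domain in f:
        name = domain.strip().lower()  # drop the trailing newline
        for substring in all_sub_strings(name):
            if substring in dictionary:
                found.append(substring)
c = Counter(found)  # this is what you want
print c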
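To see what this approach counts, here is a sketch of the substring enumeration on one domain name. The definition of all_sub_strings (every contiguous, non-empty substring) and the toy dictionary are assumptions made for this illustration.

```python
from collections import Counter

def all_sub_strings(s):
    # Assumed definition: every contiguous, non-empty substring of s.
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            yield s[i:j]

# Made-up mini-dictionary for the example.
toy_dictionary = set(["example", "car", "cart", "art", "trading", "ample"])

hits = Counter(sub for sub in all_sub_strings("examplecartrading")
               if sub in toy_dictionary)
```

Every dictionary word that appears anywhere in the name is counted, including overlapping ones such as "car", "cart" and "art", or "ample" inside "example". This is simpler than splitting the name into a sequence of whole words, but it will likely inflate the counts of short words.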

Breaking a string into individual words in Python
