Counting how often each string in an array occurs in a file-sized string with Python

Asked: 2013-12-24 16:30:36

Tags: python regex arrays string

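I have looked at many answers that show how to find the occurrences of every word in a file, a large string, or even an array. That is not what I want to do, and my string does not come from a text file.

Given a large string (think file-sized), how do you count how often each element of an array occurs in it, including elements that contain spaces?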

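import string
from collections import Counter

def calculate_commonness(context, links):
    c = Counter()
    # Strip punctuation, then split on whitespace into single words
    content = context.translate(string.maketrans("",""), string.punctuation).split(None)

    for word in content:
        if word in links:
            c[word] += 1
    print c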

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."
links = ['November', 'Laundress', 'Passage', 'Father had']

# My output should look (something) like this:
# November = 4
# Laundress = 1
# Passage = 2
# Father had = 1

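This currently finds 'November', 'Laundress' and 'Passage', but not 'Father had'. I need to be able to find array elements that contain spaces. I know this happens because I split context on whitespace, which yields "Father" and "had" as separate words. So how do I split the context appropriately, or should I use regex findall?

Edit: Keeping context as one big string, I have: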

    for l in links:
        c[l] = context.lower().count(l)
    print c

which returns:

Counter({'Laundress': 0, 'November': 0, 'Father had': 0, 'Passage': 0})

3 answers:

Answer 0 (score: 3)

Have you tried:

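# str.lower() returns a new string; it does not change context in place
lowered = context.lower()
# Lower-case each link as well so the comparison is case-insensitive
counts = {word: lowered.count(word.lower())
          for word in links}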

Note: keep context as a single string (do not split it).
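
As a rough sketch of this approach, written for Python 3 rather than the Python 2 used elsewhere in this thread: `count_links` and its `ignore_case` flag are illustrative names, not part of the original answer, and the text is shortened. Because `str.lower()` returns a new string instead of changing `context`, a case-insensitive count has to lower-case both the text and each link.

```python
from collections import Counter

def count_links(context, links, ignore_case=False):
    """Count non-overlapping occurrences of each link in context."""
    haystack = context.lower() if ignore_case else context
    counts = Counter()
    for link in links:
        needle = link.lower() if ignore_case else link
        # str.count scans the whole string, so spaces inside a link are fine.
        counts[link] = haystack.count(needle)
    return counts

# Shortened version of the text from the question.
context = ("It was November. Although it was November November November "
           "Passage not yet late, the sky was dark when I turned into "
           "Laundress Passage. Father had finished for the day.")
links = ['November', 'Laundress', 'Passage', 'Father had']

counts = count_links(context, links)
for link in links:
    print(link, '=', counts[link])
```

Keep in mind that `str.count` matches substrings, so a link such as 'Passage' would also be counted inside a longer word like 'Passages'.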

Answer 1 (score: 1)

Try this...

>>> import re
>>> for word in links:
...     # re.escape keeps any regex metacharacters in a link literal
...     print word + '=' + str(len([w.start() for w in re.finditer(re.escape(word), context)]))
...
November=4
Laundress=1
Passage=2
Father had=1
>>> 

You can also ignore case:

for word in links:
    print word + '=' + str(len([w.start() for w in re.finditer(word, context, re.IGNORECASE)]))
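
A possible refinement, sketched here in Python 3 under the assumption that only whole-word matches should count: wrapping each escaped link in `\b` anchors stops 'Passage' from matching inside 'Passages'. The helper name `count_whole_words` and the sample text are made up for illustration.

```python
import re

def count_whole_words(context, links, ignore_case=False):
    """Count occurrences of each link that start and end on a word boundary."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for link in links:
        # re.escape treats the link as literal text; \b anchors both ends
        # at a word boundary.
        pattern = r'\b' + re.escape(link) + r'\b'
        counts[link] = len(re.findall(pattern, context, flags))
    return counts

sample = "Passage and Passages; father had, Father had."
print(count_whole_words(sample, ['Passage', 'Father had']))
print(count_whole_words(sample, ['Passage', 'Father had'], ignore_case=True))
```

The `\b` anchors only behave as intended when each link begins and ends with a letter, digit, or underscore; links that start or end with punctuation would need a different boundary test.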

Answer 2 (score: 0)

Here is an implementation using regex findall.

import re
links = ['November', 'Laundress', 'Passage', 'Father had']
# Create a big regex catching all the links 
# Something like: "(November)|(Laundress)|(Passage)|(Father had)"
regex = "|".join(map(lambda x: "(" + x + ")", links))

context = "It was November. Although it was November November November Passage not yet late, the sky was dark when I turned into Laundress Passage. Father had finished for the day, switched off the shop lights and closed the shutters; but so I would not come home to darkness he had left on the light over the stairs to the flat. Through the glass in the door it cast a foolscap rectangle of paleness onto the wet pavement, and it was while I was standing in that rectangle, about to turn my key in the door, that I first saw the letter. Another white rectangle, it was on the fifth step from the bottom, where I couldn\'t miss it."

result = re.findall(regex, context)
# Result here is:
# [('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('November', '', '', ''), ('', '', 'Passage', ''), ('', 'Laundress', '', ''), ('', '', 'Passage', ''), ('', '', '', 'Father had')]

# Now we count regex matches
counts = [0] * len(links)
for x in result:
    for i in range(len(links)):
        if x[i] != "":
            counts[i] += 1
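
The same single-pass idea can be sketched more compactly in Python 3 by counting the matched text directly instead of inspecting capture groups. This is only an illustration: `count_in_one_pass` is a hypothetical name, the text is shortened, and sorting the links longest-first is an added assumption so that a longer link wins over a shorter one that is its prefix.

```python
import re
from collections import Counter

def count_in_one_pass(context, links):
    """Scan context once and tally which link each match corresponds to."""
    # Longest first, so 'Father had' is tried before a hypothetical 'Father'.
    ordered = sorted(links, key=len, reverse=True)
    # re.escape keeps regex metacharacters in a link literal.
    pattern = re.compile('|'.join(re.escape(link) for link in ordered))
    # Start every link at zero so links that never match still appear.
    counts = Counter({link: 0 for link in links})
    for match in pattern.finditer(context):
        counts[match.group(0)] += 1
    return counts

# Shortened version of the text from the question.
context = ("It was November. Although it was November November November "
           "Passage not yet late, the sky was dark when I turned into "
           "Laundress Passage. Father had finished for the day.")
links = ['November', 'Laundress', 'Passage', 'Father had']

result = count_in_one_pass(context, links)
for link in links:
    print(link, '=', result[link])
```

Since an alternation consumes each stretch of text only once, overlapping links are not counted separately: with both 'Father' and 'Father had' in the list, a 'Father had' in the text would add to the longer link only.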