Counting text/words in a file with Python

Date: 2012-04-02 16:45:11

Tags: python

Chat.txt

ID674 25/01/1986 Thank you for choosing Optimus prime. Please wait for an Optimus prime Representative to respond. You are currently number 0 in the queue. You should be connected to an agent in approximately 0 minutes.. You are now chatting with 'Tom' 0
ID674 2gb Hi there! Welcome to Optus Web Chat 0/0/0 . How can I help you today?  1 
ID674 25-01-1986 I would like to change my bill plan from $0 with 0 expiry to something else $136. I find it very unuseful. Sam my phone no is 9838383821   2

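The text above is just a sample of a few lines from the file. My requirement is that every date, such as 25/01/1986 or 0/0/0, should be replaced with "DATE123".
Then :) should be replaced with "smileys123". Currency, i.e. $0 or $136, should be replaced with "Currency123".
'TOM' (usually the agent name in single quotes) should be replaced with AGENT123,
and so on. The output should be the number of times each string occurs, as shown: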

DATE123=2  smileys123=2 Currency123=6 AGENT123=5

This is my current approach; please advise me on it:

  class Replace:
     dateformat=DATE123
     smileys=smileys123
     currency=currency123

  count_dict={}

  function count_data(type,count):
     global count_dict
     if type in count_dict:
        count_dict[type]+=count
     else:
        count_dict = {type:count}


  f=open("chat.txt")
  while True:
     for line in f.readlines():
        print line,
        if ":)" in line:
           smileys = line.count(":)")
           count_data("smileys",smileys)
        elif "$number" in line :    #how to see whether it is currency or nor??
           currency=line.count("$number") //how can i do this
           count_data("currecny",currency)
        elif "1/2/3" in line :    #how to validate date format
           dateformat=line.count("dateformat") #how can i do this
           count_data("currency",currency)
        elif validate-agent-name in line:
           agent_name=line.count("agentname")  #How to do this get agentname in single quotes
           count_data("agent_name",agent_name)
     else:
        break
  f.close()

  for keys in count_dict:
     print keys,count_dict[keys]


  The following would be the output:

  DATE123=2  smileys123=2 Currency123=6 AGENT123=5

2 answers:

Answer 0 (score: 1)

  

Currency, i.e. $0 or $136, should be replaced with "Currency123", and 'TOM' (usually the agent name in single quotes) should be replaced with AGENT123, and more

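I think your Replace class should be replaced by a dictionary. That way you can do more while writing less code, because a dictionary comes with its own methods. A dictionary can keep track of what you need to replace, and it gives you more options for changing your replacement rules dynamically. Your code may also end up cleaner and easier to understand, and it will certainly be shorter, since you have more replacement words to add.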

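Edit: You may want to keep the list of replacement words in a text file and load it into a dictionary, rather than hard-coding the replacement words into a class, which I don't think is a good idea. Since you said there are many more of them, this makes more sense and means less (and cleaner!) code.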
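As a rough illustration of that idea, here is a minimal sketch that loads replacement rules into a dictionary. The file format (one `PLACEHOLDER<TAB>regex` pair per line), the `load_rules` helper, and the patterns themselves are assumptions made for this example and are not part of the question.

```python
import re

def load_rules(lines):
    """Build {placeholder: compiled regex} from 'PLACEHOLDER<TAB>regex' lines."""
    rules = {}
    for raw in lines:
        raw = raw.rstrip("\n")
        # Skip blank lines and comment lines.
        if not raw.strip() or raw.startswith("#"):
            continue
        placeholder, pattern = raw.split("\t", 1)
        rules[placeholder] = re.compile(pattern)
    return rules

# Stand-in for the contents of a hypothetical rules.txt file.
RULES_TEXT = """\
# placeholder<TAB>regular expression
DATE123\t\\d{1,4}[/-]\\d{1,4}[/-]\\d{1,4}
Currency123\t\\$\\d+(?:\\.\\d+)?
smileys123\t:-?\\)
"""

rules = load_rules(RULES_TEXT.splitlines())
for placeholder, pattern in rules.items():
    print(placeholder, pattern.pattern)
```

With a real file, the same function should work on the open file object, for example `with open("rules.txt") as f: rules = load_rules(f)`. Adding a new replacement then only means adding a line to the file.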

To write a comment, use

# Here is a comment

Your coding style is not the best. If you want to learn a better one, read http://www.python.org/dev/peps/pep-0008/#pet-peeves, or even the whole document.

Here are regular expressions that check whether a string is currency, a name such as 'Tom', or a date.

import re

while True:
    myString = input('Enter your string: ')

    # Raw strings keep the backslashes intact for the regex engine.
    # Every pattern is anchored with ^ and $, so the whole input must match.
    isMoney = re.match(r'^\$[0-9]+(,[0-9]{3})*(\.[0-9]{2})?$', myString)
    isName = re.match(r"^'+\w+'$", myString)
    # Expects MM/DD/YYYY, e.g. 01/15/1989.
    isDate = re.match(r'^[0-1][0-9]/[0-3][0-9]/[0-1][0-9]{3}$', myString)
    # Or try r'^[0-1]*?/[0-9]*/[0-9]*$' if you want 0/0/0 too...

    if isMoney:
        print('It is Money:', myString)
    elif isName:
        print('It is a Name:', myString)
    elif isDate:
        print('It is a Date:', myString)
    else:
        print('Not good.')

Sample output:

Enter your string: $100
It is Money: $100
Enter your string: 100
Not good.
Enter your string: 'Tom'
It is a Name: 'Tom'
Enter your string: Tom
Not good.
Enter your string: 01/15/1989
It is a Date: 01/15/1989
Enter your string: 01151989
Not good.

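You can replace your conditions with one of the isSomething variables, depending on what exactly needs to be done. I hope this helps. If you want to learn more about regular expressions, have a look at the "Regular Expression Primer" or Python's RE page.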
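Note that the anchored patterns above only accept input that consists of a single token. To count tokens inside a whole chat line, unanchored variants used with `findall` may be more convenient. The following is a minimal sketch; the pattern names `MONEY`, `NAME`, and `DATE` and the loosened date pattern are assumptions for illustration, adapted from the expressions above.

```python
import re

# Unanchored variants: they match anywhere inside a longer line.
MONEY = re.compile(r"\$[0-9]+(?:,[0-9]{3})*(?:\.[0-9]{2})?")
NAME = re.compile(r"'\w+'")
# Loosened on purpose so that both 25/01/1986 and 0/0/0 are accepted.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{1,4}\b")

line = "ID674 25/01/1986 You are now chatting with 'Tom' about $136."

# findall returns every non-overlapping match as a list of strings.
print("money:", MONEY.findall(line))
print("names:", NAME.findall(line))
print("dates:", DATE.findall(line))
```

The groups are written as non-capturing `(?:...)` so that `findall` returns the full matched text rather than the contents of the groups.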

Answer 1 (score: 1)

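This doesn't do all of the replacements you described, but here is a way to count the items in your data using regular expressions and a default dictionary. If you really want to replace the strings, I'm sure you can work that out: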

lines = [
   "ID674 25/01/1986 Thank you for :) choosing Optimus prime. Please wait for an Optimus prime Representative to respond. You are currently number 0 in the queue. You should be connected to an agent in approximately 0 minutes.. You are now chatting with 'Tom' 0",
  "ID674 2gb Hi there! Welcome to Optus Web Chat 0/0/0 . $5.45 How can I help you today?  1",
  "ID674 25-01-1986 I would like to change my bill plan from $0 with 0 expiry to something else $136. I find it very unuseful. Sam my phone no is 9838383821   2'"
]

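import re
from collections import defaultdict

# Precompiled patterns for each kind of token to count.
p_smiley = re.compile(r':\)|:-\)')
p_currency = re.compile(r'\$[\d.]+')
p_date = re.compile(r'\d{1,4}[/-]\d{1,4}[/-]\d{1,4}')

# Missing keys start at 0, so no "if key in dict" check is needed.
count_dict = defaultdict(int)

def count_data(kind, count):
    # Mutating the module-level dict needs no "global" declaration.
    count_dict[kind] += count

for line in lines:
    count_data('smiley', len(p_smiley.findall(line)))
    count_data('date', len(p_date.findall(line)))
    count_data('currency', len(p_currency.findall(line)))

# Report the totals.
for kind, count in count_dict.items():
    print(kind, count)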
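For the replacement part that the counting code above leaves open, one possible approach is `re.subn`, which returns both the substituted string and the number of substitutions made. The sketch below is only an illustration: the `PATTERNS` mapping, the `replace_and_count` helper, the shortened sample lines, and the agent-name pattern `'\w+'` (any single-quoted word) are assumptions, not something the question specifies.

```python
import re

# Placeholder -> pattern. Dates are replaced first so that later
# patterns never see the original date text.
PATTERNS = {
    "DATE123": re.compile(r"\d{1,4}[/-]\d{1,4}[/-]\d{1,4}"),
    "smileys123": re.compile(r":-?\)"),
    "Currency123": re.compile(r"\$\d+(?:\.\d+)?"),
    # Assumption: an agent name is any single-quoted word.
    "AGENT123": re.compile(r"'\w+'"),
}

def replace_and_count(lines):
    """Return (lines with placeholders substituted, {placeholder: count})."""
    counts = dict.fromkeys(PATTERNS, 0)
    replaced = []
    for line in lines:
        for placeholder, pattern in PATTERNS.items():
            # subn returns the new string and the number of replacements.
            line, n = pattern.subn(placeholder, line)
            counts[placeholder] += n
        replaced.append(line)
    return replaced, counts

# Shortened versions of the sample chat lines.
sample = [
    "ID674 25/01/1986 Thank you :) You are now chatting with 'Tom' 0",
    "ID674 2gb Welcome to Optus Web Chat 0/0/0 . $5.45 How can I help? 1",
    "ID674 25-01-1986 change my plan from $0 to something else $136. 2",
]

new_lines, counts = replace_and_count(sample)
print(" ".join(f"{name}={total}" for name, total in counts.items()))
```

Because the counts come straight from `subn`, the replaced text and the totals cannot drift apart, which is harder to guarantee when counting and replacing are done in separate passes.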