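I'm starting to learn Python, and I'm trying to write a program that imports a text file, counts the total number of words, counts the words in specific paragraphs (each spoken by a participant, marked by 'P1', 'P2', etc.), excludes those markers (i.e. 'P1' etc.) from my word count, and prints the paragraphs separately.
Thanks to @James Hurford, I got this code: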
words = None
with open('data.txt') as f:
    words = f.read().split()
total_words = len(words)
print('Total words: {}'.format(total_words))

in_para = False
para_type = None
paragraph = list()
for word in words:
    # A word containing a participant marker starts a new paragraph.
    if ('P1' in word or
            'P2' in word or
            'P3' in word):
        if not in_para:
            in_para = True
            para_type = word
        else:
            print('Words in paragraph {}: {}'.format(para_type, len(paragraph)))
            print(' '.join(paragraph))
            del paragraph[:]
            para_type = word
    else:
        paragraph.append(word)
else:
    # The for/else branch runs once the loop ends: flush the last paragraph.
    if in_para:
        print('Words in last paragraph {}: {}'.format(para_type, len(paragraph)))
        print(' '.join(paragraph))
    else:
        print('No words')
My text file looks like this:
P1: Bla bla bla. P2: Bla bla bla bla. P1: Bla bla.
P3: Bla.
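The next part I need to do is sum up the words for each participant. I can only print them, but I don't know how to return/reuse them.
Besides summing up all the words each participant said, I also need a new variable holding the word count for each participant, which I can manipulate later, for example: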
P1all = sum of words in paragraph
Also, is there a way to count "you're" or "it's" etc. as two words?
Any ideas how to solve this?
Answer 0 (score: 5)
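"I need a new variable holding the word count for each participant, which I can manipulate later"
No, you need a Counter (Python 2.7+, otherwise use defaultdict(int)) mapping persons to word counts.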
from collections import Counter
#from collections import defaultdict

words_per_person = Counter()
#words_per_person = defaultdict(int)

# inputfile is assumed to be an open file with one "person: text" entry per line
for ln in inputfile:
    person, text = ln.split(':', 1)
    words_per_person[person] += len(text.split())
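Now words_per_person['P1'] contains the number of words spoken by P1, assuming text.split() is a good enough tokenizer for your purposes. (Linguists disagree about the definition of "word", so you're always going to get an approximation.)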
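As a minimal, self-contained sketch of the idea above, the snippet below counts words per participant. It assumes each input line starts with a participant label followed by a colon, as in the sample data. The `tokenize` helper, the `count_words_per_person` function, and the sample `lines` are illustrative names, not part of the original answer. The regex treats an apostrophe as a separator, so a contraction such as "you're" counts as two words; that is only one possible approximation of a "word".

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: runs of letters/digits are words, so the
    # apostrophe in "you're" splits it into "you" and "re" (two words).
    return re.findall(r"[A-Za-z0-9]+", text)

def count_words_per_person(lines):
    words_per_person = Counter()
    for ln in lines:
        if ':' not in ln:
            continue  # skip lines without a participant label
        person, text = ln.split(':', 1)
        words_per_person[person.strip()] += len(tokenize(text))
    return words_per_person

# Made-up sample input, one participant turn per line.
lines = [
    "P1: Bla bla bla.",
    "P2: Bla bla bla bla.",
    "P1: You're right, it's fine.",
    "P3: Bla.",
]
counts = count_words_per_person(lines)
print(counts['P1'])
```

Swapping `tokenize(text)` for `text.split()` gives the whitespace-based count from the answer, where each contraction counts as a single word.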
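Answer 1 (score: 1)
You can do this with two variables. One keeps track of who is speaking, and the other holds the paragraphs for each speaker. To store the paragraphs and associate each one with the person it belongs to, use a dict with the person as the key and a list of the paragraphs that person said as the value for that key.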
para_dict = dict()
para_type = None

for word in words:
    if ('P1' in word or
            'P2' in word or
            'P3' in word):
        #extract the part we want leaving off the ':'
        para_type = word[:2]
        #create a dict with a list of lists
        #to contain each paragraph the person uses
        if para_type not in para_dict:
            para_dict[para_type] = list()
        para_dict[para_type].append(list())
    else:
        #Append the word to the last list in the list of lists
        para_dict[para_type][-1].append(word)
From here you can sum up the number of words spoken like this:
for person, para_list in para_dict.items():
    counts_list = list()
    for para in para_list:
        counts_list.append(len(para))
    print('{} spoke {} words'.format(person, sum(counts_list)))
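Since the question asks how to reuse the counts rather than only print them, one option is to wrap the two steps above in functions that return their results. This is a sketch under the assumption that markers such as 'P1:' appear as separate whitespace-delimited tokens; the function names `group_paragraphs` and `count_words` and the `markers` parameter are illustrative, not from the original answer.

```python
def group_paragraphs(words, markers=('P1', 'P2', 'P3')):
    """Map each participant to a list of paragraphs (lists of words)."""
    para_dict = {}
    para_type = None
    for word in words:
        if any(marker in word for marker in markers):
            # Keep only the two-character label, dropping the trailing ':'
            para_type = word[:2]
            para_dict.setdefault(para_type, []).append([])
        elif para_type is not None:
            # Words before the first marker are ignored.
            para_dict[para_type][-1].append(word)
    return para_dict

def count_words(para_dict):
    """Return a dict mapping each participant to a total word count."""
    return {person: sum(len(para) for para in paras)
            for person, paras in para_dict.items()}

words = "P1: Bla bla bla. P2: Bla bla bla bla. P1: Bla bla. P3: Bla.".split()
totals = count_words(group_paragraphs(words))
p1_all = totals['P1']  # a reusable value instead of printed text
print(p1_all)
```

Because `count_words` returns a plain dict, the per-participant totals can be stored, compared, or passed to other functions later on.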
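Answer 2 (score: 1)
Congratulations on beginning your adventure with Python! Not everything in this post may make sense right now, but bookmark it and come back to it if it seems useful later. Eventually you should try to move from scripting to software engineering, and here are a few ideas for you!
With great power comes great responsibility, and as a Python developer you need to be more disciplined than in other languages, since the language itself won't hold your hand and enforce "good" design.
I find it helps to start with a top-down design.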
def main():
    text = get_text()
    p_text = process_text(text)
    catalogue = process_catalogue(p_text)
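BOOM! You just wrote your whole program; now you only need to go back and fill in the blanks! Done this way, it seems a lot less intimidating. Personally, I don't consider myself smart enough to solve very big problems, but I'm a pro at solving small ones. So let's tackle one thing at a time. I'll start with 'process_text'.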
def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text
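I'm not really sure yet what those things mean, but I know that text problems tend to follow a pattern called "map/reduce", which means you perform an operation on something and then you clean it up and combine the results, so I put in some placeholder functions. I can go back and add more if necessary.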
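As a rough illustration of that map/reduce idea (a generic sketch, not the code this answer ends up with), the snippet below maps each line to a word count and then reduces the counts to a single total. The sample `lines` are made up for the example.

```python
from functools import reduce

lines = ["Bla bla bla.", "Bla bla bla bla.", "Bla bla."]

# "Map": apply the same operation to every item.
counts = [len(line.split()) for line in lines]

# "Reduce": combine the mapped results into one value.
total = reduce(lambda acc, n: acc + n, counts, 0)

print(counts)
print(total)
```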
Now let's write 'process_catalogue'. I could have called it "process_dict", but that sounded lame.
def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers
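Cool. Not too bad. You might approach it differently than I did, but I thought it made sense to aggregate the items, count the words per paragraph, and then count all of the words.
So, at this point I would probably make one or two little 'lib' (library) modules to back-fill the remaining functions. To be able to run this without worrying about imports, I'm going to put it all in one .py file, but eventually you'll learn how to break these up so it looks nicer. So, let's do this.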
# ------------------ #
# == process_text == #
# ------------------ #

def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            # split only on the first ':' so colons in the dialogue survive
            cur_speaker, dialogue = line.split(':', 1)
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res


def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res
aaand a little helper function
# ------------------- #
# == aux functions == #
# ------------------- #

to_clean = string.whitespace + string.punctuation

def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res
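So it may not be obvious, but this library is designed as a data processing pipeline. There are several ways to process data: one is pipeline processing and another is batch processing. Let's take a look at batch processing.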
# ----------------------- #
# == process_catalogue == #
# ----------------------- #

speaker_stats = 'stats'

def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers


word_count = 'word_count'

def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'

def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers
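To make the pipeline/batch distinction concrete: the process_text functions earlier are generators that hand items along one at a time, while the process_catalogue functions above build complete dictionaries before returning. The small sketch below shows the same contrast in isolation; the names `strip_pipeline` and `strip_batch` and the sample `raw` list are hypothetical and not part of the original code.

```python
def strip_pipeline(lines):
    # Pipeline style: a generator yields each cleaned line lazily.
    for line in lines:
        yield line.strip()

def strip_batch(lines):
    # Batch style: the whole result is built before returning.
    return [line.strip() for line in lines]

raw = ["  P1: Bla bla  ", " P2: Bla "]

lazy = strip_pipeline(raw)      # nothing has been processed yet
first = next(lazy)              # processes only the first line
everything = strip_batch(raw)   # processes all lines at once

print(first)
print(everything)
```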
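All of these nested dictionaries are getting a little complicated. In actual production code I would replace them with some more readable classes (along with adding tests and docstrings!!), but I don't want to make this any more confusing than it already is! Okay, for your convenience, below is the whole thing put together.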
import pprint
import re
import string
from collections import Counter

# a line that starts with "NAME:" opens a new paragraph for that speaker
p = re.compile(r'(\w+?):')


def get_text_line_items(text):
    for line in text.split('\n'):
        yield line


def bundle_dialogue_items(lines):
    cur_speaker = None
    paragraphs = Counter()
    for line in lines:
        if re.match(p, line):
            # split only on the first ':' so colons in the dialogue survive
            cur_speaker, dialogue = line.split(':', 1)
            paragraphs[cur_speaker] += 1
        else:
            dialogue = line

        res = cur_speaker, dialogue, paragraphs[cur_speaker]
        yield res


def filter_dialogue_items(lines):
    for name, dialogue, paragraph in lines:
        if dialogue:
            res = name, dialogue, paragraph
            yield res


to_clean = string.whitespace + string.punctuation


def clean_word(word):
    res = ''.join(c for c in word if c not in to_clean)
    return res


def clean_dialogue_items(flines):
    for name, dialogue, paragraph in flines:
        s_dialogue = dialogue.strip().split()
        c_dialogue = [clean_word(w) for w in s_dialogue]
        res = name, c_dialogue, paragraph
        yield res


speaker_stats = 'stats'


def make_catalogue(names_with_dialogue):
    speakers = {}
    for name, dialogue, paragraph in names_with_dialogue:
        speaker = speakers.setdefault(name, {})
        stats = speaker.setdefault(speaker_stats, {})
        stats.setdefault(paragraph, []).extend(dialogue)
    return speakers


def clean_dict(speakers):
    for speaker in speakers:
        stats = speakers[speaker][speaker_stats]
        for paragraph in stats:
            stats[paragraph] = [''.join(c for c in word if c not in to_clean)
                                for word in stats[paragraph]]
    return speakers


word_count = 'word_count'


def sum_words_per_paragraph_items(speakers):
    for speaker in speakers:
        word_stats = speakers[speaker][speaker_stats]
        speakers[speaker][word_count] = Counter()
        for paragraph in word_stats:
            speakers[speaker][word_count][paragraph] += len(word_stats[paragraph])
    return speakers


total = 'total'


def total_word_count(speakers):
    for speaker in speakers:
        wc = speakers[speaker][word_count]
        speakers[speaker][total] = 0
        for c in wc:
            speakers[speaker][total] += wc[c]
    return speakers


def get_text():
    text = '''BOB: blah blah blah blah
blah hello goodbye etc.
JERRY:.............................................
...............
BOB:blah blah blah
blah blah blah
blah.
BOB: boopy doopy doop
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.'''
    text = get_text_line_items(text)
    return text


def process_catalogue(c_text):
    speakers = make_catalogue(c_text)
    s_speakers = sum_words_per_paragraph_items(speakers)
    t_speakers = total_word_count(s_speakers)
    return t_speakers


def process_text(text):
    b_text = bundle_dialogue_items(text)
    f_text = filter_dialogue_items(b_text)
    c_text = clean_dialogue_items(f_text)
    return c_text


def main():
    text = get_text()
    c_text = process_text(text)
    t_speakers = process_catalogue(c_text)

    # take a look at your hard work!
    pprint.pprint(t_speakers)


if __name__ == '__main__':
    main()
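So this script is almost certainly overkill for this application, but the point is to see what (questionably) readable, maintainable, modular Python code looks like.
Pretty sure the output looks something like: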
{'BOB': {'stats': {1: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'hello',
                       'goodbye',
                       'etc'],
                   2: ['blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah',
                       'blah'],
                   3: ['boopy', 'doopy', 'doop']},
         'total': 18,
         'word_count': Counter({1: 8, 2: 7, 3: 3})},
 'JERRY': {'stats': {1: ['', '']}, 'total': 2, 'word_count': Counter({1: 2})},
 'P1': {'stats': {1: ['Bla', 'bla', 'bla'], 2: ['Bla', 'bla']},
        'total': 5,
        'word_count': Counter({1: 3, 2: 2})},
 'P2': {'stats': {1: ['Bla', 'bla', 'bla', 'bla']},
        'total': 4,
        'word_count': Counter({1: 4})},
 'P3': {'stats': {1: ['Bla']}, 'total': 1, 'word_count': Counter({1: 1})}}
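The answer mentions that production code would replace the nested dictionaries with more readable classes. One possible shape for that is sketched below. This is only an illustration: the `Speaker` class and its method names are hypothetical and do not appear in the original code.

```python
class Speaker(object):
    """Holds the paragraphs spoken by one participant."""

    def __init__(self, name):
        self.name = name
        self.paragraphs = {}  # paragraph number -> list of words

    def add_words(self, paragraph, words):
        # Append words to the given paragraph, creating it if needed.
        self.paragraphs.setdefault(paragraph, []).extend(words)

    def word_count(self, paragraph):
        # Number of words in one paragraph (0 if it does not exist).
        return len(self.paragraphs.get(paragraph, []))

    def total(self):
        # Number of words across all paragraphs.
        return sum(len(words) for words in self.paragraphs.values())


p1 = Speaker('P1')
p1.add_words(1, ['Bla', 'bla', 'bla'])
p1.add_words(2, ['Bla', 'bla'])
print(p1.word_count(1))
print(p1.total())
```

With a class like this, `p1.total()` takes the place of a lookup such as `speakers['P1']['total']`, and the per-paragraph counts no longer need to be stored separately from the words.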