TypeError when running a large function in Python

Asked: 2013-12-04 20:29:03

Tags: python csv split

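I'm trying to run a large function that processes a large text file, splits it into speakers and their speeches, and then breaks each speech down further into its component paragraphs. Here is the code: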

import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    hearing = hearing.split("RESPONSE TO WRITTEN")
    str (hearing)
    speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    #print speakers
    position = []
    for speaker in speakers:
        x = hearing.find(speakers)
        position.append(x)
        def find_speaker(hearing, speakers):
            position = []
            for speaker in speakers:
                x = hearing.find(speaker)
                if x==-1:
                    x += 1000000
                position.append(x)
                first = min(position)
                name = speakers[position.index(min(position))]
            name_length = len(name)
            chunk = [name, hearing[0:first], hearing[first+name_length:]]
            #return chunk
            chunks = []
            #print hearing
            names = []
            while len(hearing)>10:
                chunk_try = find_speaker(hearing, speakers)
                hearing = chunk_try[2]
                chunks.append(chunk_try[1])
                names.append(chunk_try[0].strip())
                print len(hearing)#0
                chunks.append(hearing)
                chunks = chunks[1:]
                print len(names) 
                print len(chunks)
                data = zip(names, chunks)
                with open(output_filename1,'wb') as f:
                    w=csv.writer(f)
                    w.writerow(['Speaker','Speech'])
                    for row in data:
                        w.writerow(row)
                        paragraphs = str(chunks)
                        print (paragraphs)
                        Paragraphs = paragraphs.split("\\n")
                        data1 = zip(Paragraphs)
                        with open(output_filename2,'wb') as f:
                            w=csv.writer(f)
                            w.writerow(['Paragraphs'])
                            for row in data1:
                                w.writerow(row)
                                return True 
driver("C:/Users/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')

However, when I run the driver function, I get the following error:

Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

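I think this means that hearing isn't being passed as a string, but when I tried str(hearing) it didn't resolve the error. I'm also confused about why the traceback references three separate lines of code. Any advice would be appreciated - I've been stuck on this for a long time!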

3 answers:

Answer 0: (score: 2)

The structure of your code is a bit confusing, but I'll try to explain what is happening.

When you reach this line:

speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)

hearing is a list, because two lines earlier you split it with str.split:

hearing = hearing.split("RESPONSE TO WRITTEN")

So you get the error because re.findall does not accept a list as its second argument. It expects a string or buffer instead.
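To see the failure in isolation, here is a minimal sketch. The sample text and the shortened pattern are made up for illustration; only the split marker comes from the question. The exact wording of the TypeError message differs between Python versions, so the sketch only checks that one is raised.

```python
import re

# Made-up sample text standing in for the hearing transcript.
text = "\n    Mr. Smith. Thank you.\n    Ms. Jones. I agree."

# str.split always returns a list, even when the separator never occurs.
parts = text.split("RESPONSE TO WRITTEN")

try:
    re.findall(r"M[rs]\. [A-Z][a-z]+\.", parts)    # passing the list fails
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing a string (here the original text) works as expected.
matches = re.findall(r"M[rs]\. [A-Z][a-z]+\.", text)

print(type(parts).__name__, "->", error_message)
print(matches)
```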


So that is the problem. The solution is to make the second argument to re.findall a string. Where that string comes from depends on what you want to do.

Judging from this line:

str (hearing)

you want to turn the list hearing into its own string representation. If so, you need to reassign hearing like this:

hearing = str(hearing)
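One thing worth checking after that change: str() of a list gives the list's repr, in which each newline appears as the two characters backslash and n. The sketch below uses made-up sample text and a shortened pattern to show that a pattern written for a real newline then stops matching, while a pattern for a literal backslash followed by n does match.

```python
import re

# Made-up sample; only the split marker is taken from the question.
hearing = "Opening.\n    Mr. Smith. Thank you.".split("RESPONSE TO WRITTEN")
as_text = str(hearing)    # the repr of the list, brackets and quotes included

# In a regex, \n means a real newline; the repr no longer contains one.
real_newline = re.findall(r"\n    Mr\. [A-Z][a-z]+\.", as_text)

# \\n in a regex means a literal backslash followed by the letter n.
escaped_newline = re.findall(r"\\n    Mr\. [A-Z][a-z]+\.", as_text)

print(as_text)
print(real_newline, escaped_newline)
```

If the goal is to search the actual text rather than its repr, passing one element of the list to re.findall avoids this issue.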

Answer 1: (score: 1)

You've put everything into one monolithic block of code, which makes it harder to test or modify. I've rewritten it as follows:

from bs4 import BeautifulSoup
from collections import namedtuple
import csv
from itertools import tee, izip
import os, os.path
import re

DIR       = r'C:\Users\Documents\Congressional Hearings\NHTF Project\Test Set'
HARD_WRAP = re.compile(r'\n(?!    )')
SPEAKERS  = re.compile(r'^    (Mr.|Mrs.|Congressman|Congresswoman|Chairman|Chairwoman) ([a-zA-Z \-]{2,40})\.', re.MULTILINE)
NAME      = lambda m: '{0} {1}'.format(*m.groups())
Speaker   = namedtuple('Speaker', ['name', 'name_start', 'name_end'])

def load_hearing_response(fname, split_on='    Present:'):
    with open(fname, 'rU') as inf:
        html = inf.read()
    txt  = BeautifulSoup(html).get_text()
    return txt.rsplit(split_on, 1)[-1]     # return everything after last occurrence of split_on

def un_hard_wrap(txt, reg=HARD_WRAP):
    return reg.sub('', txt)

def pairwise(iterable):
    a,b = tee(iterable)
    next(b, None)
    return izip(a, b)

def get_speeches(txt):
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None))    # tail sentinel for pairwise processing
    return [(this.name, txt[this.name_end:nxt.name_start]) for this,nxt in pairwise(speakers)]

def write_csv(fname, data, header=None):
    with open(fname, 'wb') as outf:
        out_csv = csv.writer(outf)
        if header is not None:
            out_csv.writerow(header)
        out_csv.writerows(data)

def main():
    # get text of Congressional hearing responses
    txt = load_hearing_response(os.path.join(DIR, 'CHRG-107hhrg70750.htm'))
    txt = un_hard_wrap(txt)
    # break into speeches
    speeches = get_speeches(txt)
    # write (speaker, speech) pairs to a .csv file
    write_csv(os.path.join(DIR, 'CHRG-107hhrg70750.csv'), speeches, ['Speaker', 'Speech'])
    # write paragraphs of speeches to a .csv file
    paragraphs = ([para.strip()] for speaker,speech in speeches for para in speech.split('\n') if para.strip())
    write_csv(os.path.join(DIR, 'Paragraphs.csv'), paragraphs, ['Paragraphs'])

if __name__=="__main__":
    main()

Note that this is untested, since I don't have the original data file.
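The core idea in get_speeches() is to record where each speaker label starts and ends, add a tail sentinel, and slice the text between consecutive labels. Here is a small Python 3 sketch of that idea on made-up text. It uses zip instead of izip, a reduced SPEAKER pattern, and a .strip() on each speech; those three choices are my assumptions, not part of the code above.

```python
import re
from itertools import tee

# Reduced, illustrative pattern: only "Mr." and "Ms." labels.
SPEAKER = re.compile(r'^    (Mr\.|Ms\.) ([A-Za-z \-]{2,40})\.', re.MULTILINE)

def pairwise(iterable):
    """Yield (item, next_item) for consecutive items."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def get_speeches(txt):
    # (name, start of label, end of label) for every speaker label found
    marks = [(m.group(1) + ' ' + m.group(2), m.start(), m.end())
             for m in SPEAKER.finditer(txt)]
    marks.append(('', len(txt), None))    # tail sentinel closes the last speech
    # each speech runs from the end of its label to the start of the next one
    return [(name, txt[end:next_start].strip())
            for (name, _, end), (_, next_start, _) in pairwise(marks)]

sample = "    Mr. Smith. Thank you all.\n    Ms. Jones. I agree.\n"
speeches = get_speeches(sample)
print(speeches)
```

Without the sentinel, the last speaker would have no "next" label to pair with, and their speech would be dropped.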

Edit: after being pointed to a sample data file, I made the following changes:

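  1. The text is hard-wrapped; I added an un_hard_wrap() function to convert it back to unwrapped text (each paragraph followed by '\n').

  2. I had made mistakes in get_speeches(), using sp.pos instead of sp.start() and sp.end_pos instead of sp.end(). These are now fixed.

  3. I adjusted the SPEAKERS regex to eliminate some false positives (i.e. one speaker said 'Mr. Speaker, I take offense...' and was picked up as 'Mr. Speaker'). This should now be resolved, unless a paragraph begins with such a sentence that is under 40 characters long. If you know the longest speaker surname, you can tighten the SPEAKERS regex accordingly, i.e. {2,40} could become {2,26} or whatever maximum length is appropriate.

  4. I added ... if para.strip() to the paragraphs comprehension to drop empty paragraphs.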
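Points 1 and 3 can be tried out on a small made-up sample. The two patterns are copied from the code above; the TIGHTER variant, the sample text, and the {2,10} limit are assumptions for illustration only. The wrapped line in the sample ends with a trailing space, because HARD_WRAP replaces the newline with nothing; if the real file has no trailing spaces, substituting ' ' instead of '' might be needed.

```python
import re

HARD_WRAP = re.compile(r'\n(?!    )')
SPEAKERS = re.compile(r'^    (Mr.|Mrs.|Congressman|Congresswoman|Chairman|Chairwoman) ([a-zA-Z \-]{2,40})\.', re.MULTILINE)
# Hypothetical tighter variant: names of at most 10 characters.
TIGHTER = re.compile(SPEAKERS.pattern.replace('{2,40}', '{2,10}'), re.MULTILINE)

wrapped = (
    "    Mr. Smith. Thank you for \n"
    "coming today.\n"
    "    Mr. Speaker, I take offense at that.\n"
    "    Mr. Speaker I object.\n"
)

# Remove every newline that is not followed by a four-space paragraph indent.
unwrapped = HARD_WRAP.sub('', wrapped)

found = [' '.join(m.groups()) for m in SPEAKERS.finditer(unwrapped)]
found_tighter = [' '.join(m.groups()) for m in TIGHTER.finditer(unwrapped)]

print(unwrapped)
print(found)          # the short, punctuation-free sentence is a false positive
print(found_tighter)  # the tighter length limit rejects it
```

The comma after 'Mr. Speaker' is what keeps the second paragraph from matching, since the name part only allows letters, spaces, and hyphens before the closing period.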

Answer 2: (score: 0)

hearing = hearing.split("RESPONSE TO WRITTEN")
str (hearing)

str.split() returns a list of strings. Then, when you call str() to coerce it to a string, you don't assign the return value to any name. Try:

hearing = str(hearing)

Or, better yet, work out which element of the list produced by the split is the one you want, and pass that element to re.findall.
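As a rough sketch of that second option, assuming the part you care about is everything before the "RESPONSE TO WRITTEN" marker (index 0 of the split result). The sample text and the shortened pattern are made up for illustration.

```python
import re

# Made-up sample: two spoken lines, then the written-response section.
hearing = ("\n    Mr. Smith. Thank you.\n    Ms. Jones. I agree."
           "RESPONSE TO WRITTEN QUESTIONS\n    Mr. Brown. See attached.")

# Assumption: keep only the text before the marker.
spoken_part = hearing.split("RESPONSE TO WRITTEN")[0]

# spoken_part is a plain string, so re.findall accepts it.
speakers = re.findall(r"\n    (?:Mr|Ms)\. [A-Z][a-z]+\.", spoken_part)
speakers = sorted(set(s.strip() for s in speakers))

print(speakers)
```

Because the text after the marker is discarded, "Mr. Brown." from the written responses is not picked up as a speaker.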