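I am trying to run a large function that processes a large text file, splits it into speakers and their speeches, and then breaks each speech down further into its component paragraphs. Here is the code: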
import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    hearing = hearing.split("RESPONSE TO WRITTEN")
    str (hearing)
    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    #print speakers
    position = []
    for speaker in speakers:
        x = hearing.find(speakers)
        position.append(x)
    def find_speaker(hearing, speakers):
        position = []
        for speaker in speakers:
            x = hearing.find(speaker)
            if x==-1:
                x += 1000000
            position.append(x)
        first = min(position)
        name = speakers[position.index(min(position))]
        name_length = len(name)
        chunk = [name, hearing[0:first], hearing[first+name_length:]]
        #return chunk
    chunks = []
    #print hearing
    names = []
    while len(hearing)>10:
        chunk_try = find_speaker(hearing, speakers)
        hearing = chunk_try[2]
        chunks.append(chunk_try[1])
        names.append(chunk_try[0].strip())
    print len(hearing)#0
    chunks.append(hearing)
    chunks = chunks[1:]
    print len(names)
    print len(chunks)
    data = zip(names, chunks)
    with open(output_filename1,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Speaker','Speech'])
        for row in data:
            w.writerow(row)
    paragraphs = str(chunks)
    print (paragraphs)
    Paragraphs = paragraphs.split("\\n")
    data1 = zip(Paragraphs)
    with open(output_filename2,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Paragraphs'])
        for row in data1:
            w.writerow(row)
    return True

driver("C:/Users/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
However, when I run the driver function, I get the following error:
Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
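I think this means that hearing is not being passed as a string, but when I tried str(hearing) it did not resolve the error. I am also confused about why the traceback references three separate lines of code. Any advice would be appreciated; I have been stuck on this for quite a while!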
Answer 0 (score: 2)
The structure of your code is a bit confusing, but I will try to explain what is happening.
When you reach this line:
speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
hearing is a list, because you called str.split on it:
hearing = hearing.split("RESPONSE TO WRITTEN")
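So you get the error because re.findall does not accept a list as its second argument. It requires a string or buffer instead.
That is the problem. The solution is to make the second argument of re.findall a string. Where that string comes from depends on what you want to do.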
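The failure can be reproduced in isolation. The following is a minimal sketch using a made-up sample string rather than the real hearing file; the wording of the TypeError message differs between Python versions, so only the exception type is recorded.

```python
import re

# Made-up stand-in for the hearing text; the real file is not available here.
sample = "opening remarks\n Mr. Smith. Thank you.RESPONSE TO WRITTEN questions"

parts = sample.split("RESPONSE TO WRITTEN")  # str.split returns a list
print(type(parts).__name__)

error_name = None
try:
    # Passing the list itself is what triggers the error in the question.
    re.findall(r"Mr\. [A-Z][a-z]+\.", parts)
except TypeError as exc:
    error_name = type(exc).__name__

print(error_name)
```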
Judging by this line:
str (hearing)
I think you want to turn the list hearing into its string representation. If so, you need to reassign hearing like this:
hearing = str(hearing)
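One caveat worth checking before relying on this, shown as a sketch with made-up text rather than the real hearing file: str() applied to a list returns the list's printable representation, so each real newline becomes the two characters backslash and n. A pattern that looks for a real newline, as the one in the question appears to, then finds nothing in the converted string, while it still matches inside an individual element.

```python
import re

# Made-up stand-in for the list produced by hearing.split(...).
hearing = ["intro\n Mr. Smith. Hello.", " QUESTIONS"]

as_text = str(hearing)
print(as_text)

# Simplified version of the question's pattern: a real newline, then " Mr. Name."
pattern = r"\n Mr\. [A-Z][a-z]+\."

in_repr = re.findall(pattern, as_text)        # searched in the list's repr
in_element = re.findall(pattern, hearing[0])  # searched in one real string

print(in_repr)
print(in_element)
```

If the converted string yields no speakers, that is the likely reason, and searching a single element of the list is the safer route.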
Answer 1 (score: 1)
You have put everything into one monolithic block of code, which makes it harder to test or modify. I rewrote it as follows:
from bs4 import BeautifulSoup
from collections import namedtuple
import csv
from itertools import tee, izip
import os, os.path
import re

DIR = r'C:\Users\Documents\Congressional Hearings\NHTF Project\Test Set'
HARD_WRAP = re.compile(r'\n(?! )')
SPEAKERS = re.compile(r'^ (Mr.|Mrs.|Congressman|Congresswoman|Chairman|Chairwoman) ([a-zA-Z \-]{2,40})\.', re.MULTILINE)
NAME = lambda m: '{0} {1}'.format(*m.groups())
Speaker = namedtuple('Speaker', ['name', 'name_start', 'name_end'])

def load_hearing_response(fname, split_on=' Present:'):
    with open(fname, 'rU') as inf:
        html = inf.read()
    txt = BeautifulSoup(html).get_text()
    return txt.rsplit(split_on, 1)[-1] # return everything after last occurrence of split_on

def un_hard_wrap(txt, reg=HARD_WRAP):
    return reg.sub('', txt)

def pairwise(iterable):
    a,b = tee(iterable)
    next(b, None)
    return izip(a, b)

def get_speeches(txt):
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None)) # tail sentinel for pairwise processing
    return [(this.name, txt[this.name_end:nxt.name_start]) for this,nxt in pairwise(speakers)]

def write_csv(fname, data, header=None):
    with open(fname, 'wb') as outf:
        out_csv = csv.writer(outf)
        if header is not None:
            out_csv.writerow(header)
        out_csv.writerows(data)

def main():
    # get text of Congressional hearing responses
    txt = load_hearing_response(os.path.join(DIR, 'CHRG-107hhrg70750.htm'))
    txt = un_hard_wrap(txt)
    # break into speeches
    speeches = get_speeches(txt)
    # write (speaker, speech) pairs to a .csv file
    write_csv(os.path.join(DIR, 'CHRG-107hhrg70750.csv'), speeches, ['Speaker', 'Speech'])
    # write paragraphs of speeches to a .csv file
    paragraphs = ([para.strip()] for speaker,speech in speeches for para in speech.split('\n') if para.strip())
    write_csv(os.path.join(DIR, 'Paragraphs.csv'), paragraphs, ['Paragraphs'])

if __name__=="__main__":
    main()
Note that this is untested, because I do not have the original data file.
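The sentinel-plus-pairwise idea in get_speeches() can be tried on a toy string. This is only a sketch: the transcript text and the simplified speaker_re pattern are made up for illustration, zip is used where the Python 2 code above uses izip, and .strip() is added purely to keep the printed result tidy.

```python
import re
from itertools import tee

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Made-up transcript; speaker lines start with one space in this sketch.
txt = " Mr. Smith. Good morning.\n Ms. Lee. Thank you all.\n"

# Hypothetical, simplified speaker pattern (not the SPEAKERS regex above).
speaker_re = re.compile(r"^ (Mr\.|Ms\.) ([A-Za-z]+)\.", re.MULTILINE)

# One (name, tag start, tag end) triple per speaker tag found.
marks = [(m.group(1) + " " + m.group(2), m.start(), m.end())
         for m in speaker_re.finditer(txt)]
marks.append(("", len(txt), None))  # tail sentinel closes the last speech

# Each speech runs from the end of one tag to the start of the next tag.
speeches = [(name, txt[end:next_start].strip())
            for (name, _, end), (_, next_start, _) in pairwise(marks)]
print(speeches)
```

Without the sentinel, the last speaker would have no following tag to pair with, and the final speech would be dropped.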
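Edit: after being pointed to a sample data file, I made the following changes:
The text is hard-wrapped; I added an un_hard_wrap() function to convert it back to unwrapped text (each paragraph followed by '\n').
I made a mistake in get_speeches(), using sp.pos instead of sp.start() and sp.end_pos instead of sp.end(). That is now fixed.
I adjusted the SPEAKERS regex to eliminate some false positives (for example, one speaker said 'Mr. Speaker, I take offense...' and was detected as 'Mr. Speaker'). That should now be resolved, unless someone opens with a sentence shorter than 40 characters. If you know the longest speaker surname, you can tighten the SPEAKERS regex accordingly: {2,40} could become {2,26} or whatever maximum length is appropriate.
I added ... if para.strip() to the paragraphs comprehension to remove empty paragraphs.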
Answer 2 (score: 0)
hearing = hearing.split("RESPONSE TO WRITTEN")
str (hearing)
str.split() returns a list of strings. When you then call str() to convert it to a string, the return value is not assigned to any name. Try:
hearing = str(hearing)
Or, better yet, work out which element of the list produced by the split is the one you want, and pass that element to re.findall.
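A minimal sketch of that second option follows. The sample text is made up, and taking the part before the marker (index 0) is an assumption about which piece holds the transcript; the pattern is a condensed form of the alternation in the question.

```python
import re

# Made-up stand-in for the text returned by get_text().
text = (
    "HEARING\n Chairman Jones. The committee will come to order."
    "\n Ms. Lee. Thank you.\nRESPONSE TO WRITTEN QUESTIONS\n Mr. Late. Ignored."
)

parts = text.split("RESPONSE TO WRITTEN")

# Assumption: the transcript proper is the part before the marker.
transcript = parts[0]

# Non-capturing group so findall returns whole matches, not just the title.
pattern = (r"\n (?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman)"
           r" [A-Z][a-z]+\.")

speakers = re.findall(pattern, transcript)
print(speakers)
```

Because transcript is an ordinary string, its newlines are still real newline characters and the pattern can match them.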