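For a research project, I would like to analyze earnings call transcripts. I am new to programming, so I am having trouble finding a suitable setup for my analysis requirements.
The transcript documents have the following (simplified) form: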
++ Header with company name, industry, year, quarter, datetime ++
Speaker1 - Position of Speaker1: Some text.
Speaker2 - Position of Speaker2: Some text.
(...)
++ End of file ++
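To make the format concrete, here is a minimal sketch of how one line of the simplified form could be split into speaker, position, and text. It assumes plain-text lines with the dash surrounded by spaces and no colon in the speaker or position; the real files are HTML, so this only illustrates the target fields. The names `Turn`, `LINE_PATTERN`, and `parse_line` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    """One spoken contribution (hypothetical helper type)."""
    speaker: str
    position: str
    text: str


# Assumed line shape: "<speaker> - <position>: <text>"
LINE_PATTERN = re.compile(
    r"^(?P<speaker>[^:]+?)\s+-\s+(?P<position>[^:]+?)\s*:\s*(?P<text>.+)$"
)


def parse_line(line: str) -> Optional[Turn]:
    """Return a Turn for a speaker line, or None for any other line."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return Turn(match["speaker"], match["position"], match["text"])


sample = [
    "++ Header with company name, industry, year, quarter, datetime ++",
    "Speaker1 - Position of Speaker1: Some text.",
    "Speaker2 - Position of Speaker2: Some more text.",
    "++ End of file ++",
]

# Keep only the lines that look like speaker turns.
turns = [turn for turn in map(parse_line, sample) if turn is not None]
```

Header and footer lines do not match the pattern, so they are skipped rather than parsed as turns.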
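I want to process the HTML files and store them in a data structure that will meet my analysis requirements later in the project. I came up with the idea of creating three nested classes: a Master class that holds Document objects, a Document class that holds Paragraph objects, and a Paragraph class.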
    class Master:

        def __init__(self):
            ''' Initialize the master object '''
            self.filenames = []  # List of filenames
            self.documents = []  # List of documents

        def process_documents(self, filenames):
            ''' Create document object for each document '''
            # extend (not append) so self.filenames stays a flat list
            self.filenames.extend(filenames)
            for filename in filenames:
                with open(filename, 'r') as f:
                    html = f.read()
                document = Document(html)
                document.process_paragraphs()
                self.documents.append(document)


    class Document:

        def __init__(self, html):
            ''' Initialize the document object '''
            self.html = html                       # Html code of complete file
            self.cname = self.parse_cname()        # Company name
            self.industry = self.parse_industry()  # Industry
            self.quartal = self.parse_quartal()    # Quarter
            self.year = self.parse_year()          # Year
            self.dtime = self.parse_dtime()        # Datetime
            self.paragraphs = []                   # List of paragraphs

        def parse_cname(self):
            ''' Extract company name from html '''
            pass

        def parse_industry(self):
            ''' Extract industry from html '''
            pass

        def parse_quartal(self):
            ''' Extract quarter from html '''
            pass

        def parse_year(self):
            ''' Extract year from html '''
            pass

        def parse_dtime(self):
            ''' Extract datetime from html '''
            pass

        def split_paragraphs(self):
            ''' Split html into one html portion per paragraph '''
            # Placeholder, not implemented yet. Iterating over self.html
            # directly would yield single characters, not paragraphs.
            return []

        def process_paragraphs(self):
            ''' Create paragraph object for each paragraph '''
            for portion in self.split_paragraphs():
                paragraph = Paragraph(portion)
                self.paragraphs.append(paragraph)


    class Paragraph:

        def __init__(self, html):
            ''' Initialize the paragraph object '''
            self.html = html                       # Html code of paragraph
            self.speaker = self.parse_speaker()    # Speaker
            self.position = self.parse_position()  # Position of speaker
            self.text = self.parse_text()          # Text

        def parse_speaker(self):
            ''' Extract speaker from paragraph html '''
            pass

        def parse_position(self):
            ''' Extract position from paragraph html '''
            pass

        def parse_text(self):
            ''' Extract text from paragraph html '''
            pass
With this setup, I hope to simplify the subsequent analysis steps.
Is the proposed setup suitable for the problem at hand?
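My concern is that, from the problem's point of view, there is a strong connection between the Master, Document, and Paragraph levels. The classes in my code, however, have no real relationship apart from storing each other's objects. For example, if I implemented a method on the Document class that returns a list of Paragraph objects, I would lose the information about which document those Paragraph objects belong to.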
Is it possible to capture the hierarchical relationship of the objects in my setup?
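One direction I have considered, as a rough sketch rather than a settled design, is to give each Paragraph a reference to the Document that owns it. The `document` constructor argument, the `get_paragraphs` method, and the stripped-down `Document` constructor (taking a company name and a list of already-split portions) are assumptions made for illustration and are not part of my setup above.

```python
class Paragraph:

    def __init__(self, html, document):
        self.html = html          # Html code of paragraph
        self.document = document  # Back-reference to the owning Document (assumption)


class Document:

    def __init__(self, cname, portions):
        self.cname = cname  # Company name, passed in directly for this sketch
        # Each paragraph receives `self`, so it remembers its document.
        self.paragraphs = [Paragraph(portion, self) for portion in portions]

    def get_paragraphs(self):
        ''' Return a copy of the paragraph list (hypothetical method) '''
        return list(self.paragraphs)


doc = Document("ACME Corp", ["<p>first</p>", "<p>second</p>"])
paragraphs = doc.get_paragraphs()

# Even outside the document, each paragraph can reach its parent's metadata.
company_of_first = paragraphs[0].document.cname
```

I am not sure whether storing such a back-reference is considered good practice, or whether there is a more idiomatic way to express the hierarchy.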