Question

I have a text file like the following.

LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis

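I want to get the whole line for each GJ entry, including its continuation lines. That is, my final output should look like this.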

[[asthma, susceptible genes, natural language processing analysis, network centrality analysis], [natural language processing, network analysis], [data mining, text mining, learning analytics, deep learning, network centrality analysis]]

I am using the following Python program.

with open(input_file, encoding="utf8") as fo:
    for line in fo:

        if line[:2].isupper():

            if line[:2] == 'GJ':
                temp_line = line[2:].strip()

                next_line = next(fo)

                if next_line[:2].isupper():
                    keywords = temp_line.split(';')
                else:
                    mykeywords = temp_keywords + ' ' + next_line.strip()
                    keywords = mykeywords.split(';')
                print(keywords)

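However, there is a problem with the way I skip over the next line. As a result, my program does not output the third GJ entry (i.e. [data mining, text mining, learning analytics, deep learning, network centrality analysis]) as a list.

I am happy to provide more details if needed.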

Answer 1

This is what you are trying to do, and you can probably get there with just a bit of debugging.

temp_keywords = ''
mykeywords = ''
with open(input_file, encoding="utf8") as fo:    
    for line in fo:
        if line[:2].isupper():    
            if line[:2] == 'GJ':
                temp_line = line[2:].strip()
                next_line = next(fo)
                temp_line += next_line.strip()
                print (temp_line.split(';'))

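The problem here is that by calling next(fo) yourself instead of letting the for loop do its job, you take on all of the for loop's work. Whatever you read into next_line is never processed by the next loop iteration, so you skip some lines of the file entirely.

Instead, you should always let the for loop do its job.

But here you have two different ways of breaking up the file: by line and by record. It would be easier to write a record parser that looks for complete records and let it read lines from the file as needed. This is an adaptation of the other answer I linked in the comments: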
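To make the skipped-line effect concrete, here is a minimal sketch. It assumes an in-memory file built with io.StringIO, and the function name records_seen_by_loop and the three-line sample are hypothetical, not part of the question. It calls next() inside the loop on every iteration and records which lines the loop body actually sees.

```python
import io

# Hypothetical three-record sample; every line starts a new GJ record.
SAMPLE = "GJ a; b\nGJ c; d\nGJ e; f\n"

def records_seen_by_loop(text):
    """Return the lines the for loop body gets to process."""
    seen = []
    fo = io.StringIO(text)
    for line in fo:
        seen.append(line.strip())
        # The loop and next() share one iterator, so this consumes
        # the following line and the loop never sees it.
        next(fo, None)
    return seen

print(records_seen_by_loop(SAMPLE))  # ['GJ a; b', 'GJ e; f']
```

The second record is read by next() and silently dropped, which is the same way the third GJ entry goes missing in the question.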

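def is_new_record(line):
    # A record starts with a two-letter uppercase tag such as 'GJ' or 'ID'.
    return line[:2].isupper()

def helper(text):
    data = []
    for line in text:  # iterate lazily; no need to load everything with readlines()
        if is_new_record(line):
            if data:
                yield ' '.join(data)
            data = [line.strip()]
        else:
            data.append(line.strip())
    if data:
        # Join with a space so that words split across lines do not run together.
        yield ' '.join(data)

# the helper is a generator that yields each multiline record as a single line
input_file = 'data.txt'
with open(input_file, encoding="utf8") as f:
    for record in helper(f):
        print(record)

LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION; INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning; network centrality analysis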
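The record generator above prints every record, not just the GJ ones. As a rough sketch of the remaining step, the following filters the joined records and splits them into keywords. The names iter_records and gj_keywords and the shortened sample are assumptions made for illustration, and the generator is restated here (joining with a space) only so the block runs on its own.

```python
import io

def iter_records(lines):
    """Yield each record as one string, joining its continuation lines."""
    data = []
    for line in lines:
        if line[:2].isupper():      # two-letter uppercase tag: new record
            if data:
                yield ' '.join(data)
            data = [line.strip()]
        elif line.strip():          # non-blank continuation line
            data.append(line.strip())
    if data:
        yield ' '.join(data)

def gj_keywords(lines):
    """Return one keyword list per GJ record."""
    return [[k.strip() for k in rec[2:].split(';') if k.strip()]
            for rec in iter_records(lines) if rec.startswith('GJ')]

# Shortened, hypothetical sample in the same layout as the question's file.
sample = io.StringIO(
    "LA English\n"
    "GJ asthma; susceptible genes; network\n"
    "   centrality analysis\n"
    "ID METFORMIN; OBESITY\n"
    "GJ natural language processing; network analysis\n"
)
print(gj_keywords(sample))
```

Because the for loop is the only thing that reads from the file, no line is skipped, and the continuation line ends up in the same keyword list as the record it belongs to.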

Answer 2

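Let's try to break the problem down. There are two main logical steps in your code:

1. Take each non-indented line together with the indented lines that follow it, and join them into a single "line".
2. Keep only the lines that start with "GJ".

The code looks like this: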

def iter_lines(fo):
    cur_line = []
    for row in fo:
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []  # reset the cache
        cur_line.append(row.strip())
    # yield the last line
    if cur_line:
        yield ' '.join(cur_line)


with open(input_file, encoding="utf8") as fo:
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            keywords = [k.strip() for k in line[2:].split(';')]
            print(keywords)
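The loop above prints one list per GJ entry, whereas the question asks for a single nested list. One possible way to collect it, sketched here against the sample data from the question held in an in-memory file: collect_gj and SAMPLE are hypothetical names, and iter_lines is repeated only to keep the block self-contained.

```python
import io

def iter_lines(fo):
    """Join each non-indented line with the indented lines that follow it."""
    cur_line = []
    for row in fo:
        if not row.startswith(' ') and cur_line:
            yield ' '.join(cur_line)
            cur_line = []
        cur_line.append(row.strip())
    if cur_line:
        yield ' '.join(cur_line)

def collect_gj(fo):
    """Return a list with one keyword list per GJ entry."""
    result = []
    for line in iter_lines(fo):
        if line.startswith('GJ'):
            result.append([k.strip() for k in line[2:].split(';')])
    return result

SAMPLE = """\
LA English
DT Article
GJ asthma; susceptible genes; natural language processing analysis; network
   centrality analysis
ID LITERATURE-BASED DISCOVERY; CO-WORD ANALYSIS; UNDISCOVERED PUBLIC
   KNOWLEDGE; INFORMATION-RETRIEVAL; FISH-OIL; SCIENTIFIC COLLABORATION;
   INSULIN-RESISTANCE; COMPLEX NETWORKS; METFORMIN; OBESITY
GJ natural language processing; network analysis
GJ data mining; text mining; learning analytics; deep learning;
   network centrality analysis
"""

all_keywords = collect_gj(io.StringIO(SAMPLE))
print(all_keywords)
```

With a real file, the io.StringIO object would be replaced by the handle from open(input_file, encoding="utf8").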

How to efficiently read the next line in a file
