Extracting consecutive sections of text from a txt file

Time: 2019-03-07 21:44:26

Tags: python

I have a text file that contains consecutive question and answer sections. For example:

Q1: some lines of text.
Answer: some lines of text.
Q2: some lines of text.
Answer: some lines of text.

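I want to extract the questions and answers from the text file and put them into a csv file with two columns (Question and Answer), with each question and its answer going into the corresponding column.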

This is the code I have so far (still very rudimentary):

for line in file:
    if line.strip() == 'Answer :':
        print(line)
        break
for line in file:
    if line.startswith('Q'):
        break
    print(line)

But this only prints the first instance of an answer. What should I do?

Here is a sample of the file:

Q1: What is the piston rod and the connecting rod?
Answer:
Piston Rod
A rod which is connected to the piston and the connecting rod is called piston rod. 
Connecting Rod
The intermediate element between piston rod and crankshaft is the connecting rod. 

Q2: State the constructional features and material requirements of connecting rod.
Answer: 
1. The cross-section of connecting rod is I-section and the rods should be designed long, inorder to satisfy our need and
requirement.
2. The rods should have high strength and should not fail, when axial loads are applied on them.

Here is a screenshot of part of the file:

screenshot

Here is a sample of the question-and-answer format in the text file:

Q1. 
What is the piston rod and the connecting rod? 
Answer :  
Piston Rod

A rod which is connected to the piston and the connecting rod is called piston rod. It transmits gas pressure developed by 
the fuel or steam to the crankshaft through connecting rod. One end of piston rod is attached to the piston by a tapered rod with a 
nut and the other end is joined with the connecting rod, through a crosshead by a cotter-pin. These ends are having no revolving 
movement and hence, they are considered as fixed ends.
Connecting Rod

The intermediate element between piston rod and  crankshaft is the connecting rod. It consists of a small end which acts as 
a connection for piston rod and a big end, that is usually split to accommodate the crank pin bearing shells. When the fuel force 
is transmitted from piston rod to crankshaft, the connecting rod is also subjected to alternate tensile and compressive forces. The 
compressive load is taken as the design load for the connecting rod, similar to the design of piston rod.
Q2. 
State the constructional features and material requirements of connecting rod.
Answer : 
1. 
The cross-section of connecting rod is I-section and the rods should be designed long, inorder to satisfy our need and 
requirement.
2. 
The rods should have high strength and should not fail, when axial loads are applied on them.
3. 
Connecting rods are made up of carbon steels or alloy steels of molybdenum and chromium, as these materials have high 
tensile and compressive strengths.
Q3. 
Write about the forces acting on the connecting rod.
OR

Explain the various types of stresses induced in the connecting rod.

2 answers:

Answer 0 (score: 0)

Let's assume your text file is:

Q1 : What is your name?
Answer: Joe

Q2: What is your last name?
Answer: Joe Joe

Now we can build a dictionary:

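# Read every line of the file; the with block closes it afterwards.
with open('myfile.txt', 'r') as f:
    lines = f.readlines()
ques = []
ans = []
for item in lines:
    # Test the start of the line: "Q" in item would match any line
    # that merely contains a capital Q.
    if item.startswith("Q"):
        ques.append(item)
    elif item.startswith("Answer"):
        ans.append(item)
# Pair each question with the answer at the same position.
dictionary = dict(zip(ques, ans))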

print(dictionary)
> {'Q1 : What is your name?\n': 'Answer: Joe\n',
 'Q2: What is your last name?\n': 'Answer: Joe Joe'}

I am also assuming that every question is followed by an answer. If that is not the case, the code may need to be updated.
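If some questions have no answer, pairing by position with zip would shift every later answer onto the wrong question. Below is a minimal sketch that keeps the pairs aligned, assuming questions start with "Q" followed by a digit and answers start with "Answer". The function name pair_questions and the third sample question are made up for illustration.

```python
import re

def pair_questions(lines):
    """Return a list of [question, answer] pairs; the answer is '' when missing."""
    pairs = []
    for line in lines:
        text = line.strip()
        if re.match(r"Q\d+", text):
            # Every question opens a new pair with an empty answer.
            pairs.append([text, ""])
        elif text.startswith("Answer") and pairs:
            # Attach the answer to the most recent question.
            pairs[-1][1] = text
    return pairs

# Hypothetical sample: Q2 has no answer.
sample = [
    "Q1 : What is your name?\n",
    "Answer: Joe\n",
    "Q2: What is your last name?\n",
    "Q3: Where do you live?\n",
    "Answer: Paris\n",
]
for question, answer in pair_questions(sample):
    print(repr(question), "->", repr(answer))
```

Like the code above, this sketch only handles one-line questions and answers.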

Answer 1 (score: 0)

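I don't think the full question text should be used as the dictionary key, because you would then need to know it in advance to retrieve the answer.
You can use two separate lists or dictionaries, one for the questions and one for the answers. With lists, just make sure each question and its answer sit at the same index. With dictionaries, use the same key in each dictionary for a question and its answer (it can be a running number).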

Here is an example with two dictionaries:

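import re

questions = {}
answers = {}

c = 1  # key shared by a question and its answer
scanquestion = True  # True while reading question text, False while reading answer text
with open("myfile.txt", "r") as ff:
    for line in ff:
        if re.search(r"^Q\d+", line) is not None:
            # A line starting with "Q<number>" opens a new question.
            scanquestion = True
            questions[c] = line
        elif line.startswith('Answer'):
            # An "Answer" line closes the question and opens its answer.
            scanquestion = False
            # Keep any answer text that follows the colon on the same line.
            answers[c] = line.partition(':')[2].lstrip()
            c += 1
        elif not line.strip():
            # Skip blank lines.
            pass
        else:
            if scanquestion:
                questions[c] += line
            else:
                answers[c-1] += line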

print(questions)
print(answers)

questions[1] is the first question and answers[1] is the corresponding answer.
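The question asks for a two-column csv file, so here is a minimal sketch of that last step, assuming two dictionaries keyed by the same running number as above. The helper name write_qa_csv, the column headers and the shortened sample text are assumptions, and an in-memory buffer stands in for a real file.

```python
import csv
import io

def write_qa_csv(questions, answers, out):
    """Write one row per question key: question text, then answer text."""
    writer = csv.writer(out)
    writer.writerow(["Question", "Answer"])
    for key in sorted(questions):
        # A question without an answer gets an empty second column.
        writer.writerow([questions[key].strip(), answers.get(key, "").strip()])

# Hypothetical sample data in the shape of the two dictionaries.
questions = {1: "Q1. \nWhat is the piston rod and the connecting rod? \n"}
answers = {1: "Piston Rod\nA rod which is connected to the piston.\n"}

buffer = io.StringIO()  # stands in for open("qa.csv", "w", newline="")
write_qa_csv(questions, answers, buffer)
print(buffer.getvalue())
```

The csv module quotes fields that contain line breaks, so multi-line questions and answers stay in a single cell. When writing to a real file, open it with newline="" so that the line endings are not translated a second time.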

Edit (after the comments and the edit to the question)

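After looking at the screenshot and reading the comments, I think there is no blank line between an answer and the next question.
I have edited my answer. I use a regex to search for "Q1", "Q2" and so on at the start of a line to identify a new question, and I make no assumption about whether blank lines are present (if there are any, they are skipped).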
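To illustrate what that regex accepts, here is a small check against a few lines. The sample lines are made up, apart from the "Q1." and "Q2:" prefixes that appear in the question.

```python
import re

# Question markers in both formats, plus lines that must not match.
samples = [
    "Q1. \n",
    "Q2: State the constructional features.\n",
    "Quality matters.\n",
    "1. \n",
    "See Q3 for details.\n",
]
is_question = re.compile(r"^Q\d+")
for line in samples:
    print(repr(line), bool(is_question.search(line)))
```

Only the first two lines match: the pattern needs a "Q" at the very start of the line, immediately followed by at least one digit.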