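I'm having a problem using Python. I have a txt file that contains 500 abstracts from 500 papers, and I want to split it into 500 files, each containing exactly one abstract. I noticed that every abstract ends with a line starting with "PMID", so I'd like to split the file on that line. But I'm really new to Python. Any ideas? Thanks in advance.
The txt file looks like this: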
1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004.
text text text texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext text
PMID: 24297188 [PubMed - indexed for MEDLINE]
2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub
2013 May 24.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23711805 [PubMed - indexed for MEDLINE]
3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub
2013 May 11.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23672989 [PubMed - indexed for MEDLINE]
And so on.
Answer 0 (score: 1)
There are many ways to do this. Here is one.
If the data is in a file named data:
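import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    http://stackoverflow.com/a/17508761/190597
    readfunc(chunksize) should return a string.
    """
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        # The last piece may continue in the next chunk, so hold it back.
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

with open('data', 'r') as infile:
    # Extend each read to the end of its line, so that a PMID line
    # is never cut in two at a chunk boundary.
    def read_to_line_end(size):
        return infile.read(size) + infile.readline()

    chunks = open_chunk(read_to_line_end, delimiter=r'(PMID.*)')
    # The capturing group makes re.split keep the PMID lines, so the pieces
    # alternate between abstract text and delimiter; zip pairs them up.
    for i, (chunk, delim) in enumerate(zip(*[chunks]*2)):
        chunk = chunk+delim
        chunk = chunk.strip()
        if chunk:
            print(chunk)
            print('-'*80)
            # uncomment this if you want to save the chunk to a file named dataXXX
            # with open('data{:03d}'.format(i), 'w') as outfile:
            #     outfile.write(chunk)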
This prints:
1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004.
text text text texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext text
PMID: 24297188 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub
2013 May 24.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23711805 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub
2013 May 11.
text texttext texttext texttext texttext texttext texttext texttext texttext text
text texttext texttext texttext texttext texttext texttext texttext texttext text
PMID: 23672989 [PubMed - indexed for MEDLINE]
--------------------------------------------------------------------------------
Uncomment the last two lines to save the chunks as separate files.
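The loop in the code above relies on two idioms that may be unfamiliar: re.split keeps the delimiters when the pattern contains a capturing group, and zip(*[iterator]*2) consumes one iterator two items at a time. Here is a minimal sketch of both, using a made-up two-record string (the sample text and the variable names are assumptions for illustration only):

```python
import re

# Made-up sample data, standing in for the real file contents.
sample = "first abstract\nPMID: 111\nsecond abstract\nPMID: 222\n"

# Because the pattern has a capturing group, re.split returns the matched
# PMID lines as well, so body text and delimiters alternate in the result.
pieces = re.split(r'(PMID.*)', sample)

# zip(*[it] * 2) pulls two consecutive items from the same iterator,
# pairing each body with the PMID line that follows it. A trailing
# unpaired piece (here the final newline) is silently dropped.
it = iter(pieces)
records = [(body + delim).strip() for body, delim in zip(*[it] * 2)]

for record in records:
    print(record)
    print('-' * 20)
```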
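Why so complicated?
For short files, you can simply read the whole file into a string and split the string with a regular expression. The solution above is an adaptation of that idea that can also handle large files: it reads the file in chunks, finds where to split them, and yields the pieces as it finds them.
The problem of processing a file in chunks separated by a delimiter regex pattern comes up often. So rather than writing a custom solution for each case, it is easier to use a utility function like open_chunk, which handles all of them regardless of the delimiter and works for large files as well as small ones.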
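For comparison, the simpler whole-file approach mentioned above might look like the following sketch. It assumes the file fits comfortably in memory and that "PMID" only ever starts a line at the end of an abstract; the helper name split_abstracts and the sample text are hypothetical.

```python
import re

# Lazily match everything up to the end of the next line that starts
# with "PMID". [^\n]* is used instead of .* because DOTALL lets "."
# cross line boundaries.
RECORD = re.compile(r'.*?^PMID[^\n]*', re.DOTALL | re.MULTILINE)

def split_abstracts(text):
    """Return a list of abstracts, each ending with its PMID line."""
    return [match.group(0).strip() for match in RECORD.finditer(text)]

# Made-up sample in the same shape as the question's file.
sample = (
    "1. Journal A. 2013.\n"
    "text text\n"
    "PMID: 111 [PubMed]\n"
    "2. Journal B. 2013.\n"
    "text text\n"
    "PMID: 222 [PubMed]\n"
)
abstracts = split_abstracts(sample)
print(len(abstracts))
```

With a real file, the text would come from reading it in one go, for example with open('data') as f: text = f.read(). Any text after the last PMID line is ignored by this sketch.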
Answer 1 (score: 1)
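You could try:
with open("txtfile.txt", "r") as f:  # read the whole file
    ss = f.read()
bb = ss.split("\nPMID:")  # split into blocks
# Reinsert the "PMID:" prefix, if needed.
# bb[0] never had the prefix, so it is kept unchanged.
bb1 = bb[:1] + ["PMID:" + b for b in bb[1:]]
Note that the final newline of each block is removed. The blocks can then be written to separate files.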
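Splitting on "\nPMID:" as above leaves each PMID line at the start of the block that follows it. If the PMID line should instead stay at the end of its own abstract, a line-by-line pass is one alternative. The following is only a sketch: the function names, the output file naming scheme, and the tolerance for leading whitespace before "PMID" are all assumptions, not part of the original answer.

```python
import os

def split_on_pmid(lines):
    """Yield one abstract at a time; each ends with its PMID line."""
    current = []
    for line in lines:
        current.append(line)
        # Tolerate leading whitespace before "PMID" (an assumption).
        if line.lstrip().startswith("PMID"):
            yield "".join(current)
            current = []
    # Keep any trailing text that has no closing PMID line.
    leftover = "".join(current)
    if leftover.strip():
        yield leftover

def write_abstracts(src_path, out_dir):
    """Write each abstract to out_dir/abstract_001.txt, abstract_002.txt, ...

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(src_path, "r") as src:
        # A file object iterates line by line, so memory use stays small.
        for count, abstract in enumerate(split_on_pmid(src), start=1):
            name = "abstract_{:03d}.txt".format(count)
            with open(os.path.join(out_dir, name), "w") as out:
                out.write(abstract)
    return count
```

Calling write_abstracts("txtfile.txt", "abstracts") would then create one file per abstract in a directory named abstracts; both names are placeholders.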