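I'm a beginner-level Python user, and I'm trying to write a program that returns the text before and after a specific word I search for (for example, the 50 words before it and the 50 words after it). So far I've managed to write a program that reports the PDF page on which the word is mentioned. How can I write those 100 words to a CSV?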
import PyPDF2
import re
import os
...
for pdfName in pdffiles:
    pdfFull = pdfFolder + pdfName
    pdfFileObj = open(pdfFull, mode='rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages = pdfReader.numPages
    pages_text = []
    words_start_pos = {}
    words = {}
    csvFolder = newpath
    csvName = pdfName.replace('pdf', 'csv')
    csvFull = csvFolder + csvName
    with open(csvFull, 'w') as f:
        f.write('{0},{1},{2}\n'.format("Sheet Number", "Search Word", "File Name"))
        for word in searchwords:
            for page in range(number_of_pages):
                pages_text.append(pdfReader.getPage(page).extractText())
                words_start_pos[page] = [dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
                words[page] = [pages_text[page][value:value + len(word)] for value in words_start_pos[page]]
            for page in words:
                for i in range(0, len(words[page])):
                    if str(words[page][i]) != 'nan':
                        f.write('{0},{1},{2}\n'.format(page + 1, words[page][i], pdfFull))
Answer 0 (score: 0)
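I don't think it's necessary to grab every character on the page and find the index of the word's first letter. Instead, you can still do the following:
pages_text.append(pdfReader.getPage(page).extractText())
Then do this:
pages_text[0].split()
This gives you every word from the extracted text, so you already have the words without indexing characters or working out where each word starts and ends. From there, I would loop over the words, find the index of the target word, then take the 50 positions before and after it and print them. I used this on the first page of a PDF, as follows: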
import PyPDF2

# r'C:\path' is a placeholder; use different paths for the PDF and the CSV.
pdfFileObj = open(r'C:\path', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages = pdfReader.numPages
pages_text = []
searchwords = ["pdf"]
line = []

# Extract the text of every page once, then release the PDF file.
for page in range(number_of_pages):
    pages_text.append(pdfReader.getPage(page).extractText())
pdfFileObj.close()

# Only the first page is searched here, as a demonstration.
text = pages_text[0].split()

for word in searchwords:
    # Positions of every word equal to the search word, ignoring case.
    word_pos = [i for i, each_word in enumerate(text)
                if each_word.lower() == word.lower()]
    print(word_pos)
    for each_pos in word_pos:
        # Clamp the start so a negative index cannot wrap around the list;
        # slicing already tolerates an end past the last word.
        start = max(0, each_pos - 50)
        line.append(' '.join(text[start:each_pos + 50]))
print(line)

with open(r'C:\path', 'w') as f:
    f.write('{0},{1},{2}\n'.format("Sheet Number", word, "File Name"))
    for each_line in line:
        # The page number is 1 because only pages_text[0] was searched.
        f.write('{0},{1},{2}\n'.format(1, each_line, r'C:\path'))
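The same idea extends to every page rather than only the first. The following is a minimal sketch, assuming the page texts have already been extracted into a list of strings. The helper name `find_contexts` and its `width` parameter are hypothetical and not part of PyPDF2. Punctuation attached to a word (for example "PDF,") is stripped before comparing, which is an assumption about what should count as a match.

```python
import string


def find_contexts(pages_text, word, width=50):
    """Return (page_number, context) pairs for each occurrence of word.

    pages_text is a list of page strings; page numbers are 1-based.
    """
    results = []
    target = word.lower()
    for page_index, page_text in enumerate(pages_text):
        tokens = page_text.split()
        for pos, token in enumerate(tokens):
            # Compare case-insensitively, ignoring surrounding punctuation.
            if token.strip(string.punctuation).lower() == target:
                # Clamp the start so a negative index cannot wrap around.
                start = max(0, pos - width)
                context = ' '.join(tokens[start:pos + width + 1])
                results.append((page_index + 1, context))
    return results


# Example with made-up page texts instead of a real PDF.
sample_pages = ["a b c PDF, d e", "no match here", "x pdf y"]
for page_number, context in find_contexts(sample_pages, "pdf", width=1):
    print(page_number, context)
```

Each result carries its own page number, so the value written to the "Sheet Number" column no longer depends on a loop variable left over from an earlier loop.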
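Note: I would be very careful about saving text scraped from a PDF in a CSV file, because the text will most likely contain commas, which will mess up your CSV file. I hope this helps!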
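One way to sidestep the comma problem is Python's standard `csv` module, which quotes any field that contains commas, quote characters, or newlines. This is a minimal sketch, assuming each row is a (page number, context, file name) tuple. The function name `write_rows`, the sample row, and the file name `example.pdf` are made up for illustration.

```python
import csv
import io


def write_rows(f, rows):
    """Write a header and the given rows to the open text file f as CSV."""
    writer = csv.writer(f)
    writer.writerow(["Sheet Number", "Context", "File Name"])
    writer.writerows(rows)


# Demonstrate with an in-memory buffer; a real file would be opened with
# open(path, 'w', newline='') as the csv documentation recommends.
buffer = io.StringIO()
write_rows(buffer, [(1, "text with, a comma", "example.pdf")])
print(buffer.getvalue())
```

Reading the file back with `csv.reader` returns the context as a single field, commas included.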