Question

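I am creating a Python script that reads scanned and tabular .pdf files, extracts some important data, and puts it into JSON so that it can later be loaded into a SQL database (I am also building the DB in MongoDB as a learning project).

Basically, my problem is that I have never worked with JSON files before, but it was recommended that I output to that format. The scraping script works; the preprocessing could be a lot cleaner, but for now it works. The problem I am running into is that the keys and values are in the same list, and some of the values (because they contain a decimal point) end up as two separate list items. I am not even sure where to start.

Since I know the list indices, I could easily assign keys and values by position, but that may not hold for every .pdf, so the script cannot be hard-coded.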

import re

import PyPDF2 as pdf2
import textract

filename = "TestSpec.pdf"

# Read the text layer of every page (legacy PyPDF2 1.x names).
with open(filename, 'rb') as pdfFileObj:
    pdfReader = pdf2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""

    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText()

# An empty string means there is no text layer (scanned PDF), so fall back to OCR.
# textract returns bytes here, not str.
if text == "":
    text = textract.process(filename, method="tesseract", language="eng")

def cleanText(x):
    '''
    This function takes the byte data extracted from scanned PDFs, and cleans it of all
    unnecessary data.
    Requires re
    '''
    # str() on bytes keeps escapes such as \n as literal backslash sequences,
    # which is where the 'n' prefixes in the output below come from.
    stringedText = str(x)
    noNewlines = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', noNewlines)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    # Keep only the tokens between the "sheet" and "od260" markers.
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleanedText = cleanText(text)

This is the current output:

['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
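The split decimals in the list above (for example '66', '8' and '10', '965', '1') come from splitting on `\W+`, which treats the decimal point and the thousands comma as separators. One possible alternative is to match tokens instead of splitting on separators. The following is only a sketch: the `sample` string is a hypothetical stand-in for the extracted text, and the `TOKEN` pattern is an assumption about what the numbers look like.

```python
import re

# Hypothetical stand-in for the extracted text; not taken from the real PDF.
sample = "Tm (50mM NaCl): 66.8 C\nGC Content: 52.8 %\nMolecular Weight: 10,965.1"

# Try a full number first (optional thousands groups and decimal part, not
# followed by a letter), then fall back to a plain run of word characters.
TOKEN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?(?!\w)|\w+")

tokens = [t.lower() for t in TOKEN.findall(sample)]
print(tokens)
```

With this approach a value such as 66.8 stays a single list item, while mixed tokens such as 50mM are still kept whole by the `\w+` fallback.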

We would like the output set up as JSON, for example

{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}

and so on. I am just not sure how to do this.

Here is a screenshot of the data in my sample PDF.

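So, I have figured some of this out. I am still having trouble getting the last three pieces of data I need without hard-coding them, but here is what I have so far. Once everything works, I will worry about optimizing and condensing.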

# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json

# This finds and opens the pdf file, reads the data, and extracts the data.
# Placeholder name: open() does not expand wildcards, so use a real path here.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()

# Checks whether the page had a text layer; if not (scanned picture), textract reads the data.
# It then closes the pdf file.
if text == "":
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()

# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Replaces escaped newlines with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub(r'\n', " ", slice1)
clean2 = re.sub(r' +', ' ', clean)

# Creating the PrimerData dictionary
with open("PrimerData.json", 'w') as file:
    primerDataSlice = clean[clean.index("molecular"): -1]
    # Note: clean has no newlines left, so only ": " actually splits here.
    primerData = re.split(r": |\n", primerDataSlice)
    # Even positions are treated as keys, odd positions as values.
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys, primerValues))}
    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

# Grabbing the date (this has just the date, so json will have to add date.)
# findall returns (full match, month) tuples because the pattern has two groups.
date = re.findall(r'(\d{2}[/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[/\- ]\d{2,4})', clean2)
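The date search at the end of the script above returns tuples of raw text. If a single normalized value is easier to store, something like the following might work. It is only a sketch using the standard library: `find_date`, `DATE_RE`, and `MONTHS` are hypothetical names, and treating two-digit years as 20xx is an assumption.

```python
import re
from datetime import date

# Day, then a numeric or named month, then a 4- or 2-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})[/\- ](\d{1,2}|[a-z]{3,9})[/\- ](\d{4}|\d{2})\b",
    re.IGNORECASE,
)

# Map the first three letters of a month name to its number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

def find_date(text):
    """Return the first date found in text as an ISO string, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    month_num = int(month) if month.isdigit() else MONTHS.get(month[:3].lower())
    if month_num is None:
        return None
    # Assumption: a two-digit year belongs to the 2000s.
    year_num = int(year) + (2000 if len(year) == 2 else 0)
    try:
        return date(year_num, month_num, int(day)).isoformat()
    except ValueError:
        # For example day 31 in a 30-day month.
        return None

print(find_date("21 feb 2019 sequence lacz rp"))
```

The returned string can then be stored under a "Date" key in the dictionary before it is dumped to JSON.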

Answer 1

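Without the input data, it is hard to give you working code. A minimal working example with input would help. As for the JSON handling, a Python dictionary can easily be dumped to JSON. See the examples here: https://docs.python-guide.org/scenarios/json/

Get a JSON string from a dictionary and write it to a file. The remaining work is figuring out how to parse the text into a dictionary.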

import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
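The final comment in the snippet above can be filled in with `json.dump`, which writes straight to a file object. This is a minimal sketch: the temporary directory and the `PrimerData.json` file name are assumptions chosen to keep the example self-contained.

```python
import json
import os
import tempfile

d = {"Date": "21feb2019", "Sequence ID": "lacz-rp", "Sequence 5'-3'": "gat"}

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "PrimerData.json")

# json.dump serializes the dictionary directly into the open file.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(d, fh, ensure_ascii=False, indent=2)

# Read the file back to confirm it round-trips to the same dictionary.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(loaded == d)
```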

Answer 2

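So, I did figure it out. The problem really came from the way my preprocessing pulled all the data into a single list, considering {{1} }, since the dictionary never changes.

Here is a half-finished version of building the dictionary and the JSON file.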

keys
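As a minimal sketch of how a fixed set of keys could be applied to the token list from the question: the `LABELS` table and the `tokens_to_dict` helper below are hypothetical, and the mapping from label tokens to key names is an assumption, not the code this answer originally contained.

```python
import json

# Subset of the token list shown in the question.
tokens = ['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc',
          'tac', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463']

# Hypothetical label table: token sequence that marks a field -> JSON key.
LABELS = {
    ("nsequence",): "Sequence ID",
    ("n5",): "Sequence 5'-3'",
    ("norder", "no"): "Order No",
    ("nref", "no"): "Ref No",
}

def tokens_to_dict(tokens, labels):
    """Collect the tokens that follow each known label until the next label."""
    result, current, i = {}, None, 0
    while i < len(tokens):
        for label, key in labels.items():
            if tuple(tokens[i:i + len(label)]) == label:
                # A known label starts a new field.
                current = key
                result[current] = []
                i += len(label)
                break
        else:
            # Not a label: attach the token to the current field, if any.
            # Tokens before the first label are ignored.
            if current is not None:
                result[current].append(tokens[i])
            i += 1
    return {key: " ".join(values) for key, values in result.items()}

record = tokens_to_dict(tokens, LABELS)
print(json.dumps(record, indent=2))
```

Because the labels are looked up by content and not by list position, the same table should keep working when a value has more or fewer tokens, as long as the sheet uses the same labels.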

Is there a way to take a list of strings and create a JSON file (where both the keys and the values are list items)?
