早安,
我试图从这个site中提取数据。我试图获取每个搜索结果的日期,创建者,相关性,描述,主题,受众和访问,并将其放入我的postgres数据库。问题是有时会遗漏Discription。因此,有时结果有6条记录,有时结果有7条记录。
所以我的问题是:如果不存在,我如何为Discription制作一个空的结果。欢迎任何提示如何操作!
到目前为止我的脚本是这样的。如果结果总是有7条记录,那么它会填满数据库(我测试了三条,记住这一点)
import urllib.parse
import urllib.request
import re
import sys
import psycopg2 as dbapi
url = 'https://easy.dans.knaw.nl/ui/'
values = {'wicket:bookmarkablePage':':nl.knaw.dans.easy.web.search.pages.PublicSearchResultPage',
'q' : 'opgraving'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
req = urllib.request.Request(url,data, headers =headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
saveRecord= open('C:/Users/berend/Desktop/record.txt','w')
record = re.findall(r'<dd>(.*?)</dd>',str(respData))
for item in record:
saveRecord.write("%s\n" % item)
saveRecord.close()
fin = open("C:/Users/berend/Desktop/record.txt",'r')
fit = open("C:/Users/berend/Desktop/record_schoon.txt",'w')
delete_list = ['</em>', '[',']','<em>','</span>', '<span>', '\\n']
for line in fin:
for word in delete_list:
line = line.replace(word, "")
fit.write(line)
fin.close()
fit.close()
open_record= open('C:/Users/berend/Desktop/record_schoon.txt','r')
content = list(open_record)
print(len(content))
open_record.close()
n = 3
for i in range(0, len(content), 3):
q= content[i:i+n]
con = dbapi.connect(database='import', user='postgres', password='xxx')
cur = con.cursor()
cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
con.commit()
前3个结果:
2000
Groenewoudt, B.J.; Deeben, J.H.C.; Velde, H.M. van der
100% relevant
Na verkennend onderzoek in 1996 en een grootschalige opgraving met uitgebreid bodemkundig
opgraving
Archaeology
Open (registered users)
2001-09
Peters, F.J.C.; Peeters, J.H.M.
100% relevant
opgraving
Archaeology
Open (registered users)
2008
Jacobs, E.; Burnier, C.Y.
100% relevant
OPGRAVING
Archaeology
Open (registered users)
答案 0 :(得分:0)
在这种情况下,我会使用pandas和sqlalchemy库来提高效率。我建议使用其他套餐的解决方案,因为您没有指定&#34;不是&#34;使用它们。
而不是:
n = 3
for i in range(0, len(content), 3):
q= content[i:i+n]
con = dbapi.connect(database='import', user='postgres', password='xxx')
cur = con.cursor()
cur.execute("INSERT into import VALUES (%s,%s,%s)",q)
con.commit()
使用类似的东西:
import pandas as pd
from sqlalchemy import create_engine
# create a connection engine using sqlalchemy
engine = sqla.create_engine('postgresql+psycopg2://postgres:xxx@localhost/import', echo=False)
# read the results file into a pandas DataFrame
df = pd.read_csv('C:/Users/berend/Desktop/record_schoon.txt', delimiter='\t') # or whatever your delimiter is
dfFill = df.fillna("") # "" will be blank space when any record is missing data or 'nan'
dfFill.to_sql("tablename", engine, if_exists="append") #change tablename to the name of your table in import
HTH