I am trying to write multiple rows to a CSV file with Python, and I have been piecing this code together for a while to work out how to do it. My goal is simply to query the Oxford English Dictionary website and break the page of words created in a given year out into a csv file. I want each row to start with the year I searched and then list all the words across the row. Then I want to be able to repeat that for multiple years.
Here is my code so far:
import requests
import re
import urllib2
import os
import csv
year_search = 1550
subject_search = ['Law']
path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}
request = urllib2.Request('http://www.oed.com/', None, header)
f = opener.open(request)
data = f.read()
f.close()
print 'database first access was successful'
resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
outputw = open(resultPath, 'w')
outputh = open(htmlPath, 'w')
request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='+str(year_search)+'&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='+str(subject_search)+'&type=dictionarysearch', None, header)
page = opener.open(request)
urlpage = page.read()
outputh.write(urlpage)
new_word = re.findall(r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
print str(new_word)
outputw.write(str(new_word))
page.close()
outputw.close()
This outputs the string of words I have identified for 1550. I then tried writing code to put it into a csv file on my computer, which it does, but there are two things I want to do that I am messing up on here:
The next part of my code:
with open('OED_table.csv', 'w') as csvfile:
    fieldnames = ['year_search']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'year_search': new_word})
I used the csv module's online documentation as a reference for the second part of the code. To clarify, I have included the first part of my code to give perspective.
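For the layout described at the top (each row beginning with the searched year, then every word across the row), a plain csv.writer is a closer fit than DictWriter, since the DictWriter call above would put the entire list into a single cell. A minimal sketch in Python 3 syntax, with a placeholder new_word list standing in for the scraped results:

```python
import csv
import io

year_search = 1550
new_word = ['accomplice', 'baton', 'civilist']  # placeholder for the scraped words

buf = io.StringIO()  # stands in for open('OED_table.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerow([year_search] + new_word)  # year first, then the words across the row

print(buf.getvalue().strip())  # 1550,accomplice,baton,civilist
```

Repeating the writerow call once per year would then give one row per year, as described.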
Answer 0 (score: 3)
You really shouldn't be parsing html with regular expressions. That said, here is how to modify your code to produce a csv file of all the words found.
Note: for reasons unknown, the length of the resulting word list varies from one execution to the next.
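As an illustration of the parser-based alternative, here is a sketch using the standard library's HTMLParser on a made-up fragment shaped like the markup the regex targets (on a real page, a library like BeautifulSoup would be the usual choice):

```python
from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser

class HwExtractor(HTMLParser):
    """Collects the text content of every <span class="hw"> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_hw = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'hw') in attrs:
            self.in_hw = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_hw = False

    def handle_data(self, data):
        if self.in_hw:
            self.words.append(data)

# Made-up fragment mimicking the markup matched by the regex above
sample = ('<span class="hwSect"><span class="hw">accomplice</span></span>'
          '<span class="hwSect"><span class="hw">baton</span></span>')
parser = HwExtractor()
parser.feed(sample)
print(parser.words)  # ['accomplice', 'baton']
```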
import csv
import os
import re
import requests
import urllib2
year_search = 1550
subject_search = ['Law']
path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}
# commented out because not used
#request = urllib2.Request('http://www.oed.com/', None, header)
#f = opener.open(request)
#data = f.read()
#f.close()
#print 'database first access was successful'
resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
request = urllib2.Request(
    'http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='
    + str(year_search)
    + '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='
    + str(subject_search)
    + '&type=dictionarysearch', None, header)
page = opener.open(request)
with open(resultPath, 'wb') as outputw, open(htmlPath, 'w') as outputh:
    urlpage = page.read()
    outputh.write(urlpage)
    new_words = re.findall(
        r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
    print new_words
    csv_writer = csv.writer(outputw)
    for word in new_words:
        csv_writer.writerow([year_search, word])
Here is what the OED_table.csv file contains when it works:
1550,above bounden
1550,accomplice
1550,baton
1550,civilist
1550,garnishment
1550,heredity
1550,maritime
1550,municipal
1550,nil
1550,nuncupate
1550,perjuriously
1550,rank
1550,semi-
1550,torture
1550,unplace
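To repeat the search over several years, as the question asks, the fetch-and-parse steps can be wrapped in a loop. A sketch in Python 3 syntax, with a hypothetical fetch_words stub standing in for the real HTTP request and regex steps:

```python
import csv
import io

def fetch_words(year):
    """Stub for the request/parse steps above; a real version would fetch
    http://www.oed.com/search?...&dateFilter=<year> and extract the words."""
    sample = {1550: ['accomplice', 'baton'], 1551: ['garnishment']}
    return sample.get(year, [])

buf = io.StringIO()  # stands in for open('OED_table.csv', 'w', newline='')
csv_writer = csv.writer(buf)
for year in range(1550, 1552):
    # one row per year: the year first, then every word for that year
    csv_writer.writerow([year] + fetch_words(year))

print(buf.getvalue())
```

Each iteration produces one row, so the file ends up with one line per searched year rather than one line per word.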