I am counting the number of contractions in certain presidential speeches and want to output these counts to a CSV or text file. Here is my code:
import urllib2,sys,os,csv
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
import math, functools
import summarize
reload(sys)

def processURL_short(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    return item_str

every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
                   'http://www.millercenter.org/president/obama/speeches/speech-4424',
                   'http://www.millercenter.org/president/obama/speeches/speech-4453',
                   'http://www.millercenter.org/president/obama/speeches/speech-4612',
                   'http://www.millercenter.org/president/obama/speeches/speech-5502']

data = {}
count = 0
for l in every_link_test:
    content_1 = processURL_short(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president,speech_num)
    data[filename] = count
    print count, filename
    with open('contraction_counts.csv','w',newline='') as fp:
        a = csv.writer(fp,delimiter = ',')
        a.writerows(data)
Running the for loop prints:
79 obama_speech-4427
101 obama_speech-4424
101 obama_speech-4453
182 obama_speech-4612
224 obama_speech-5502
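I would like to export this to a text file in which the number on the left is one column and the president/speech number is the second column. My with statement just writes each row to a separate file, which is definitely not ideal.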
Answer 0 (score: 1)
You could try something like this. It is a generic approach that you can adapt as needed:
import csv
# Python 3: open in text mode with newline=''; on Python 2 use mode 'wb' instead
with open('somepath/file.txt', 'w', newline='') as outfile:
    w = csv.writer(outfile)
    w.writerow(['header1', 'header2'])
    for i in your_data_structure:  # assuming a list of two-item rows
        w.writerow([
            i[0],
            i[1],
        ])
Or, if it is a dictionary:
import csv
# Python 3: open in text mode with newline=''; on Python 2 use mode 'wb' instead
with open('somepath/file.txt', 'w', newline='') as outfile:
    w = csv.writer(outfile)
    w.writerow(['header1', 'header2'])
    for k, v in your_dictionary.items():  # assuming a dictionary of key -> value
        w.writerow([
            k,
            v,
        ])
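Applied to the data in the question, the dictionary variant might look like the sketch below. The dictionary literal reuses the counts printed in the question; the header names, the helper `write_counts`, and the temporary output location are assumptions made for illustration. The count is written first so that it lands in the left-hand column, as the question asks. The sketch assumes Python 3.

```python
import csv
import os
import tempfile

# Counts copied from the output printed in the question (speech name -> count).
data = {
    'obama_speech-4427': 79,
    'obama_speech-4424': 101,
    'obama_speech-4453': 101,
    'obama_speech-4612': 182,
    'obama_speech-5502': 224,
}


def write_counts(path, counts):
    """Write one row per speech: count in column 1, speech name in column 2."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'w', newline='') as outfile:
        w = csv.writer(outfile)
        w.writerow(['count', 'speech'])  # hypothetical header names
        for name, count in counts.items():
            w.writerow([count, name])


# Write into a temporary directory so the sketch leaves no files elsewhere.
out_path = os.path.join(tempfile.mkdtemp(), 'contraction_counts.csv')
write_counts(out_path, data)

# Read the file back to inspect what was written.
with open(out_path, newline='') as infile:
    rows = list(csv.reader(infile))
```

Reading the file back returns every field as a string, so the counts would need `int()` if they are used numerically later.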
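Answer 1 (score: 1)
Your problem is that you open the output file in w mode inside the loop, which means it is truncated on every iteration. You can easily fix this in two ways:
Move the open outside the loop (the normal way). You open the file only once, add a row on each iteration, and the file is closed when you exit the with block: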
with open('contraction_counts.csv','w',newline='') as fp:
    a = csv.writer(fp,delimiter = ',')
    for l in every_link_test:
        content_1 = processURL_short(l)
        for word in content_1.split():
            word = word.strip(p)
            if word in contractions:
                count = count + 1
        splitlink = l.split("/")
        president = splitlink[4]
        speech_num = splitlink[-1]
        filename = "{0}_{1}".format(president,speech_num)
        data[filename] = count
        print(count, filename)
        # one row per speech: count in the first column, name in the second
        a.writerow([count, filename])
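Alternatively, open the file in a (append) mode. On each iteration you reopen the file and write at the end instead of erasing it. This approach uses more I/O resources because of the repeated open/close, and it is only worthwhile if the program may be interrupted and you want to be sure that everything written before a crash has actually been saved to disk: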
for l in every_link_test:
    content_1 = processURL_short(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president,speech_num)
    data[filename] = count
    print(count, filename)
    # reopen in append mode so earlier rows are kept
    with open('contraction_counts.csv','a',newline='') as fp:
        a = csv.writer(fp,delimiter = ',')
        a.writerow([count, filename])
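The difference between the three behaviours can be checked without fetching any speeches. The sketch below is only a stand-in: `speech_counts` replaces the scraping loop with a few (count, filename) pairs taken from the output printed in the question, and the helper functions and file names are hypothetical. It assumes Python 3.

```python
import csv
import os
import tempfile

# Stand-in for the scraping loop: (count, filename) pairs from the question's output.
speech_counts = [
    (79, 'obama_speech-4427'),
    (101, 'obama_speech-4424'),
    (101, 'obama_speech-4453'),
]


def write_mode_in_loop(path):
    """The original problem: 'w' truncates the file on every iteration."""
    for count, filename in speech_counts:
        with open(path, 'w', newline='') as fp:
            csv.writer(fp).writerow([count, filename])


def open_once(path):
    """Open the file once, then write one row per iteration."""
    with open(path, 'w', newline='') as fp:
        writer = csv.writer(fp)
        for count, filename in speech_counts:
            writer.writerow([count, filename])


def append_in_loop(path):
    """Reopen in append mode each time; existing rows are kept."""
    for count, filename in speech_counts:
        with open(path, 'a', newline='') as fp:
            csv.writer(fp).writerow([count, filename])


def read_rows(path):
    """Read a CSV file back as a list of rows."""
    with open(path, newline='') as fp:
        return list(csv.reader(fp))


tmp_dir = tempfile.mkdtemp()
truncated_path = os.path.join(tmp_dir, 'truncated.csv')
once_path = os.path.join(tmp_dir, 'once.csv')
appended_path = os.path.join(tmp_dir, 'appended.csv')

write_mode_in_loop(truncated_path)
open_once(once_path)
append_in_loop(appended_path)

truncated_rows = read_rows(truncated_path)  # only the last speech survives
once_rows = read_rows(once_path)            # one row per speech
appended_rows = read_rows(appended_path)    # one row per speech
```

Two details are worth keeping in mind. Append mode keeps adding to a file that already exists, so running the program twice leaves duplicated rows unless the file is removed first. Also, each row is written with `writerow([count, filename])`; passing a dictionary to `writerows` iterates over its keys and splits each string key into single-character fields.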