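I have a question about how to read a CSV file correctly so that I can apply the techniques available in NLTK. My goal is to process the CSV file row by row rather than as a single block of text.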
My first attempt was: file = open("data/MyFile.csv")
My .csv file has 40k+ rows, and I realized this approach did not suit my purpose, so I changed it to:
import csv
import preprocessing
from preprocessing import PreProcessing
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
with open("data/MyFile.csv", 'rb') as csvfile:
    # I had to remove the Sniffer: even without specifying a delimiter, it raised an error saying it could not find one.
    #dialect = csv.Sniffer().sniff(csvfile.read(1024))
    #csvfile.seek(0)
    lower_stream = (line.lower() for line in csvfile)  # Normalizing: convert all text to lowercase
    # Reading the file
    corpus = csv.DictReader(unicode_csv_reader(lower_stream), fieldnames='status_message', dialect='excel')
def status_processing(corpus):
    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus
    myCorpus.initial_processing()
fieldnames='status_message' is the field I want to read. status_message is the header that identifies the text contained in the CSV.
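As a side note on the fieldnames argument, here is a minimal sketch of how csv.DictReader treats a plain string compared with a list. It is written for Python 3, where the csv module reads text directly (an assumption, since the code above targets Python 2.7), and the one-row sample data is made up for illustration.

```python
import csv
import io

# Hypothetical one-row sample; not taken from the real file
data = "hello world\n"

# fieldnames given as a plain string: the reader iterates over it,
# so the dictionary keys are single characters, not the column name
as_string = next(csv.DictReader(io.StringIO(data), fieldnames="status_message"))

# fieldnames given as a list: the first column is keyed by the full name
as_list = next(csv.DictReader(io.StringIO(data), fieldnames=["status_message"]))

print(sorted(as_string.keys()))
print(as_list)
```

If the file already has a header row, leaving fieldnames out lets DictReader take the names from that first row instead.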
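After that, I started applying techniques to make my text easier to use with NLTK, one of which is BeautifulSoup.
The way I do this is shown in def status_processing(corpus).
The method it calls, which lives in another script, is built as follows: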
tokens = None
def initial_processing(self):
    soup = BeautifulSoup(self.text,"html.parser")
    self.text = soup.get_text()
    # TODO: change here if you want to keep the links
    self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", self.text)
    self.tokens = self.tokenizing(1, self.text)
    pass
When I run the script, the following error message is displayed:
line 39, in initial_processing
    soup = BeautifulSoup(self.text,"html.parser")
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 176, in __init__
    elif len(markup) <= 256:
AttributeError: DictReader instance has no attribute '__len__'
What should I do to read the CSV file row by row without getting this error?
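The traceback shows BeautifulSoup calling len() on the DictReader object itself, so one possible direction is to loop over the reader and hand over one cell's string at a time. The following is only a sketch under assumptions: it uses Python 3 (where csv handles Unicode without the encoder helpers), the helper name iter_status_messages is hypothetical, and an in-memory sample stands in for data/MyFile.csv.

```python
import csv
import io

def iter_status_messages(csvfile, column="status_message"):
    """Yield the lowercased text of one column, one row at a time.

    Hypothetical helper, not part of csv or NLTK.
    """
    reader = csv.DictReader(csvfile)  # the header row supplies the field names
    for row in reader:
        text = row.get(column)
        if text:  # skip rows where the cell is missing or empty
            yield text.lower()

# Made-up sample standing in for data/MyFile.csv
sample = io.StringIO(
    'status_message,likes\n'
    '"Hello <b>World</b>",3\n'
    '"Visit http://example.com now",5\n'
)

messages = list(iter_status_messages(sample))
for message in messages:
    print(message)
```

Each yielded string, rather than the reader, would then be what gets assigned to myCorpus.text before calling initial_processing(). With a real file, the loop would need to run inside the with block, since the reader pulls lines lazily from the open file.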