Question

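I have a question about how to read a CSV file correctly so that I can apply the techniques available in NLTK. My goal is to process the CSV file row by row rather than as a single block of text.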

My first attempt was: file = open("data/MyFile.csv"). My .csv file has 40k+ rows. I realized this approach did not suit my purpose, so I changed it to:

import csv
import preprocessing
from preprocessing import PreProcessing
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

with open("data/MyFile.csv", 'rb') as csvfile:
    # I had to remove the Sniffer: even without my specifying a delimiter,
    # it raised an error saying it could not determine the delimiter.
    #dialect = csv.Sniffer().sniff(csvfile.read(1024))
    #csvfile.seek(0)
    lower_stream = (line.lower() for line in csvfile)  # Normalize: convert all text to lowercase
    # Read the file
    corpus = csv.DictReader(unicode_csv_reader(lower_stream), fieldnames='status_message', dialect='excel')

def status_processing(corpus):

    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus
    myCorpus.initial_processing()

fieldnames='status_message' is the field I want to read. status_message is the header that identifies the text column in the CSV.
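
One thing that may be worth checking here: `csv.DictReader` expects `fieldnames` to be a sequence of column names, and a plain string is itself a sequence of characters, so each character can end up being used as a key. A minimal sketch of the difference, assuming Python 3 (the question uses Python 2.7) and made-up sample data:

```python
import csv
import io

# Hypothetical sample data: a header row followed by two status messages.
sample = "status_message\nhello world\nsecond post\n"

# fieldnames passed as a string: the row keys are single characters,
# and the header line is read as if it were a data row.
bad_reader = csv.DictReader(io.StringIO(sample), fieldnames="status_message")
bad_row = next(bad_reader)

# fieldnames omitted: DictReader takes the column names from the header row.
good_reader = csv.DictReader(io.StringIO(sample))
good_rows = [row["status_message"] for row in good_reader]

print(sorted(bad_row.keys()))
print(good_rows)
```

If the file had no header row, passing a list such as `fieldnames=['status_message']` would be the equivalent choice.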

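After that, I started applying techniques to make the text easier to use with NLTK. One of them is BeautifulSoup.

The way I do this is shown in def status_processing(corpus).

The method it calls, in another script, is written as follows: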

    tokens = None
    def initial_processing(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        # TODO: change this if you want to keep the links
        self.text = re.sub(r'(http://|https://|www\.)[^"\' ]+', " ", self.text)
        self.tokens = self.tokenizing(1, self.text)

With this, when I run the script, the following error message is shown:

line 39, in initial_processing
    soup = BeautifulSoup(self.text,"html.parser")
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 176, in __init__
    elif len(markup) <= 256:
AttributeError: DictReader instance has no attribute '__len__'

What should I do to read the CSV file row by row without getting this error?
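
For what it's worth, the traceback suggests that `BeautifulSoup` calls `len()` on whatever it receives, while a `DictReader` is an iterator over rows rather than a string. One possible approach is to loop over the reader and hand over one cell at a time. A minimal sketch, assuming Python 3 (where the `csv` module handles Unicode itself, so the encoder helpers are not needed), a header row that contains `status_message`, and a hypothetical helper name `iter_status_messages`:

```python
import csv


def iter_status_messages(path, column="status_message"):
    """Yield the lowercased text of one column, one CSV row at a time."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)  # column names come from the header row
        for row in reader:
            text = row.get(column)
            if text:  # skip rows where the cell is missing or empty
                yield text.lower()
```

Each yielded value is a plain `str`, so inside a loop such as `for text in iter_status_messages("data/MyFile.csv"):` it could be passed to `BeautifulSoup(text, "html.parser")`, and only one row is held in memory at a time.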

Python2.7 - CSV DictReader

0 answers: