My goal is to read a 6,000-record CSV file into an array, clean and normalize it, and then load it into a gensim corpora.Dictionary() so I can use doc2bow and run a SparseMatrixSimilarity query. Reading the CSV file worked at first: the code below printed an array I call "definitions", one entry for each of the 6,000 sub-category records.
import csv

f = open('test.csv')
csv_f = csv.reader(f)
definitions = []
for row in csv_f:
    # The definition text is in the third column.
    definitions.append(row[2])
print(definitions)
But then I hit a wall with UTF-8 and ASCII errors; gensim expects strict UTF-8 input.
After several hours on Stack Overflow and in the Python csv documentation, trying a few of the "UTF-8" encoder recipes there, I read that since Python 2.7's csv module doesn't handle Unicode out of the box, I could use the codecs package instead.
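For reference, this is roughly the pattern I understood from the docs: read the raw bytes with csv and decode each cell afterward on Python 2, or open the file as text with an explicit encoding on Python 3. The helper name `read_definitions` is mine, just for illustration:

```python
import csv
import io
import sys

def read_definitions(path):
    """Return the third column of each CSV row as unicode text."""
    if sys.version_info[0] < 3:
        # Python 2: the csv module wants bytes, so read binary
        # and decode each cell after parsing.
        with open(path, 'rb') as f:
            return [row[2].decode('utf-8') for row in csv.reader(f)]
    # Python 3: open as text with an explicit encoding; csv handles the rest.
    with io.open(path, 'r', encoding='utf-8', newline='') as f:
        return [row[2] for row in csv.reader(f)]
```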
I figured that instead of going back over every line of my original 6,000-line "definitions" array and decoding it, I could decode right off the bat using codecs. However, the code below fails to write anything to my definitions array. Being a newbie, I imagine I may be using codecs the wrong way, and/or closing the file the wrong way.
import codecs
import csv

with codecs.open('test.csv', 'rb', encoding='utf-8') as f:
    csv_f = csv.reader(f)
    definitions = []
    for row in csv_f:
        definitions.append(row[2])
    # No explicit f.close() needed: the with-block closes the file.
print(definitions)
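Once the rows are in, the "clean and normalize" step I mentioned at the top could look something like the sketch below: Unicode-normalize, lowercase, and collapse whitespace. The function name and the exact rules are just illustrative assumptions; my real cleaning may need more:

```python
import re
import unicodedata

def normalize_definition(text):
    # NFKC folds compatibility characters (e.g. circled digits, non-breaking
    # spaces) into their plain equivalents before further cleaning.
    text = unicodedata.normalize('NFKC', text)
    text = text.lower()
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r'\s+', u' ', text).strip()
    return text

print(normalize_definition(u'  Caf\u00e9  \u2460 Definition '))
```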
I am a total newbie, so apologies for any errors in my description. I'm learning as I go and really appreciate any feedback and help. Perhaps I'm going about this the wrong way entirely, and I welcome any pointers. Thank you again.