I apologize if this is too basic... Basically, I am using pandas to load a huge CSV file and then converting it to a numpy array for post-processing. I appreciate any help!
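The problem is that some strings get cut off during the conversion (from pandas dataframe to numpy array). For example, the string in the "abstract" column is complete; see the output of print datafile["abstract"][0] below. However, once I convert the data to a numpy array, only the first part of the string is left; see the output of print df_all[0,3] below.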
import pandas as pd
import csv
import numpy as np
datafile = pd.read_csv(path, header=0)
df_all = pd.np.array(datafile, dtype='string')
header_t = list(datafile.columns.values)
print datafile["abstract"][0]
In order to test the widely held assumption that homeopathic medicines contain negligible quantities of their major ingredients, six such medicines labeled in Latin as containing arsenic were purchased over the counter and by mail order and their arsenic contents measured. Values determined were similar to those expected from label information in only two of six and were markedly at variance in the remaining four. Arsenic was present in notable quantities in two preparations. Most sales personnel interviewed could not identify arsenic as being an ingredient in these preparations and were therefore incapable of warning the general public of possible dangers from ingestion. No such warnings appeared on the labels.
print df_all[0,3]
In order to test the widely held assumption that homeopathic me
Answer 0 (score: 3)
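I think that when you specify dtype='string', you are essentially specifying the default S64 type, which truncates strings to 64 characters. Just skip the dtype='string' part and you should be fine (the dtype will become object).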
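To illustrate the difference, here is a minimal sketch (Python 3, NumPy only). The 200-character long_text is a made-up stand-in for one abstract, and "S64" is written out explicitly rather than relying on whatever dtype='string' resolves to in a given NumPy version.

```python
import numpy as np

# Hypothetical long string standing in for one abstract (200 characters).
long_text = "a" * 200

# A fixed-width dtype keeps only as many bytes per element as its width allows,
# so anything past 64 bytes is silently dropped.
fixed = np.array([long_text], dtype="S64")

# With dtype=object, each element is a reference to the full Python string.
flexible = np.array([long_text], dtype=object)

print(len(fixed[0]))     # 64
print(len(flexible[0]))  # 200
```

The truncation raises no error or warning, which is why it is easy to miss until the values are printed.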
Better yet, instead of converting the DataFrame to an array yourself, use the built-in df.values.
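A small sketch of that approach (Python 3). The in-memory CSV and its two columns are invented for illustration; the real file and path from the question are not reproduced here.

```python
import io

import pandas as pd

# Hypothetical two-column CSV standing in for the real file.
csv_text = "id,abstract\n1," + "b" * 200 + "\n"
datafile = pd.read_csv(io.StringIO(csv_text), header=0)

# .values hands back the underlying data as a NumPy array without
# forcing a fixed-width string dtype, so long strings stay intact.
df_all = datafile.values

print(df_all.shape)       # (1, 2)
print(len(df_all[0, 1]))  # 200
```

Recent pandas versions also offer datafile.to_numpy(), which the pandas documentation recommends over .values; either should keep the full strings in a case like this.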