Strings lost when creating a NumPy array from a pandas DataFrame

Time: 2015-04-01 17:20:02

Tags: python arrays csv numpy pandas

Apologies if this is too basic. I am using pandas to load a huge CSV file and then converting it to a NumPy array for post-processing. Any help is appreciated!

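The problem is that some strings get cut off during the conversion from pandas DataFrame to NumPy array. For example, the string in the "abstract" column is complete in the DataFrame (see print datafile["abstract"][0] below), but once I convert it to a NumPy array, only the first part of the string remains (see print df_all[0,3] below).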

import pandas as pd
import csv
import numpy as np

datafile = pd.read_csv(path, header=0)
df_all = pd.np.array(datafile, dtype='string')
header_t = list(datafile.columns.values)

The string is complete in the pandas DataFrame:
print datafile["abstract"][0]
 In order to test the widely held assumption that homeopathic medicines contain negligible quantities of their major ingredients, six such medicines labeled in Latin as containing arsenic were purchased over the counter and by mail order and their arsenic contents measured. Values determined were similar to those expected from label information in only two of six and were markedly at variance in the remaining four. Arsenic was present in notable quantities in two preparations. Most sales personnel interviewed could not identify arsenic as being an ingredient in these preparations and were therefore incapable of warning the general public of possible dangers from ingestion. No such warnings appeared on the labels.

The string is incomplete in the NumPy array:
print df_all[0,3]
In order to test the widely held assumption that homeopathic me

1 answer:

Answer 0 (score: 3)

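I think that when you specify dtype='string', you are in effect asking for a fixed-width string type (S64 here), which truncates every string to 64 characters. Just drop the dtype='string' argument and you should be fine (the dtype will then be object, which stores each string in full).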

Better yet, rather than converting the DataFrame to an array yourself, use the built-in df.values attribute.
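A minimal sketch of the difference, assuming a recent Python 3 install of NumPy and pandas. The one-row DataFrame and the names long_text, fixed and as_object are made up for illustration, and the unicode dtype 'U64' stands in for the Python 2 'S64' mentioned above; both are fixed-width types that silently cut off longer strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a long abstract (100 characters).
long_text = "x" * 100
df = pd.DataFrame({"id": [1], "abstract": [long_text]})

# Fixed-width string dtype: each element holds at most 64 characters,
# so anything longer is silently truncated.
fixed = np.array(df["abstract"].tolist(), dtype="U64")

# No dtype forced: the mixed int/str columns come back as an object
# array, and each cell keeps the full Python string.
as_object = df.values

fixed_len = len(fixed[0])
object_len = len(as_object[0, 1])
print(fixed_len, object_len)
```

Comparing the two lengths shows that the truncation comes from the fixed-width dtype, not from pandas itself.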