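I want to import data from a text file and read it as a contiguous in-memory array. This is the data; each respondent is on its own line, separated by a return: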
['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']
['vrouw', 34, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']
['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']
['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']
['vrouw', 43, '3', 'sport', '2', '2', 'onbeantwoord', '']
['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']
['vrouw', 43, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']
I tried to import the data from the text file with the following code:
vragenlijst_data= np.genfromtxt('antwoorden.txt', delimiter=',', dtype=None, names=('geslacht', 'leeftijd', 'stelling1', 'doorvraag1', 'stelling2', 'stelling3', 'doorvraag3', 'opmerking'))
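However, this way I cannot use np.mean (from the NumPy library) in a vectorized way, because I do not have a contiguous in-memory array. Does anyone know a way to read the data so that I end up with a contiguous in-memory array (preferably a NumPy one)?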
Answer 0 (score: 1)
Using a copy-paste of your lines:
In [362]: txt
Out[362]: "['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 34, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 43, '3', 'sport', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 32, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n\n['vrouw', 43, '2', 'onbeantwoord', '3', '3', 'collega', 'nee']"
In [364]: data = np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None, encoding=None)
In [365]: data
Out[365]:
array([("['vrouw'", 43, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 34, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 43, " '3'", " 'sport'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 32, " '2'", " 'onbeantwoord'", " '2'", " '2'", " 'onbeantwoord'", " '']"),
("['vrouw'", 43, " '2'", " 'onbeantwoord'", " '3'", " '3'", " 'collega'", " 'nee']")],
dtype=[('f0', '<U8'), ('f1', '<i8'), ('f2', '<U4'), ('f3', '<U15'), ('f4', '<U4'), ('f5', '<U4'), ('f6', '<U15'), ('f7', '<U7')])
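The result is a 1-D structured array with a mix of string and numeric fields, which must be referenced by name rather than by column number.
'f1' is numeric because it has no quotes in the original, so you can select that field and easily take its mean: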
In [367]: data['f1']
Out[367]: array([43, 34, 32, 32, 43, 32, 43])
In [368]: np.mean(data['f1'])
Out[368]: 37.0
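genfromtxt does not strip the brackets, so the 'f0' strings still contain them.
The extra layer of quotes also makes it harder to convert the other fields to integers.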
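The leftover brackets and quotes can also be stripped after loading. The following is only a minimal sketch, not part of the original answer: the two-row `txt` is a shortened sample, and the variable names `stelling1` and `geslacht` are chosen here for illustration. It uses `np.char.strip` to remove the unwanted characters and `astype(int)` to convert the cleaned strings.

```python
import numpy as np

# Shortened sample in the same format as the question (illustrative only)
txt = ("['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']\n"
       "['vrouw', 34, '3', 'sport', '2', '2', 'collega', 'nee']")

data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, encoding=None)

# data['f2'] holds strings such as " '2'"; strip spaces and quotes, then convert
stelling1 = np.char.strip(data['f2'], " '").astype(int)

# data['f0'] holds strings such as "['vrouw'"; strip the bracket and quotes
geslacht = np.char.strip(data['f0'], "[' ")

print(stelling1)
print(geslacht)
```

Each field has to be cleaned separately this way, which is why a cleaner input file is usually the more comfortable route.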
If the file contained cleaner CSV values, it would be easier to read and use:
In [372]: txt1 = """vrouw, 43, 2, onbeantwoord, 2, 2, onbeantwoord, ''
...: vrouw, 34, 2, onbeantwoord, 2, 2, onbeantwoord, '' """
...:
In [373]: data1 = np.genfromtxt(txt1.splitlines(), delimiter=',', dtype=None, encoding=None)
In [374]: data1
Out[374]:
array([('vrouw', 43, 2, ' onbeantwoord', 2, 2, ' onbeantwoord', " ''"),
('vrouw', 34, 2, ' onbeantwoord', 2, 2, ' onbeantwoord', " ''")],
dtype=[('f0', '<U5'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<U13'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U13'), ('f7', '<U3')])
In [375]: data1['f0']
Out[375]: array(['vrouw', 'vrouw'], dtype='<U5')
In [376]: data1['f1']
Out[376]: array([43, 34])
In [377]: data1['f5']
Out[377]: array([2, 2])
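If a single contiguous numeric array is what is needed, the numeric fields of the structured array can be stacked into an ordinary 2-D array. This is a sketch under assumptions: the two-row `txt1` below is an illustrative sample, and the choice of `numeric_fields` (the columns that genfromtxt parsed as integers) is made here by hand.

```python
import numpy as np

# Illustrative sample in the cleaner CSV layout
txt1 = """vrouw, 43, 2, onbeantwoord, 2, 2, onbeantwoord, ''
vrouw, 34, 3, sport, 2, 3, collega, nee"""

data1 = np.genfromtxt(txt1.splitlines(), delimiter=',', dtype=None, encoding=None)

# Fields that came out as integers; picked manually for this sample
numeric_fields = ['f1', 'f2', 'f4', 'f5']

# Build a regular (rows x columns) integer array from those fields
numeric = np.column_stack([data1[name] for name in numeric_fields])

print(numeric.flags['C_CONTIGUOUS'])
print(numeric.mean(axis=0))  # per-column means
```

The string fields stay in the structured array, while `numeric` can be passed to vectorized functions such as np.mean with an axis argument.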
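Answer 1 (score: 0)
Your data is malformed; it looks like it is simply the output of print. I don't think you will find a library function that makes this data usable as-is (genfromtxt just builds an array from the malformed data). So here is a manual approach: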
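import re
import numpy as np

with open('antwoorden.txt') as f:
    lines = f.readlines()

vragenlijst = []
for line in lines:
    # Remove quotes, commas and brackets, then split on whitespace
    line = re.sub(r"[',\[\]]", '', line.strip()).split()
    if not line:
        continue  # skip blank lines between respondents
    if len(line) == 7:
        line += ['']  # the empty last answer disappears in the split
    vragenlijst.append(tuple(line))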
vragenlijst is now a Python list of 8-tuples in which every member is a string. NumPy structured arrays need tuples, so now you build your dtype:
vragenlijst_dtype = np.dtype([('geslacht', 'U10'), ('leeftijd', 'i4'),
('stelling1', 'U10'), ('doorvraag1', 'U10'), ('stelling2', 'U10'),
('stelling3', 'U10'), ('doorvraag3', 'U10'), ('opmerking', 'U10')])
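Here 'U10' means a Unicode string of at most 10 characters and 'i4' means a 4-byte integer. Change the types if they do not fit your actual data; for example, 'onbeantwoord' has 12 characters, so 'U10' would cut it short and 'U12' or wider is needed to keep it whole.
Then: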
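To see what the field widths mean in practice, here is a small sketch (the variable names `short` and `longer` are made up for this example). NumPy silently truncates strings that are longer than the declared width.

```python
import numpy as np

# 'onbeantwoord' has 12 characters
short = np.array(['onbeantwoord'], dtype='U10')   # room for 10 characters
longer = np.array(['onbeantwoord'], dtype='U12')  # room for 12 characters

print(short[0])                   # truncated to the first 10 characters
print(longer[0])                  # kept whole
print(np.dtype('i4').itemsize)    # size of an 'i4' value in bytes
```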
vragenlijst = np.array(vragenlijst, dtype=vragenlijst_dtype)
list_mean = np.mean(vragenlijst['leeftijd'])
list_mean is then 37.0.
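Since every line looks like a printed Python list, another option is to parse each line with `ast.literal_eval` from the standard library instead of a regular expression. This is a sketch and not part of the original answer: the helper `parse_row`, the two-row `sample`, and the choice to store the three stelling columns as integers are all assumptions made for illustration.

```python
import ast
import numpy as np

# Illustrative sample; in practice the lines would come from the file
sample = """['vrouw', 43, '2', 'onbeantwoord', '2', '2', 'onbeantwoord', '']

['vrouw', 34, '3', 'sport', '2', '2', 'collega', 'nee']"""

def parse_row(line):
    # Hypothetical helper: turn one printed list into a tuple
    values = ast.literal_eval(line)
    # Positions 2, 4 and 5 hold quoted numbers; convert them to int
    for i in (2, 4, 5):
        values[i] = int(values[i])
    return tuple(values)

rows = [parse_row(line) for line in sample.splitlines() if line.strip()]

dtype = np.dtype([('geslacht', 'U12'), ('leeftijd', 'i4'),
                  ('stelling1', 'i4'), ('doorvraag1', 'U12'),
                  ('stelling2', 'i4'), ('stelling3', 'i4'),
                  ('doorvraag3', 'U12'), ('opmerking', 'U12')])

vragenlijst = np.array(rows, dtype=dtype)
print(np.mean(vragenlijst['leeftijd']))
print(vragenlijst['opmerking'])
```

Compared with the regex approach, this keeps empty answers and answers containing spaces intact, but it only works while every line is a valid Python literal.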