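I have a dataset that I read line by line in Cython. Each line is returned as a string. What I want to do is convert that string into an array of numbers (integers and floats) whose length equals the number of columns in the line (columns are separated by the delimiter ';').
For example: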
import pandas as pd
import numpy as np
df = pd.DataFrame(np.c_[np.random.rand(3,2),np.random.randint(0,10,(3,2))], columns = ['a','b','c','d'])
filename = r'H:\mydata.csv'
df.to_csv(filename, sep=';', index=False)  # pass the variable, not the literal string 'filename'
Now I want to iterate over the rows in random order in Cython and do some computation on each row.
import numpy as np
from readc_csv import row_pos, read_file_and_compute
filename = r'H:\mydata.csv'
row_position = row_pos(filename)[:-1] # returns the position of the start
# of each row in the file
# (excluding the header)
rows = np.random.choice(row_position,size=len(row_position),replace=False)
read_file_and_compute(filename,rows)
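To illustrate what the positions returned by `row_pos` represent, here is a minimal pure-Python sketch of the same idea: record the byte offset at which each data row starts, then seek to those offsets in random order. The helper names `line_start_offsets` and `read_row_at` and the small demo file are hypothetical and are not part of `readc_csv`.

```python
import os
import random
import tempfile


def line_start_offsets(path):
    """Return the byte offset at which each data row (header excluded) starts."""
    offsets = []
    with open(path, "rb") as f:
        f.readline()                 # skip the header line
        while True:
            pos = f.tell()           # offset of the row about to be read
            if not f.readline():     # an empty bytes object means end of file
                break
            offsets.append(pos)
    return offsets


def read_row_at(path, offset):
    """Seek to a stored offset and return that row as a decoded string."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\r\n")


# Hypothetical demo file, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as f:
    f.write(b"a;b;c;d\n0.5;0.25;1.0;2.0\n0.1;0.2;3.0;4.0\n")

offsets = line_start_offsets(demo_path)
random.shuffle(offsets)              # visit the rows in random order
rows_seen = [read_row_at(demo_path, off) for off in offsets]
```

The file is opened in binary mode so that the offsets are plain byte counts and are not affected by newline translation.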
The readc_csv.pyx file looks like this:
from libc.stdio cimport FILE, fopen, fgets, fclose, fseek, SEEK_SET, ftell
import numpy as np
cimport numpy as np

def row_pos(str filename):
    filename_byte_string = filename.encode("UTF-8")
    cdef:
        char* fname = filename_byte_string
        FILE* cfile
        char line[50]
        list pos = []

    cfile = fopen(fname, "r")
    while fgets(line, 50, cfile)!=NULL:
        pos.append(ftell(cfile))
    fclose(cfile)
    return pos

def read_file_and_compute(str filename, int [:] rows):
    filename_byte_string = filename.encode("UTF-8")
    cdef:
        char* fname = filename_byte_string
        FILE* cfile
        char line[50]
        size_t j
        int n = rows.shape[0]

    cfile = fopen(fname, "r")
    for j in range(n):
        r = rows[j]
        fseek(cfile,r,SEEK_SET)
        fgets(line, 50, cfile)
        # line is now e.g.
        # '0.659933520847;0.471779123704;1.0;2.0\n'
        # I want to convert it into an array with 4 elements
        # each element corresponding to one of the numbers we
        # see in the string
        # and do some computations
    fclose(cfile)
    return
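(Note: the Cython code is not yet optimized.) Background: this is part of a script I want to write for stochastic gradient descent on a dataset that is too large to read into memory. I want to run the inner loop over the randomly ordered samples in Cython, so I need to be able to read the data from a given row of a csv file in Cython.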
Answer 0 (score: 0)
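I found the C functions strtok and atof, which can be cimported from libc.string and libc.stdlib respectively. They do the job.
Continuing the example above, the read_file_and_compute function could look like this: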
from libc.stdio cimport FILE, fopen, fgets, fclose, fseek, SEEK_SET
from libc.stdlib cimport malloc, free, atof
from libc.string cimport strtok

def read_file_and_compute(str filename, int [:] rows, int col_n):
    filename_byte_string = filename.encode("UTF-8")
    cdef:
        char* fname = filename_byte_string
        FILE* cfile
        char line[50]
        char *token
        double *col = <double *>malloc(col_n * sizeof(double))
        size_t j, i
        int count
        double num
        int n = rows.shape[0]

    if col == NULL:  # malloc failed
        raise MemoryError()
    cfile = fopen(fname, "r")
    for j in range(n):
        r = rows[j]
        fseek(cfile,r,SEEK_SET)
        fgets(line, 50, cfile)
        token = strtok(line, ';') # splits the string at the delimiter ';'
        count = 0
        while token!=NULL and count<col_n:
            num = atof(token) # converts the string into a float
            col[count] = num
            token = strtok(NULL,';\n') # next token; also stop at the newline
            count +=1
        # now do some computations on col ...
    fclose(cfile)
    free(col)
    return
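As a cross-check for the tokenising loop above, the same split-and-convert step can be sketched in pure Python. The function name `parse_row` is hypothetical; it is only meant as a reference to compare against the values that end up in `col`, not as a replacement for the Cython code.

```python
def parse_row(line, col_n, sep=";"):
    """Split one CSV line on `sep` and convert up to col_n fields to float."""
    fields = line.rstrip("\r\n").split(sep)
    return [float(tok) for tok in fields[:col_n]]


values = parse_row("0.659933520847;0.471779123704;1.0;2.0\n", 4)
```

Two behavioural differences are worth keeping in mind when comparing results: `strtok` treats consecutive delimiters as one, so empty fields are silently skipped, whereas `str.split` keeps them as empty strings; and `atof` returns 0.0 when no conversion is possible, whereas `float` raises `ValueError`.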
There are more functions for converting strings to other types; see here.
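Since the question asks for both integers and floats, a per-field conversion that keeps the two apart might look like the following Python sketch. The name `convert_field` is hypothetical. On the C side, the analogous choice would presumably be between functions such as `strtol` and `strtod` from `libc.stdlib`; that these are among the functions the answer refers to is an assumption.

```python
def convert_field(tok):
    """Return an int when the text is a plain integer, otherwise a float."""
    try:
        return int(tok)
    except ValueError:
        return float(tok)


mixed = [convert_field(t) for t in "0.25;7;1.0".split(";")]
```

Note that a field written as `1.0` stays a float here, because `int("1.0")` raises `ValueError` and the fallback is used.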