Converting a DistributedMatrix to a Scipy sparse matrix or Numpy array

Time: 2019-01-08 01:01:50

Tags: python numpy apache-spark scipy pyspark

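How can I convert a DistributedMatrix back into a Numpy array or a Scipy sparse matrix?

Obviously this is not something I would do on a large array, but it helps with debugging and testing code before running it on the real, large data.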

1 answer:

Answer 0: (score: 0)

Here is a naive conversion from IndexedRowMatrix and CoordinateMatrix to a Scipy sparse matrix:

IndexedRowMatrix

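    from scipy.sparse import lil_matrix

    def indexedrowmatrix_to_array(x):
        # Collects every row to the driver, so use this only on small matrices.
        output = lil_matrix((x.numRows(), x.numCols()))
        for indexed_row in x.rows.collect():
            # toArray() works for both DenseVector and SparseVector rows.
            output[indexed_row.index, :] = indexed_row.vector.toArray()
        return output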

CoordinateMatrix

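    from scipy.sparse import coo_matrix

    def coordinatematrix_to_array(x):
        # coo_matrix does not support item assignment, so build it in one
        # call from the collected (i, j, value) entries.
        entries = x.entries.collect()
        rows = [e.i for e in entries]
        cols = [e.j for e in entries]
        values = [e.value for e in entries]
        return coo_matrix((values, (rows, cols)), shape=(x.numRows(), x.numCols()))

You can do something similar for BlockMatrix by iterating over its blocks attribute and using the rowsPerBlock attribute (together with colsPerBlock) to assign the data in chunks.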
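The BlockMatrix case could be sketched roughly as follows. This is only a minimal sketch, not a tested Spark recipe: the helper names `blocks_to_sparse` and `blockmatrix_to_array` are made up for illustration, and it assumes each collected element of `blocks` has the form `((blockRowIndex, blockColIndex), matrix)` with a `toArray()` method on the matrix. The assembly step is kept separate from Spark so it can be tried with plain Numpy arrays.

```python
import numpy as np
from scipy.sparse import lil_matrix

def blocks_to_sparse(blocks, rows_per_block, cols_per_block, shape):
    """Hypothetical helper: place each sub-block at its offset in a lil_matrix."""
    output = lil_matrix(shape)
    for (block_row, block_col), block in blocks:
        # Spark matrices expose toArray(); plain Numpy arrays are used as-is.
        dense = np.asarray(block.toArray() if hasattr(block, "toArray") else block)
        r0 = block_row * rows_per_block
        c0 = block_col * cols_per_block
        # Edge blocks may be smaller than the nominal block size,
        # so slice by the block's actual shape.
        output[r0:r0 + dense.shape[0], c0:c0 + dense.shape[1]] = dense
    return output

def blockmatrix_to_array(x):
    """Hypothetical wrapper: collect all blocks to the driver (small matrices only)."""
    return blocks_to_sparse(
        x.blocks.collect(),
        x.rowsPerBlock,
        x.colsPerBlock,
        (x.numRows(), x.numCols()),
    )
```

Calling `.toarray()` on the result gives a dense Numpy array if that is more convenient for debugging. As with the other two functions, everything is collected to the driver, so this is only suitable for small test matrices.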