How can I explicitly build a sparse stringdistmatrix to avoid running out of memory?

Time: 2019-06-23 20:43:52

Tags: r sparse-matrix stringdist

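I am matching a large number of slightly varying restaurant names in a "data" vector against the corresponding entries of a "match" vector:

The stringdistmatrix function from the stringdist package is great, but it runs out of memory at roughly 10k x 10k, and my data is larger than that.

Trying as(stringdistmatrix(data, match),'sparseMatrix') would give the desired result, but it runs out of memory because the dense matrix is built first. So I would like to build the index pairs explicitly with sparseMatrix(i,j,x,dims,dimnames), with x computed by adist() or a similar string-distance function, in the hope that it fits in memory.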

R

data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy", 
          "Chipotle", "Chipotle's")

match <- c("McDonalds", "Wendys", "Chipotle")

Attempt:

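library(Matrix)
library(stringdist)

# Enumerate every (data, match) pair by position rather than by name
idx <- expand.grid(row = seq_along(data), col = seq_along(match))

# expand.grid varies `row` fastest, which matches the column-major
# order of the distance matrix returned by adist()
sparseMatrix(i = idx$row,
             j = idx$col,
             x = as.vector(ifelse(adist(data, match) < 2, 1, 0)),
             dims = c(length(data), length(match)),
             dimnames = list(data, match))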

The output should match:

library(stringdist)
as(ifelse(stringdistmatrix(data,match)<2,1,0),'sparseMatrix')

1 Answer:

Answer 0: (score: 1)

If I understand your question correctly, your task is to match dirty strings to clean strings. You do not need the whole matrix for that (and it would not really be sparse anyway). Instead, you can use amatch from the stringdist package.
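A minimal sketch of the amatch approach, using the example vectors from the question. The choice of method = "osa" and maxDist = 1 is an assumption made here to mirror the question's "distance < 2" threshold; it is not taken from the original answer. amatch returns, for each element of data, the position of its closest entry in match (or NA if none is within maxDist), so only one integer per input string is kept in memory. If an indicator matrix is still wanted, it can be assembled from those positions. Because amatch keeps only the best match per row, this matrix can differ from the fully thresholded distance matrix whenever several entries of match fall within the threshold.

```r
library(stringdist)
library(Matrix)

data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy",
          "Chipotle", "Chipotle's")
match <- c("McDonalds", "Wendys", "Chipotle")

# Position in `match` of the closest string within maxDist, NA if none.
# maxDist = 1 is an assumed threshold equivalent to "distance < 2".
hit <- amatch(data, match, method = "osa", maxDist = 1)

# Cleaned-up names (NA where nothing was close enough)
cleaned <- match[hit]

# Optional: indicator matrix built only from the matched pairs
ok <- !is.na(hit)
m <- sparseMatrix(i = which(ok),
                  j = hit[ok],
                  x = 1,
                  dims = c(length(data), length(match)),
                  dimnames = list(data, match))
```

Raising maxDist loosens the match; for instance, "Mc Donald's" and "Chipotle's" are two edits away from their clean forms and are only matched with maxDist = 2 or more.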