Question

I am working in Python, and I have a matrix stored in a text file. The text file has the following format:

row_id, col_id
row_id, col_id
...
row_id, col_id

row_id and col_id are integers ranging from 0 to n (to find n for row_id and for col_id, I have to scan the whole file first).

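There is no header, and row_ids and col_ids appear multiple times in the file, but each combination row_id, col_id appears only once. There is no explicit value for each combination row_id, col_id; every such cell simply has the value 1. The file is almost 1 GB in size.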

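Unfortunately, the file is hard to handle in memory: it has 2257205 row_ids and 122905 col_ids for 26622704 elements. So I have been looking for a better way to handle it. The Matrix Market format could be one way to deal with it.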

Is there a fast and memory-efficient way to convert this file to a file in Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?

Answer 1

There is a fast and memory-efficient way of handling such matrices: use the sparse matrices offered by SciPy (which is the de facto standard in Python for this kind of thing).

For a matrix of size N by N:

from scipy.sparse import lil_matrix

result = lil_matrix((N, N))  # In order to save memory, one may add: dtype=bool, or dtype=numpy.int8

with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
        result[x, y] = 1

(Or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1.)

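The argument 1 given to split() is only there to speed up parsing the coordinates: it tells Python to stop splitting the line once it has found the first (and only) comma. This can matter, since you are reading a 1 GB file.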
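As a small illustration of the point above, the following sketch shows what split(',', 1) returns for a single line. The sample line is made up for the example; note that int() accepts the trailing newline, so no explicit strip() is needed.

```python
# A made-up sample line in the "row_id,col_id" format, as read from the file.
line = '2257204,122904\n'

# maxsplit=1: stop after the first comma, so at most two pieces come back.
parts = line.split(',', 1)

# int() ignores surrounding whitespace, including the trailing newline.
row, col = map(int, parts)
```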

Depending on your needs, you might find that one of the other six sparse matrix representations offered by SciPy is better suited.

If you want a faster but also more memory-hungry array, you can use result = numpy.array(…) (with NumPy) instead.

Answer 2

Unless I am missing something...

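The MatrixMarket MM format is a line with the dimensions, followed by "row col value" lines. If you already have the rows and columns and all the values are 1, you only need to add the value.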

Wouldn't it be easier to simply use sed?

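If your coordinates are one-offset, that should work. If they are zero-offset, you need to add 1 to each coordinate: simply read the coordinates, add one to each, and print coordx, coordy, "1". You can do this easily from the shell, Awk, or Python.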
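The read, add one, and print idea can be sketched in plain Python without loading the matrix into memory. This is only a sketch under assumptions: the function name convert_to_matrix_market and its zero_based parameter are made up for the example, the dimensions are taken as the largest index seen plus the offset, and the output uses the "integer" field so that each entry carries an explicit 1. The file is read twice: once to compute the size line, once to write the entries.

```python
def convert_to_matrix_market(src_path, dst_path, zero_based=True):
    """Convert 'row_id,col_id' lines into a Matrix Market coordinate file.

    Hypothetical helper: streams the input twice and keeps only counters
    in memory. Returns (rows, cols, entries) as written in the size line.
    """
    offset = 1 if zero_based else 0  # Matrix Market indices start at 1

    # Pass 1: find the largest indices and count the entries.
    max_row = max_col = -1
    count = 0
    with open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            row, col = map(int, line.split(',', 1))
            max_row = max(max_row, row)
            max_col = max(max_col, col)
            count += 1

    rows, cols = max_row + offset, max_col + offset

    # Pass 2: write the banner, the size line, then one entry per input line.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.write('%%MatrixMarket matrix coordinate integer general\n')
        dst.write(f'{rows} {cols} {count}\n')
        for line in src:
            if not line.strip():
                continue
            row, col = map(int, line.split(',', 1))
            dst.write(f'{row + offset} {col + offset} 1\n')

    return rows, cols, count
```

A call such as convert_to_matrix_market('matrix.csv', 'matrix.mtx') would then produce the converted file; both file names are placeholders. Since every value is 1, the "pattern" field of the Matrix Market format, which omits the value column, might be a more compact alternative.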

Q&D code (untested, written only as a hint, YMMV, and you may want to preprocess the file to compute some values):

In shell

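n=$(wc -l < file)  # read from stdin so that wc prints only the count, not the file name
echo "%%MatrixMarket matrix coordinate integer general" > file.mm  # banner line required by the format
echo "2257205 122905 $n" >> file.mm
sed -e 's/,/ /' -e 's/$/ 1/' file >> file.mm  # "row,col" becomes "row col 1"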

With zero-offset coordinates, still in shell, more or less...

echo "2257205 122905 $n"
while IFS=, read -r x y ; do x=$((x+1)); y=$((y+1)); echo "$x $y 1" ; done < file

Or am I missing something?
