I am working on a project where I want to read and parse 100,000+ XML files totaling about 6 GB.
My questions:
1> Read and parse a single XML file (5 KB to 500 KB in size) within a few seconds, so that the complete set of XML files (100,000+ files, 6 GB) can be read & parsed in 3-5 hours.
2> What is the fastest way to do this?
Currently, reading and parsing a single XML file (5 KB to 500 KB) takes about a minute.
Regards, Mian
P.S. Please also take a look at the code:
<HTML>
<HEAD>
<META HTTP-EQUIV="CACHE-CONTROL" CONTENT="NO-CACHE">
<META HTTP-EQUIV="EXPIRES" CONTENT="0">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><style type="text/css">
<!--
body,td,th {
color: #CCCCCC;
}
body {
background-color: #000066;
}
-->
</style>
<script>
<!--
/*
Auto Refresh Page with Time script By JavaScript Kit (javascriptkit.com) Over 200+ free scripts here!
*/
//enter refresh time in "minutes:seconds" format. Minutes range from 0 to infinity; seconds from 0 to 59
var limit="00:10"
if (document.images){
var parselimit=limit.split(":")
parselimit=parselimit[0]*60+parselimit[1]*1
}
function beginrefresh(){
if (!document.images)
return
if (parselimit==1)
window.location.reload()
else{
parselimit-=1
curmin=Math.floor(parselimit/60)
cursec=parselimit%60
if (curmin!=0)
curtime=curmin+" minutes and "+cursec+" seconds left until page refresh!"
else
curtime=cursec+" seconds left until page refresh!"
window.status=curtime
setTimeout("beginrefresh()",1000)
}
}
window.onload=beginrefresh
//-->
</script>
</HEAD>
<BODY>
<?php
require("MagicParser.php");
//header("Content-Type: text/plain");
$dbServer = "127.0.0.1";
$dbUser = "root";
$dbPass = "";
$dbName = "GDatabase";
$text = '';
$c = mysql_connect($dbServer, $dbUser, $dbPass) or die("Couldn't connect to database");
$d = mysql_select_db($dbName) or die("Couldn't select database");
//mysql_query("SET NAMES utf8;");
//mysql_query("SET CHARACTER_SET utf8;");
$sql = "select
id, file_name
from
tableP_files
where status = '' limit 1";
$result = mysql_query($sql,$c);
while($row = mysql_fetch_array($result))
{
$id = $row['id'];
$file_name = $row['file_name'];
$url = 'http://localhost/GDatabase/XML/' . $file_name;
}
$formatString = MagicParser_getFormat($url);
$update_query = "update tableP_files set format_string = '$formatString' where id = $id";
if(!mysql_query($update_query,$c))
{
echo 'ERROR';
}
print "Format String: ".$formatString."\n\n";
// MagicParser_parse($url,"myRecordHandler",$formatString);
// MagicParser_parse($url,"myRecordHandler","xml|ARTICLE/FLOATS-WRAP/TABLE-WRAP/TABLE/TBODY/TR/TD/");
MagicParser_parse($url,"myRecordHandler","xml|ARTICLE/");
// Record handler: MagicParser calls this once for every record it parses
function myRecordHandler($record)
{
$dbServer = "127.0.0.1";
$dbUser = "root";
$dbPass = "";
$dbName = "GDatabase";
// NOTE: this reconnects to MySQL and repeats the SELECT for every single
// record, which adds substantial per-record overhead
$c = mysql_connect($dbServer, $dbUser, $dbPass) or die("Couldn't connect to database");
$d = mysql_select_db($dbName) or die("Couldn't select database");
mysql_query("SET NAMES utf8;");
mysql_query("SET CHARACTER_SET utf8;");
$sql = "select
id, file_name
from
tableP_files
where status = '' limit 1";
$result = mysql_query($sql,$c);
while($row = mysql_fetch_array($result))
{
$id = $row['id'];
$file_name = $row['file_name'];
$file_name = 'http://localhost/GDatabase/test/' . $file_name;
}
foreach($record as $key => $value)
{
$tag = addslashes($key);
$value = addslashes($value);
$insert_query = "insert into tableP_xml set file_id = '$id', file_name = '$file_name', tag = '$tag', value = '$value', status = ''";
if(!mysql_query($insert_query,$c))
{
echo 'ERROR';
}
}
$update_query = "update tableP_files set status = 'done' where id = $id";
if(!mysql_query($update_query,$c))
{
echo 'ERROR';
}
echo "Done: " . $id . " - " . $file_name;
return TRUE;
}
?>
</BODY>
</HTML>
Answer 0 (score: 1)
I just created 100,000 XML files of 60 KB each and tried reading them in PHP with file_get_contents; it took 87.5 seconds. Mind you, that's on an SSD with plenty of RAM and a powerful 4th-generation i5 processor. So it takes roughly 90 seconds just to load everything into memory.
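For reference, a minimal sketch of that benchmark; the directory path is a placeholder:

<?php
// Time how long it takes to load every XML file in a directory into memory.
$start = microtime(true);
$files = glob('/path/to/xml/*.xml');
foreach ($files as $file) {
    $xml = file_get_contents($file); // read the whole file into memory
}
printf("Loaded %d files in %.1f seconds\n", count($files), microtime(true) - $start);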
So, how do you do this faster? Concurrency.
I split the task into 4 chunks of 25,000 XML files each, which cut the time to load the files into memory (sequentially within each chunk) to about 30 seconds. Again, this is only the time to load the XML into memory, so any further processing of the XML will require more processing power or more time.
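A minimal sketch of that split, assuming the pcntl extension (CLI only) and a placeholder directory:

<?php
// Fork one worker process per chunk; each child loads its share of the files.
$files  = glob('/path/to/xml/*.xml');
$chunks = array_chunk($files, max(1, (int)ceil(count($files) / 4)));
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) { // child: load this chunk, then exit
        foreach ($chunk as $file) {
            $xml = file_get_contents($file);
            // ...parse $xml here...
        }
        exit(0);
    }
}
// Parent: wait for all children to finish.
while (pcntl_wait($status) > 0);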
Now, how do you scale this? Enter Gearman. Gearman lets you run tasks in parallel by distributing work to workers through a central job server. You can even register a pool of workers on different servers to run your tasks. I don't think you need a supercomputer at all; you just define all the jobs once and let the workers do the work (asynchronously?).
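A bare-bones sketch of that setup using PHP's gearman extension; the job name parse_xml, the paths, and the server address are illustrative assumptions:

<?php
// client.php -- submit one background (fire-and-forget) job per XML file
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // default gearmand port
foreach (glob('/path/to/xml/*.xml') as $file) {
    $client->doBackground('parse_xml', $file); // workload = file path
}

<?php
// worker.php -- run several copies, on one machine or many
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('parse_xml', function (GearmanJob $job) {
    $xml = file_get_contents($job->workload());
    // ...parse and store $xml here...
});
while ($worker->work());

With gearmand running, start a few workers and then run the client; the jobs queue up and are processed in parallel across however many workers you start.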