Question

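I am parsing an HTML file, grabbing the contents of the pre tag, and saving it to a text file.

However, when I open the text file in Sublime or another text editor, the tab layout is gone. My question: how do I save the text to the txt file in its original state?

The content of the pre is below: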

          x4                                          x4
|---------------------|-|-------------------|--------------------|
|---------------------|-|-------------------|--------------------|
|----------2-0-0------|-|-------------------|--------------------|
|----------------1-0-0|-|-------------------|--------------------|
|3-0-1-3-0------------|0|1-3-1-3-1-3-1-0----|1-3-1-3-1-3-1-0---0-|


           x4                                    x4
|------------------------|-------------|-------------------|
|------------------------|-------------|-------------------|
|------------------------|-------------|-------------------|
|------------------------|-------------|0--0033------------|
|1-3-1-3-1-3-1-0--0000--0|1-3-1-3-1-3-1|--------333~-335-0-|


            x4                      x4
|------------------------|---------------------|-|-------------|
|------------------------|---------------------|-|-------------|
|------------------------|----------2-0-0------|-|-------------|
|------------------------|----------------1-0-0|-|-------------|
|0--0000--0-1-3-1-3-1-3-1|3-0-1-3-0------------|0|1-3-1-3-1-3-1|

My code:

<?php
     // example of how to use basic selector to retrieve HTML contents
     include('simple_html_dom.php');

     // get DOM from URL or file
     $html = file_get_html('http://metaltabs.com/tab/10464/index.html');

     foreach($html->find('title') as $e)
       echo $e->innertext . '<br>';
       $my_file = fopen("textfile.txt", "w") or die("Unable to open file!");


    foreach($html->find('pre') as $e)
       echo nl2br($e->innertext) . '<br>';
       $txt = $e->innertext;
       fwrite($my_file, $txt);
       fclose($my_file);
?>

Answer 1

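The problems with your parse result are:

Line breaks are not preserved;
HTML entities are preserved.

To solve the line-break problem, you have to use ->load() instead of file_get_html: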

$html = new simple_html_dom();
$data = file_get_contents( 'http://metaltabs.com/tab/10464/index.html' );

$html->load( $data , True, False );
/*                   └─┬┘  └─┬─┘
       Optional parameter  Optional parameter
                lowercase  Strip \r\n
*/

To solve the entities problem, you can use the PHP function `html_entity_decode`:

$txt = html_entity_decode( $e->innertext );

The result looks like this:

Tuning E A D G B E

|------------------------------------------------------------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|-------<7-8>----------<10-11>---------<7-8>---7--10--8--11--|x9
|-0000-----------0000------------0000----------0-------------|
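
To illustrate the entity step in isolation, here is a minimal sketch that needs neither the network nor simple_html_dom. The sample string is made up for the example; it only imitates what innertext might return for a tab line whose angle brackets are still encoded.

```php
<?php
// Hypothetical sample: two tab lines as they might appear in innertext,
// with the angle brackets still encoded as HTML entities.
$raw = "|-------&lt;7-8&gt;----------&lt;10-11&gt;---|x9\n|-0000-----------0000----|";

// html_entity_decode() turns &lt; and &gt; back into < and >.
// The line break, dashes and digits are left untouched.
$txt = html_entity_decode($raw);

echo $txt, PHP_EOL;
```

The decoded string can then be passed to fwrite() exactly as in the question's code.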

Answer 2

I tried this code and opened the result in Sublime Text; the text file kept the same formatting as on your site:

$html = file_get_contents("http://metaltabs.com/tab/4086/index.html");

$dom = new DOMDocument('1.0', 'utf-8');
// preserve white space (set this before loading, it has no effect afterwards)
$dom->preserveWhiteSpace = true;
// load the html into the object
$dom->loadHTML($html);
$pre = $dom->getElementsByTagName('pre');

$file = fopen('text.txt', 'w');
fwrite($file, $pre->item(0)->nodeValue);
fclose($file);

This assumes you are sure there is only one pre tag on the page; otherwise you have to loop over the $pre variable.
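
If the page has more than one pre tag, the loop could look roughly like the sketch below. The helper name extractPreBlocks, the sample HTML string and the temporary output file are all assumptions made for the example; they are not part of the code above. The sample is inlined so the snippet runs without fetching the site.

```php
<?php
// Hypothetical helper: returns the text of every <pre> element
// in an HTML string, in document order.
function extractPreBlocks(string $html): array
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $blocks = [];
    foreach ($dom->getElementsByTagName('pre') as $pre) {
        // textContent keeps spaces and line breaks; entities arrive decoded.
        $blocks[] = $pre->textContent;
    }
    return $blocks;
}

// Made-up sample page with two <pre> blocks.
$sample = "<html><body><pre>   x4\n|--2-0-0--|\n|-&lt;7-8&gt;-|</pre>"
        . "<p>ignored</p><pre>second\n  block</pre></body></html>";

$blocks = extractPreBlocks($sample);

// Separate the blocks with a blank line and write them in one call.
// The temporary file is used only for this demo.
$path = tempnam(sys_get_temp_dir(), 'tab');
file_put_contents($path, implode("\n\n", $blocks));
```

Joining the blocks before writing keeps the file handling to a single call; writing each block with fwrite() inside the loop would work just as well.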

PHP: parse HTML, get the PRE text and save it to a file
