Question

I am receiving an XML document like this:

<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item>
    ...

The second line seems to prevent me from parsing the file.

Using Tidy

<?php
$config = array(
       'indent'     => true,
       'input-xml'  => true,
       'output-xml' => true,
       'wrap'       => false);
$tidy = new tidy;
$tidy->parseFile('https://website.com/path/to/XML.ashx?param=12345', $config);
$tidy->cleanRepair();

print_r($tidy);
?>

results in:

tidy Object
(
    [errorBuffer] => 
    [value] => <?xml version="1.0" encoding="utf-8"?>

)

Using simplexml_load_file()

<?php
$xml = simplexml_load_file('https://website.com/path/to/XML.ashx?param=12345');
print_r($xml);
?>

Output:

**Warning**: simplexml_load_file(): https://website.com/path/to/XML.ashx?param=12345:1: parser error : Start tag expected, '<' not found in **C:\xampp\htdocs\local\php\script.php** on line 2

**Warning**: simplexml_load_file(): <?xml version="1.0" encoding="utf-8" ?> in **C:\xampp\htdocs\local\php\script.php** on line 2

**Warning**: simplexml_load_file(): ^ in **C:\xampp\htdocs\local\php\script.php** on line 2

I have also tried various cURL options, as well as plain file_get_contents().

My question is: what is the second line of this XML, and how can I parse this file?

Answer 1

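XML does not allow non-whitespace text nodes after the XML declaration. So what you have is invalid XML, and that is exactly what the libraries are telling you. However, Tidy (the release of 25 March 2009) can fix this: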

$buffer = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item></item> </items>';

$config = array(
    'indent'     => true,
    'input-xml'  => true,
    'output-xml' => true,
    'wrap'       => false);
$tidy = new tidy;
$tidy->parseString($buffer, $config);
$tidy->cleanRepair();

print_r($tidy);

Output:

tidy Object
(
    [errorBuffer] => line 2 column 1 - Warning: discarding unexpected plain text
    [value] => <?xml version="1.0" encoding="utf-8"?>
<items>
  <item></item>
</items>
)

So you most likely have further problems with that "XML" (or, if the line is really large, you are hitting a buffer limit).
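Since the question asks how to actually parse the file, here is a minimal sketch that hands Tidy's repaired output to SimpleXML. The helper name `loadViaTidy` is hypothetical, the sample string is the one used above, and the sketch assumes the optional tidy extension is installed; it returns null otherwise.

```php
<?php
// Sample input from above: XML declaration, stray CSS text, then the actual XML.
$buffer = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item></item> </items>';

/**
 * Hypothetical helper: repair the string with Tidy, then parse it with SimpleXML.
 * Returns null if ext/tidy is missing or the repaired output still does not parse.
 */
function loadViaTidy(string $raw): ?SimpleXMLElement
{
    if (!extension_loaded('tidy')) {
        return null; // Tidy is an optional PHP extension
    }

    // Same options as in the answer, minus the cosmetic 'indent'.
    $config = [
        'input-xml'  => true,
        'output-xml' => true,
        'wrap'       => false,
    ];

    $repaired = tidy_repair_string($raw, $config, 'utf8');
    if (!is_string($repaired) || $repaired === '') {
        return null;
    }

    $xml = simplexml_load_string($repaired);
    return $xml === false ? null : $xml;
}

$items = loadViaTidy($buffer);
if ($items !== null) {
    echo $items->getName(), ' contains ', count($items->item), " item element(s)\n";
}
```

For the remote document you would first fetch the body into a string (for example with file_get_contents()) and check that it is not empty before passing it to the helper.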

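Since this is not XML, you might ask yourself what it is. It is CSS, and what you have there is a so-called at-rule, more specifically a CSS namespace declaration. (According to earlier CSS specifications, browsers (user agents) do not have to support any of these. Even the current Selectors API requires that any namespace prefix needing resolution raise an exception in the API. A good example of using CSS namespaces with XML (XHTML) documents is in this earlier answer.)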

In your block of text, the at-rule keyword is followed by the namespace prefix and URL, and then by the CSS beneath it.

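So what you have is a mix of different kinds of data. It will not parse as valid XML, and you will not find any common browser that can actually deal with the CSS either, even if it would validate, because nothing indicates that this text is CSS (it would need to be wrapped in an element that denotes a stylesheet).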
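If Tidy is not available, the mix can also be separated by hand. The following is only a sketch of a different technique (a regular expression, not Tidy): the hypothetical helper `stripTextAfterDeclaration` drops everything between the XML declaration and the first `<`. It assumes the stray text itself contains no `<` character, which may not hold for arbitrary CSS.

```php
<?php
// Sample input from the answer above.
$buffer = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item></item> </items>';

/**
 * Hypothetical helper: remove any text between the XML declaration and the first tag.
 * Input without an XML declaration is returned unchanged.
 */
function stripTextAfterDeclaration(string $raw): string
{
    // Group 1 captures the declaration; [^<]* swallows the stray text up to the next "<".
    $clean = preg_replace('/\A\s*(<\?xml\b.*?\?>)[^<]*/s', '$1' . "\n", $raw, 1);

    return $clean ?? $raw; // preg_replace() returns null on a PCRE error
}

$clean = stripTextAfterDeclaration($buffer);
$xml   = simplexml_load_string($clean);

if ($xml !== false) {
    echo $xml->getName(), ' contains ', count($xml->item), " item element(s)\n";
}
```

The regular expression is deliberately narrow: it only touches the region directly after the declaration and leaves the rest of the document to the XML parser.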

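Side note: a proper CSS parser would discard the XML because it is invalid, and the CSS specification says that anything invalid has to be dropped. So what you have there, taken as a whole, could technically be a conforming CSS document. You thought it was XML, but it is just CSS ;)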

As odd as this @-rule may sound, it actually is not. It exists, just not in a place like this.

On the other hand, masking the source as website.com does not help much. Seeing the real site might provide more context and allow a more specific answer.

Answer 2

It is not XML, it is XHTML (Extensible HyperText Markup Language). The line below declares the html namespace, and CSS styles follow it. So your browser will understand it, but an XML parser will not.

@namespace html url(http://www.w3.org/1999/xhtml);

XHTML is meant to conform to XML, but it looks like this page may not be strict XHTML and therefore cannot be parsed as XML.
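To see where an XML parser gives up on such input, the libxml errors can be collected instead of being emitted as warnings. This is a minimal sketch using the sample string from the first answer; the exact messages depend on the libxml version, so none are shown here.

```php
<?php
// Sample input from the first answer: the second line is not XML.
$buffer = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item></item> </items>';

// Collect parser errors internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$xml    = simplexml_load_string($buffer);
$errors = libxml_get_errors();

if ($xml === false) {
    foreach ($errors as $error) {
        // Each entry is a LibXMLError with line, column and message properties.
        printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
    }
}

// Reset the error buffer so later parses start clean.
libxml_clear_errors();
```

Logging these entries in the real script makes it easier to tell a malformed response apart from a network or encoding problem.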

Strange "@namespace" and style information in an XML document
