How to extract untagged text from an HTML string using PHP

Time: 2014-11-24 12:40:57

Tags: php

Consider the following HTML:

<strong>title</strong>
Hello World
<strong>Sub-Title</strong>
<div>This is just stuff</div>

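How can I clean this string so that it returns only the text that is not wrapped in any tag, i.e. "Hello World"? I assume this is a job for the DOM. I would prefer a non-regex answer, if anyone has one, that does not use JavaScript or jQuery.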

[Edit] The code fails on the following input:

<span style="color: #677b8d"><strong>Short Description</strong><br/>Microsoft Office Home and Business 2013-Word, Excel, PowerPoint, OneNote and Outlook(Does not include Publisher or Access), DSP , No Warranty on Software <br/><br/><strong>Long description<br/></strong><div>Microsoft Office Home and Business 2013 32-bit/x64 DSP No Warranty on Software </div> <font face="Arial"> <div><br/><strong>Product Overview </strong> <div><font face="Arial">The New Microsoft Office Home &amp; Business 2013 is designed to help you create and communicate faster with new, time-saving features and a clean, modern look. Plus, you can save your documents in the cloud on SkyDrive and access them virtually anywhere. The latest versions of Word, Excel, PowerPoint, OneNote plus Outlook on 1 PC.</font></div> </div> <div><strong><br/> Features<br/></strong><font face="Arial">•One time purchase for the life of your PC; non-transferrable.<br/> •Office on one PC for business and household use.<br/> •The latest versions of Word, Excel, PowerPoint, OneNote, and Outlook.<br/> •7 GB of online storage in SkyDrive.<br/> •Free Office Web Apps* for accessing, editing, and sharing documents.<br/> •An improved user interface optimized for touch, pen, and keyboard.</font> <div> </div> <div><font face="Arial"><strong>Specifications<br/></strong>Operating System Windows <br/> Office/Productivity Software Office Suites &amp; Tools <br/> Purchase Method Boxed <br/> Users/Devices per License 1-User <br/></font></div> </div> <div><font face="Arial"><strong>System Requirements:<br/></strong>Computer and Processor 1 GHz or faster x86 or 64-bit processor with SSE2 instruction set</font></div> <div> <p><font face="Arial"><strong>Memory<br/></strong>1 GB RAM (32-bit); 2 GB RAM (64-bit) recommended for graphics features, Outlook Instant Search, and certain advanced functionality**</font></p> <p><font face="Arial"><strong>Hard Disk<br/></strong>3.0 GB available disk space</font></p> <p><font 
face="Arial"><strong>Display<br/></strong>1366 x 768 resolution</font></p> <p><font face="Arial"><strong>Operating System<br/></strong>Windows® 7, Windows 8, Windows Server 2008 R2 with .NET 3.5 or later</font></p> <p><font face="Arial"><strong>Graphics<br/></strong>Graphics hardware acceleration requires a DirectX10 graphics card</font></p> <p><font face="Arial"><strong>Additional Requirements<br/></strong>Internet connection. Fees may apply.</font></p> <p><font face="Arial">Microsoft Internet Explorer 8, 9, or 10; Mozilla Firefox 10.x or a later version; Apple Safari 5; or Google Chrome 17.x.</font></p> <p><font face="Arial">A touch-enabled device is required to use any multi-touch functionality. However, all features and functionality are always available by using a keyboard, mouse, or other standard or accessible input device. New touch features are optimized for use with Windows 8.</font></p> <p><font face="Arial">Information Rights Management features require access to a Windows 2003 Server with SP1 or later running Windows Rights Management Services.</font></p> <p><font face="Arial">Microsoft and Skype accounts.</font></p> <p><font face="Arial"><strong>Other<br/></strong>Product functionality and graphics may vary based on your system configuration. Some features may require additional or advanced hardware or server connectivity.</font></p> <p><font face="Arial">*An appropriate device, Internet connection and Internet Explorer, Firefox or Safari browser are required.<br/> **512 MB RAM recommended for accessing Outlook data files larger than 1GB<br/></font></p> </div> </font></span>

1 Answer:

Answer 0 (score: 1)

I suggest wrapping the markup in a somewhat exotic tag that is certain not to appear in the markup itself, for example:

 $a="<body><strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div></body>";

Then use the DOM:

$doc = new DOMDocument();
$doc->loadHTML($a);
$xpath = new DOMXPath($doc);
$textnodes = $xpath->evaluate("//body/text()[not(normalize-space() = '')]");

Now you can get whatever you need from the nodes:

foreach( $textnodes as $el ) {
  print_r($el);
}

/*
DOMText Object
(
    [wholeText] => 
Hello World

    [data] => 
Hello World

    [length] => 13
    [nodeName] => #text
    [nodeValue] => 
Hello World

    [nodeType] => 3
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
Hello World
*/
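
The steps above can be combined into one small function. This is only a sketch: the name `extractTopLevelText` is hypothetical and not part of the original answer, and it assumes the input is an HTML fragment that does not already contain its own `<body>` element. It wraps the fragment, loads it with `DOMDocument`, and returns the trimmed, non-empty text nodes that are direct children of the wrapper.

```php
<?php
// Hypothetical helper (not from the original answer): returns the trimmed
// text of every non-empty text node that is a direct child of <body>.
function extractTopLevelText(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them,
    // since real-world markup is often not well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Double-quoted PHP string so the XPath literal '' needs no escaping.
    $nodes = $xpath->query("//body/text()[normalize-space() != '']");

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$html = "<strong>title</strong>\nHello World\n<strong>Sub-Title</strong>\n<div>This is just stuff</div>";
$result = extractTopLevelText($html);
print_r($result);
```

Note that `//body/text()` matches only text that sits directly under the wrapper. Text nested inside other elements, as in the longer markup from the question's edit, is not matched by this expression and would need a broader XPath query.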