Question

I'm having trouble parsing words out of an HTML table. I need to separate the words (the "Lemma" column) from the rest of the content:

Original version in Russian - http://hsu.su/st2

English (Google Translate) - http://hsu.su/155

I've heard of PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/), but I can't figure out how to use it to solve this problem.

Answer 1

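<?php
    // Load the Simple HTML DOM library (adjust the path to where you copied it).
    include_once('simplehtmldom/simple_html_dom.php');
    $html = file_get_html('http://dict.ruslang.ru/freq.php?act=show&dic=freq_news_comp&title=%D1%EB%EE%E2%E0%F0%FC%20%E7%ED%E0%F7%E8%EC%EE%E9%20%E3%E0%E7%E5%F2%ED%EE-%ED%EE%E2%EE%F1%F2%ED%EE%E9%20%EB%E5%EA%F1%E8%EA%E8');

    // Open the output file for writing.
    $myFile = "file.txt";
    $fh = fopen($myFile, 'w') or die("can't open file");

    // Take the second table on the page (index 1) and write the text of each cell.
    // PHP_EOL puts every cell on its own line so the words do not run together.
    $table = $html->find('table', 1);
    foreach ($table->find('td') as $td) {
        fwrite($fh, $td->plaintext . PHP_EOL);
    }

    fclose($fh);
?>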

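Download simplehtmldom from the same link you provided.

Copy it into the same folder as the script.

Make sure the include path in the code points to the correct class file.

Put file.txt in the same folder (fopen() with mode 'w' will create it if it does not exist, as long as the folder is writable).

Then run the code.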

The output will contain extra

 '&nbsp;'

entities, which you can remove with PHP's string functions.
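
As a minimal sketch of that cleanup, assuming the cell text contains the literal entity `&nbsp;` as described above: `clean_cell` is a hypothetical helper name, not part of Simple HTML DOM, and the sample input is made up for illustration.

```php
<?php
// Hypothetical helper: normalizes the text of a single table cell.
function clean_cell(string $text): string
{
    // Replace each literal &nbsp; entity with a plain space.
    $text = str_replace('&nbsp;', ' ', $text);

    // Remove leading and trailing whitespace left over from the replacement.
    return trim($text);
}

// Made-up sample input resembling a cell's plaintext.
echo clean_cell('&nbsp;word&nbsp;'), PHP_EOL;
```

In the loop from the answer, this would be applied to `$td->plaintext` before the `fwrite()` call.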

Answer 2

Take a look at the PHP function strip_tags().
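
A rough sketch of how that might look, assuming you already have a table row as a string. The `$row` markup below is made-up sample data, not taken from the linked page. Because strip_tags() removes tags without leaving any separator, a tab is inserted at each cell boundary first.

```php
<?php
// Made-up markup standing in for one table row.
$row = '<tr><td>1</td><td><b>word</b></td><td>42.5</td></tr>';

// Mark each cell boundary with a tab, then strip all remaining tags.
$text = strip_tags(str_replace('</td>', "\t", $row));

// Drop the trailing tab and split the row back into its cells.
$cells = explode("\t", rtrim($text, "\t"));

print_r($cells);
```

Note that this works on the raw HTML string only. For picking out one specific column of a large table, a DOM parser such as the one in Answer 1 is likely the more robust choice.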

Separate words from an HTML table and save them in a txt file
