How to filter results? Html Dom Parser

Date: 2017-12-14 15:56:43

Tags: php html parsing dom screen-scraping

I have the following code:

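<?php
    include('simple_html_dom.php');
    // Fetch the Google results page for the flight number
    $html = file_get_html('http://www.google.com/search?q=BA236', false);
    // Collect every div whose class is "g" (one per search result)
    $e = $html->find("div[class=g]");
    echo $e[0]->innertext;
?>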

When I run it, I get the first div of class g from the Google search results, which is:

British Airways Flight 236

Scheduled   departs in 13 hours 13 mins

Departure   DME 5:40 AM     —

Moscow  Dec 15

Arrival LHR 6:55 AM     Terminal 5

London  Dec 15

Scheduled   departs in 1 day 13 hours

Departure   DME 5:40 AM     —

Moscow  Dec 16

Arrival LHR 6:55 AM     Terminal 5

London  Dec 16

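My problem is that I don't need all of this information, and I don't know how to filter this echo output, because the HTML code has no ids or classes. I thought of hiding the HTML I don't need with jQuery or plain CSS, but that runs into the same problem: I have no ids or classes to target.

So how can I filter out the information I don't want? Please show me an example, and I'll work out which HTML I need to remove myself. Thanks.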

1 Answer:

Answer 0: (score: 0)

What you are looking for is called a grep tool (or a regular expression). For a possible answer, see the SO question "PHP to search within txt file and echo the whole line". Slightly modified for your application:

    $contents = "British Airways Flight 236\n\nScheduled departs in 13 hours 13 mins\n\nDeparture DME 5:40 AM —\n\nMoscow Dec 15\n\n...";
    $searchfor = 'departs';

    // escape any regex metacharacters in the search term
    $pattern = preg_quote($searchfor, '/');
    // finalise the regular expression, matching the whole line
    $pattern = "/^.*$pattern.*\$/m";
    // search, and store all matching occurrences in $matches
    if (preg_match_all($pattern, $contents, $matches)) {
        echo "Found matches:\n";
        echo implode("\n", $matches[0]);
    } else {
        echo "No matches found";
    }

Edit

Alternatively, as mentioned in the comments, use an approach that preserves the HTML structure, which makes the result easier to parse.
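As a rough illustration of filtering while the HTML structure is still intact, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom. The sample markup, the table layout, and the helper name rowsContaining are assumptions made for this example only; the markup Google actually returns will differ, so the XPath expression would need adjusting after inspecting the real page.

```php
<?php
// Hypothetical markup standing in for the first result block.
// The real page structure is unknown and will differ.
$html = <<<HTML
<div class="g">
  <h3>British Airways Flight 236</h3>
  <table>
    <tr><td>Scheduled</td><td>departs in 13 hours 13 mins</td></tr>
    <tr><td>Departure</td><td>DME 5:40 AM</td></tr>
    <tr><td>Arrival</td><td>LHR 6:55 AM</td></tr>
  </table>
</div>
HTML;

// Hypothetical helper: returns the text of every table row inside
// div.g whose cell text contains $needle (case-sensitive).
function rowsContaining(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[@class="g"]//tr') as $tr) {
        // Join the row's cells with single spaces.
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $text = implode(' ', $cells);
        if (strpos($text, $needle) !== false) {
            $rows[] = $text;
        }
    }
    return $rows;
}

$matches = rowsContaining($html, 'departs');
echo implode("\n", $matches), "\n";
```

On the sample markup above this prints only the "Scheduled" row. Filtering by element means no ids or classes are needed on the inner HTML: the position of a node in the tree (a row inside the result div) is enough to select it.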