I have a school project I'm working on; the end result may seem pointless, but I believe it's more about the experience. What I want it to do is take an initial URL, pull all the URLs from that page, and then visit them in order until I tell it to stop, logging every URL to a text file.

So far I can open a window in IE and launch a web page of my choosing. Now I need to know how to send IE to a new web page using the same session, and how to scan and pull data from the sites I visit. Thanks for any help!
Here is my code so far:
#include <string>
#include <iostream>
#include <windows.h>
#include <stdio.h>
#include <string.h> // strcpy

using namespace std;

int main(int argc, char *argv[])
{
    std::string uRL, prog;

    STARTUPINFOA si;
    PROCESS_INFORMATION pi;

    ZeroMemory(&si, sizeof(si));
    si.cb = sizeof(si);
    ZeroMemory(&pi, sizeof(pi));

    //if (argc != 2)
    //{
    //    printf("Usage: %s [cmdline]\n", argv[0]);
    //    system("PAUSE");
    //    return 0;
    //}

    std::cout << "Enter URL: ";
    std::cin >> uRL;
    prog = "C:\\Program Files\\Internet Explorer\\iexplore.exe " + uRL;

    // CreateProcess may modify the command-line buffer, so pass a writable
    // copy rather than prog.c_str() directly.
    char *cstr = new char[prog.length() + 1];
    strcpy(cstr, prog.c_str());

    // Start the child process. CreateProcessA is used explicitly because
    // cstr is a char buffer; _T() only works on string literals, so
    // _T(cstr) would not compile in a Unicode build.
    if (!CreateProcessA(NULL, // No module name (use command line)
        cstr,                 // Command line
        NULL,                 // Process handle not inheritable
        NULL,                 // Thread handle not inheritable
        FALSE,                // Set handle inheritance to FALSE
        0,                    // No creation flags
        NULL,                 // Use parent's environment block
        NULL,                 // Use parent's starting directory
        &si,                  // Pointer to STARTUPINFO structure
        &pi))                 // Pointer to PROCESS_INFORMATION structure
    {
        printf("CreateProcess failed (%lu).\n", GetLastError());
        system("PAUSE");
        return 0;
    }

    //cout << HRESULT get_Count(long *Count) << endl; // stray snippet, does not compile
    //cout << count << endl;

    system("PAUSE");

    // Wait until child process exits.
    WaitForSingleObject(pi.hProcess, INFINITE);

    // Close process and thread handles.
    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    delete[] cstr;
    return 0;
}
Answer 0 (score: 1)
If you want to crawl web pages, launching Internet Explorer as a separate process isn't going to work well. I also don't recommend parsing the HTML yourself unless you're prepared for a lot of heartache and trouble. Instead, I suggest you create an instance of an IWebBrowser2 object, use it to navigate to the page, grab the corresponding IHTMLDocument2 object, and iterate over its elements, picking out the URLs. It's much easier, and it's a common way of using components that are already installed on Windows. The example below should get you started and crawling like a proper spider.
#include <comutil.h>  // _variant_t
#include <mshtml.h>   // IHTMLDocument2 and IHTMLElement
#include <exdisp.h>   // IWebBrowser2
#include <atlbase.h>  // CComPtr
#include <string>
#include <iostream>
#include <vector>

// Make sure we link in the support library!
#pragma comment(lib, "comsuppw.lib")

// Load a webpage
HRESULT LoadWebpage(
    const CComBSTR& webpageURL,
    CComPtr<IWebBrowser2>& browser,
    CComPtr<IHTMLDocument2>& document)
{
    HRESULT hr;
    VARIANT empty;
    VariantInit(&empty);

    // Navigate to the specified webpage
    hr = browser->Navigate(webpageURL, &empty, &empty, &empty, &empty);

    // Wait for the load.
    if(SUCCEEDED(hr))
    {
        READYSTATE state;
        while(SUCCEEDED(hr = browser->get_ReadyState(&state)))
        {
            if(state == READYSTATE_COMPLETE) break;
            Sleep(100); // yield instead of spinning a full core while IE loads
        }
    }

    // The browser now has a document object. Grab it.
    if(SUCCEEDED(hr))
    {
        CComPtr<IDispatch> dispatch;
        hr = browser->get_Document(&dispatch);
        if(SUCCEEDED(hr) && dispatch != NULL)
        {
            hr = dispatch.QueryInterface<IHTMLDocument2>(&document);
        }
        else
        {
            hr = E_FAIL;
        }
    }
    return hr;
}
void CrawlWebsite(const CComBSTR& webpage, std::vector<std::wstring>& urlList)
{
    HRESULT hr;

    // Create a browser object
    CComPtr<IWebBrowser2> browser;
    hr = CoCreateInstance(
        CLSID_InternetExplorer,
        NULL,
        CLSCTX_SERVER,
        IID_IWebBrowser2,
        reinterpret_cast<void**>(&browser));

    // Grab a web page
    CComPtr<IHTMLDocument2> document;
    if(SUCCEEDED(hr))
    {
        // Make sure these two items are scoped so CoUninitialize doesn't
        // gump us up.
        hr = LoadWebpage(webpage, browser, document);
    }

    // Grab all the anchors!
    if(SUCCEEDED(hr))
    {
        CComPtr<IHTMLElementCollection> urls;
        long count = 0;

        hr = document->get_all(&urls);
        if(SUCCEEDED(hr))
        {
            hr = urls->get_length(&count);
        }
        if(SUCCEEDED(hr))
        {
            for(long i = 0; i < count; i++)
            {
                CComPtr<IDispatch> element;
                CComPtr<IHTMLAnchorElement> anchor;

                // Get an IDispatch interface for the next option.
                _variant_t index = i;
                hr = urls->item(index, index, &element);
                if(SUCCEEDED(hr))
                {
                    hr = element->QueryInterface(
                        IID_IHTMLAnchorElement,
                        reinterpret_cast<void **>(&anchor));
                }
                if(SUCCEEDED(hr) && anchor != NULL)
                {
                    CComBSTR url;
                    hr = anchor->get_href(&url);
                    if(SUCCEEDED(hr) && url != NULL)
                    {
                        urlList.push_back(std::wstring(url));
                    }
                }
            }
        }
    }

    // Shut the browser instance down so an IE process doesn't linger
    // after every crawl.
    if(browser != NULL)
    {
        browser->Quit();
    }
}
int main()
{
    HRESULT hr;
    hr = CoInitialize(NULL);
    if(FAILED(hr)) return 1;

    std::vector<std::wstring> urls;
    CComBSTR webpage(L"http://cppreference.com");

    CrawlWebsite(webpage, urls);
    for(std::vector<std::wstring>::iterator it = urls.begin();
        it != urls.end();
        ++it)
    {
        std::wcout << "URL: " << *it << std::endl;
    }
    CoUninitialize();
    return 0;
}
Answer 1 (score: 0)
To scan and extract data from a website, you need to capture the HTML and walk through it, looking for every character sequence that matches a particular pattern. Have you ever used regular expressions? Regular expressions would be by far the best fit here, but if you understand them (just look up a tutorial on the basics) you can also apply the same pattern-matching ideas to this project by hand.

So you're looking for something like http(s):// ... but it's more complicated than that, because domain names are a fairly involved pattern. You'll probably want to use a third-party HTML parser or a regex library; it's doable without one, but the programming gets very tedious.

Here's a link about regular expressions in C++: http://www.johndcook.com/cpp_regex.html
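For illustration, here is a minimal sketch of that idea using the C++11 <regex> header (a standard-library alternative to the libraries covered in the linked article). The pattern is deliberately simplified and will both miss and over-match some real-world URLs, and the hard-coded HTML string stands in for whatever you use to fetch pages.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Stand-in for a fetched page; real HTML is much messier.
    std::string html =
        "<a href=\"http://example.com/page1\">one</a>"
        "<a href='https://example.com/page2'>two</a>";

    // Deliberately simplified: match http(s)://... up to a quote,
    // angle bracket, or space.
    std::regex urlPattern("https?://[^\"'<> ]+");

    auto begin = std::sregex_iterator(html.begin(), html.end(), urlPattern);
    auto end = std::sregex_iterator();
    for(auto it = begin; it != end; ++it)
    {
        std::cout << it->str() << '\n';
    }
    return 0;
}

This prints the two embedded URLs, one per line.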