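I have to parse the URLs out of an HTML page and return them as a list, so that the parsing function can then be applied recursively. I am using BeautifulSoup on Mac OS, and I have a problem importing html_parser.py into my C program.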
html_parser.py:
#!/usr/local/bin/python2.7
from bs4 import BeautifulSoup
import urllib2

def link_list(urlString):
    siteFile = urllib2.urlopen(urlString)
    siteHTML = siteFile.read()
    siteFile.close()
    soup = BeautifulSoup(siteHTML, "html.parser")
    liste = []
    for links in soup.find_all('a'):
        print(links.get('href'))
        liste.append(links.get('href'))
    return liste
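A minimal sketch of how the recursive use of such a link list could look. The names `crawl`, `fetch_links`, `max_depth` and `seen` are hypothetical and not part of html_parser.py; `fetch_links` stands in for a function like link_list, and here it is replaced by a small in-memory dictionary so the sketch runs without network access. Relative hrefs are not resolved in this sketch.

```python
def crawl(url, fetch_links, max_depth=2, seen=None):
    """Collect URLs reachable from url, following links at most max_depth levels deep."""
    if seen is None:
        seen = set()
    # Stop when the depth budget is used up or the URL was already visited.
    if max_depth < 0 or url in seen:
        return seen
    seen.add(url)
    for link in fetch_links(url):
        # An <a> tag without an href yields None, so skip empty entries.
        if link:
            crawl(link, fetch_links, max_depth - 1, seen)
    return seen


# Hypothetical stand-in for real pages: each key maps to the hrefs found on it.
fake_pages = {
    "a": ["b", "c", None],
    "b": ["a", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

result = crawl("a", lambda u: fake_pages.get(u, []), max_depth=2)
print(sorted(result))
```

With the real parser, the call would be something like `crawl(start_url, link_list)`, where `start_url` is whatever page the crawl begins at.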
pars.c:
#include <stdio.h>
#include <Python.h>

int main() {
    Py_Initialize();

    /* 1st: Import the module */
    PyRun_SimpleString("from bs4 import BeautifulSoup\n");
    PySys_SetPath(".");
    PyObject* moduleString = PyString_FromString((char*) "html_parser");
    if (!moduleString) {
        PyErr_Print();
        printf("Error formatting python script\n");
    }
    PyObject* module = PyImport_Import(moduleString);
    if (!module) {
        PyErr_Print();
        printf("Error importing python script\n");
    }

    /* 2nd: Getting reference to the function */
    PyObject* function = PyObject_GetAttrString(module, (char*)"link_list");
    if (!function) {
        PyErr_Print();
        printf("Pass valid argument to link_list()\n");
    }

    Py_Finalize();
    return 0;
}
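I need to call PySys_SetPath(".") to set the Python path to my working directory. But after doing that, bs4 is no longer recognized, so I run PyRun_SimpleString("from bs4 import BeautifulSoup\n") before changing the path. However, when I try to do the same for urllib2 with PyRun_SimpleString("import urllib2\n"), I get this error: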
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 94, in <module>
    import httplib
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 80, in <module>
    import mimetools
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/mimetools.py", line 6, in <module>
    import tempfile
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tempfile.py", line 32, in <module>
    import io as _io
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/io.py", line 51, in <module>
    import _io
ImportError: dlopen(/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_io.so, 2): Symbol not found: __PyCodecInfo_GetIncrementalDecoder
  Referenced from: /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_io.so
  Expected in: flat namespace
 in /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload/_io.so
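Someone helped me with this before, but at that time my Python program was named parser.py, and when PySys_SetPath is not used, C picks up another file of that name from the default Python path. So the interpreter embedded through Python.h recognizes neither bs4 nor urllib2.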
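On the PySys_SetPath(".") step: that call replaces the whole module search path with the given string rather than adding to it, which is consistent with bs4 no longer being found afterwards. A minimal sketch of the alternative, written in Python, is to put the working directory in front of the existing entries instead. The helper name `add_working_dir` is hypothetical, and whether this also resolves the dlopen error above is not established here.

```python
import os
import sys


def add_working_dir(path_list, directory="."):
    """Insert directory at the front of path_list unless it is already present."""
    abs_dir = os.path.abspath(directory)
    if abs_dir not in path_list:
        path_list.insert(0, abs_dir)
    return path_list


# Applied to the real search path: the existing entries (standard library,
# site-packages) stay in place and the working directory is searched first.
before = list(sys.path)
add_working_dir(sys.path)
```

From C, the same effect could be obtained by running the equivalent statements through PyRun_SimpleString, for example PyRun_SimpleString("import sys; sys.path.insert(0, '.')"), in place of the PySys_SetPath(".") call.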
Edit:
I just checked my Python version with PyRun_SimpleString("print(sys.version)") and got this:
2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]
So I am on Python 2.7.10 and not Python 3, which means I do not have to use the urllib.request module...
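The version printed above is 2.7.10, while the paths in the traceback belong to a 2.7.13 installation under /usr/local/Cellar. A small sketch for seeing which installation the embedded interpreter actually uses and where it searches for modules. `interpreter_report` is a hypothetical helper name, and the snippet only reports values without changing anything.

```python
import sys


def interpreter_report():
    """Return the running interpreter's version, install prefix and search path."""
    return {
        "version": sys.version.split()[0],
        "prefix": sys.prefix,
        "path": list(sys.path),
    }


report = interpreter_report()
print(report["version"])
print(report["prefix"])
for entry in report["path"]:
    print(entry)
```

If the prefix and the search path entries point at different installations, that mismatch would be worth looking at before anything else.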