Question

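I am looking at this website: http://www.weil.com/michaelfrancies/

Is there a better way to extract data such as a person's education or areas of expertise than the approach below? My goal is to make the program as generic as possible, so that it works on any biography page on the web.

Should I try using nltk?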

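import numpy as np

# `dom` (the parsed page elements) and `uni` (the keywords to look for)
# come from earlier code that is not shown here.
collect = []

# What happens if I don't specify any tags?
for i in dom:
    sib = str(i)
    # print(len(sib))
    if len(sib) <= 100:
        # Keep short snippets that mention at least one keyword.
        if any(c in sib for c in uni):
            collect.append(sib.strip())

np.unique([x for x in collect if len(x) <= 100])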

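To clarify: I know how to use pattern and requests with XPath. However, I would like a generic scraper that works across many websites. It seems that with XPath-based programs you have to specify in advance which tags and classes to search for?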

For example, on some websites the education section sits inside a 'p' tag, while on others it follows a 'br'.

Output

array([ 'Manchester University (LL.B.,&nbsp;1978);&nbsp;College of Law, London (LSF,&nbsp;1979)'], 
      dtype='|S86')
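
The 'p' versus 'br' difference described above can be sidestepped by ignoring tags altogether and looking only at the order of the text. The following is a minimal sketch of that idea, not a tested general-purpose scraper: it uses the standard-library `html.parser` rather than pattern or lxml, the names `TextChunks` and `text_after_heading` are made up for illustration, and the two HTML snippets are hypothetical layouts, not copies of any real page. It assumes the heading (for example "Education") appears as its own text chunk directly before the content.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the text chunks of a page, ignoring which tags hold them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Collapse runs of whitespace (including non-breaking spaces).
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def text_after_heading(html_text, headings):
    """Return the text chunk that follows each chunk equal to a heading."""
    parser = TextChunks()
    parser.feed(html_text)
    parser.close()
    wanted = {h.lower() for h in headings}
    results = []
    # Pair every chunk with the one after it, regardless of the markup between.
    for current, following in zip(parser.chunks, parser.chunks[1:]):
        if current.lower() in wanted:
            results.append(following)
    return results


# Hypothetical layouts: the same content under a <p> and after a <br>.
P_LAYOUT = "<div><h3>Education</h3><p>Manchester University (LL.B., 1978)</p></div>"
BR_LAYOUT = "<div><span>Education</span><br>Manchester University (LL.B., 1978)</div>"

if __name__ == "__main__":
    for layout in (P_LAYOUT, BR_LAYOUT):
        print(text_after_heading(layout, ["Education"]))
```

Both layouts go through the same code path because only the sequence of text matters. The trade-off is that anything sitting between the heading and the content (a stray label, script text) is picked up instead, so real pages would likely need extra filtering.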


Answer 1

Use lxml with XPath:

>>> import lxml.html
>>>
>>> tree = lxml.html.parse('http://www.weil.com/michaelfrancies/')
>>> root = tree.getroot()
>>> [x.tail.strip() for x in root.xpath('.//span[text()="Education"]/following-sibling::br')]
[u'Manchester University (LL.B.,\xa01978);\xa0College of Law, London (LSF,\xa01979)']

A more complete XPath, covering several possible headings:

.//span[text()="Education" or text()="Academic qualifications" or text()="LL.B"]/following-sibling::br
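
The expression above can also be generated from a list of headings rather than written by hand, which makes it easier to extend as new sites are added. This is a sketch under stated assumptions: `build_xpath` and `extract_after_headings` are illustrative names, not part of lxml, and the code assumes the headings contain no double quotes and that the page keeps the `span` followed by `br` layout.

```python
def build_xpath(headings):
    """Build an XPath selecting each <br> after a <span> whose text is a heading."""
    if not headings:
        raise ValueError("at least one heading is required")
    # Assumption: headings contain no double quotes (XPath 1.0 cannot escape them).
    conditions = " or ".join('text()="{}"'.format(h) for h in headings)
    return ".//span[{}]/following-sibling::br".format(conditions)


def extract_after_headings(html_text, headings):
    """Return the text that follows each matching <br> in an HTML string."""
    import lxml.html  # third-party; imported here so build_xpath works without it

    root = lxml.html.fromstring(html_text)
    results = []
    for br in root.xpath(build_xpath(headings)):
        # .tail is None when no text follows the <br>, so guard before stripping.
        tail = (br.tail or "").strip()
        if tail:
            results.append(tail)
    return results


if __name__ == "__main__":
    print(build_xpath(["Education", "Academic qualifications", "LL.B"]))
```

With the three headings shown, `build_xpath` returns the same expression as above. The approach is still tied to one page layout; only the list of heading texts becomes configurable.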

More generic (tag-agnostic) way to grab education and other unstructured data from multiple websites?
