Question

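I want to get all the index words and their definitions from here. Is it possible to scrape web content with Python?

Exploring with Firebug shows that the following URL returns the content I want, including the index entry for 'a' and its definition.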

http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined

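Which modules should I use? Are there any tutorials available?

I don't know how many words are indexed in the dictionary. I'm an absolute beginner at programming.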

Answer 1

You should use urllib2 to fetch the URL contents and BeautifulSoup to parse the HTML/XML.

Example: retrieving all questions from the StackOverflow.com home page:

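import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and hand the response to the parser
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)

# Calling the soup with a tag name returns every matching element
for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print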

This code sample is adapted from the BeautifulSoup documentation.
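The code above is Python 2 (urllib2 was folded into urllib.request in Python 3). If you are on Python 3 and would rather not install anything, here is a minimal sketch of the same idea using only the standard library's html.parser. The names HeadingCollector and extract_h3 and the sample markup are made up for illustration; a real page will need more care than this.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while we are inside an <h3>
        self.headings = []    # finished heading texts
        self._parts = []      # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.headings.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self._parts.append(data)


def extract_h3(html):
    """Return the text of each <h3> in the given HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample markup. A real page would come from something like
# urllib.request.urlopen(url).read().decode("utf-8").
sample = '<h3><a href="/q/1">First question</a></h3><p>x</p><h3>Second</h3>'
print(extract_h3(sample))
```

BeautifulSoup is still the more forgiving choice for messy real-world HTML; this only shows that the fetch-then-parse pattern does not depend on it.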

Answer 2

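You can use the built-in urllib or urllib2 to fetch data from the web, but the parsing itself is the most important part. May I recommend the wonderful BeautifulSoup? It can handle just about anything. http://www.crummy.com/software/BeautifulSoup/

The documentation is structured like a tutorial, more or less: http://www.crummy.com/software/BeautifulSoup/documentation.html

In your case, you will probably need to use wildcards to see all the entries in the dictionary. You can do something like this: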

import urllib2

def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count)).read()

    # TODO:
    # parse xml here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)
    articles = []  # placeholder until the parsing is written

    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    # article_id must be an integer because of the %d below
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id).read()

    # TODO: parse xml
    parsed_article = xml  # placeholder: raw XML until the parsing is written
    return parsed_article
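As a side note on building those URLs: gluing the query string together with % formatting works, but it breaks as soon as a search term contains a space or an ampersand. Here is a small Python 3 sketch using urllib.parse.urlencode instead. The parameter names are the ones from the URLs above; the helper names search_url and article_url are made up, and leaving '*' unescaped is an assumption about what the server expects.

```python
from urllib.parse import urlencode

BASE_URL = "http://pali.hum.ku.dk/cgi-bin/cpd/pali"


def search_url(query, start_index, count):
    """Build the search URL for one page of results."""
    params = [
        ("acti", "xsea"),
        ("tsearch", query),
        ("rfield", "entr"),
        ("recf", start_index),
        ("recc", count),
    ]
    # safe="*" keeps the wildcard literal (assumption: the server wants it
    # that way); everything else is percent-encoded as needed.
    return BASE_URL + "?" + urlencode(params, safe="*")


def article_url(article_id):
    """Build the URL for a single article."""
    params = [("acti", "xart"), ("arid", article_id), ("sphra", "undefined")]
    return BASE_URL + "?" + urlencode(params)


print(search_url("ana*", 0, 100))
print(article_url(14179))
```

The second call reproduces the URL from the question, which is a quick sanity check that the parameters are assembled in the same way.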

Now you can loop over the results. For example, to get all articles starting with 'ana', use the wildcard 'ana*' and loop until there are no more results:

query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break

    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)

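When it finishes, you will have a dictionary of all article contents keyed by name. I left out the parsing itself, but it is quite simple in this case since everything is XML. You may not even need BeautifulSoup (although it is still handy and easy to use with XML).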
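To give an idea of what the omitted parsing step might look like, here is a rough Python 3 sketch with xml.etree.ElementTree. I have not checked the server's actual response format, so the element name "article", the attribute "id", and the sample data are all hypothetical; look at the real XML first and adjust the names to match.

```python
import xml.etree.ElementTree as ET

# Hypothetical response layout and made-up entries: the real element and
# attribute names must be checked against what the server actually returns.
SAMPLE = """<results>
  <article id="1">entry one</article>
  <article id="2">entry two</article>
</results>"""


def parse_articles(xml_text):
    """Return a list of (article_name, article_id) tuples."""
    root = ET.fromstring(xml_text)
    return [(node.text.strip(), int(node.get("id")))
            for node in root.iter("article")]


print(parse_articles(SAMPLE))
```

A function like this would slot into the TODO in getArticles. An empty result set gives back an empty list, which is exactly the condition the loop above uses to stop.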
