How do I read data from arbitrary URLs using Python and MongoDB?

Asked: 2015-05-01 03:12:09

Tags: python mongodb

I am currently building a proof of concept (POC) with Python 3.4.3 and MongoDB.

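I need to search www.socialmention.com for an arbitrary string, such as "finance" or "Apple quarterly results". The result is a set of URLs that I cannot predict in advance. I then need to parse each link and read the article, comments, likes, user details, and so on.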

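So far I have managed to capture the result URLs from Social Mention. My idea is to create a blog dictionary in MongoDB that holds, for each site, information like the following: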

> db.blogs_dictionary.find().pretty()
{
    "_id" : ObjectId("55401455a1ce265d58f21049"),
    "blog_name" : "www.networkcomputing.com",
    "article" : "yes",
    "article_tag" : "div",
    "article_tag_type" : "id",
    "article_string" : "article-main",
    "article_multipage" : "yes",
    "article_multipage_tag" : "span",
    "article_multipage_tag_type" : "class",
    "article_multipage_tag_string" : "blue strong allcaps",
    "article_multipage_query_variable" : "page_number",
    "comments" : "yes",
    "comments_multipage" : "no",
    "comments_multipage_tag" : "",
    "comments_multipage_tag_type" : "",
    "comments_multipage_tag_string" : "",
    "comments_threaded" : "yes",
    "comments_threaded_query_variable" : "piddl_msgorder",
    "comments_threaded_query_value" : "thrd#msgs",
    "comments_main" : "yes",
    "comments_main_tag" : "div",
    "comments_main_tag_type" : "class",
    "comments_main_tag_string" : "comments-main",
    "user_name" : "yes",
    "user_name_tag" : "span",
    "user_name_tag_type" : "class",
    "user_name_tag_string" : "smaller strong black",
    "user_rank" : "yes",
    "user_rank_tag" : "span",
    "user_rank_tag_type" : "class",
    "user_rank_tag_string" : "smaller black",
    "comments_body" : "yes",
    "comments_body_tag" : "div",
    "comments_body_tag_type" : "class",
    "comments_body_tag_string" : "comment-body"
}

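Then, in the Python code, the logic would be: if a link from Social Mention belongs to a blog in my blog dictionary, check whether that blog has an article and comments; if it does, open the URL and read the required content. To make this work, I need to pass the tag and the search string dynamically: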

import urllib.request
from bs4 import BeautifulSoup

# db is assumed to be an existing pymongo database handle
for i in db.social_mention.find({}, {"blog_name": 1, "_id": 0}):
   for j in db.blogs_dictionary.find({}, {"blog_name": 1, "_id": 0}):
      if i['blog_name'] == j['blog_name']:
         link = db.social_mention.find_one({"blog_name": i['blog_name']}, {"link": 1, "_id": 0})
         url = link['link']
         print(url)
         # find_one() returns a single document (a dict); find() returns a cursor,
         # which never compares equal to "yes" and has no pretty() method in pymongo
         article_variables = db.blogs_dictionary.find_one({"blog_name": j['blog_name']}, {"_id": 0})
         if article_variables['article'] == "yes":
            soup = BeautifulSoup(urllib.request.urlopen(url), "html.parser")
            # The next line is the one that raises the SyntaxError described below
            data = soup.find(article_variables['article_tag'], article_variables['article_tag_type']=article_variables['article_string'])
            print(data.text)

But I get an error along the lines of "keyword can't be an expression". Is there another way to do this, or should I change my design?

1 answer:

Answer 0: (score: 0)

I think you want to call find() with the attrs attribute dictionary:

data = soup.find(article_variables['article_tag'],
                 attrs={article_variables['article_tag_type']: article_variables['article_string']})

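The reason: the name of a keyword argument must be written as a plain identifier in the call; it cannot be an expression that evaluates to a string. That is, in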

article_variables['article_tag_type']=article_variables['article_string']

article_variables['article_tag_type'] is not a valid identifier for a keyword argument. The general workaround is to build a dictionary and unpack it, as follows:

kwargs = {article_variables['article_tag_type']: article_variables['article_string']}
data=soup.find(article_variables['article_tag'], **kwargs)
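
To see the difference in isolation, here is a small sketch that does not need BeautifulSoup or MongoDB. Note that find_stub is a hypothetical stand-in that only records the arguments it receives; it is not part of any library, and the sample values are copied from the blogs_dictionary document in the question.

```python
def find_stub(name, attrs=None, **kwargs):
    """Hypothetical stand-in for a find()-style function: returns the filters it was given."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters


# Sample values taken from the blogs_dictionary document in the question.
article_variables = {
    "article_tag": "div",
    "article_tag_type": "id",
    "article_string": "article-main",
}

# The keyword name is only known at run time, so build a dict and unpack it.
kwargs = {article_variables["article_tag_type"]: article_variables["article_string"]}
result = find_stub(article_variables["article_tag"], **kwargs)
print(result)

# A subscript expression on the left of "=" inside a call is rejected when
# the source is compiled, before any of it runs.
try:
    compile("f(d['k']=1)", "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)
```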

However, since find() accepts an attrs dictionary, you can pass it directly.
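
As for the design question, one possible way to keep the lookup dynamic is to turn each group of fields in a dictionary entry into a (tag, attrs) pair. The following is only a sketch under assumptions: find_args and its string_key parameter are hypothetical names introduced here, and the entry is a plain dict holding a subset of the sample document from the question rather than a live MongoDB result.

```python
def find_args(entry, prefix, string_key=None):
    """Return (tag, attrs) for the field group `prefix`, or None if the entry marks it absent."""
    if entry.get(prefix) != "yes":
        return None
    # Most groups store the value under "<prefix>_tag_string"; the article
    # group in the sample document uses "article_string" instead.
    string_key = string_key or prefix + "_tag_string"
    tag = entry[prefix + "_tag"]
    attrs = {entry[prefix + "_tag_type"]: entry[string_key]}
    return tag, attrs


# Subset of the sample blogs_dictionary document from the question.
entry = {
    "blog_name": "www.networkcomputing.com",
    "article": "yes",
    "article_tag": "div",
    "article_tag_type": "id",
    "article_string": "article-main",
    "comments_main": "yes",
    "comments_main_tag": "div",
    "comments_main_tag_type": "class",
    "comments_main_tag_string": "comments-main",
}

article = find_args(entry, "article", string_key="article_string")
comments = find_args(entry, "comments_main")
print(article)
print(comments)
```

With a parsed page in hand, each pair could then be used as tag, attrs = article followed by soup.find(tag, attrs=attrs). Passing attrs also sidesteps the fact that class is a reserved word in Python and cannot be written as a literal keyword argument.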