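I am currently working on a POC that uses Python 3.4.3 and MongoDB.
I need to search www.socialmention.com for arbitrary strings such as "finance" or "Apple quarterly results". The result is a set of URLs that varies from search to search. I then need to parse each link and read the article, comments, likes, user details, and so on.
So far I have successfully captured the result URLs from Social Mention. My idea is to create a blog dictionary in MongoDB that stores the following information for each site: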
> db.blogs_dictionary.find().pretty()
{
"_id" : ObjectId("55401455a1ce265d58f21049"),
"blog_name" : "www.networkcomputing.com",
"article" : "yes",
"article_tag" : "div",
"article_tag_type" : "id",
"article_string" : "article-main",
"article_multipage" : "yes",
"article_multipage_tag" : "span",
"article_multipage_tag_type" : "class",
"article_multipage_tag_string" : "blue strong allcaps",
"article_multipage_query_variable" : "page_number",
"comments" : "yes",
"comments_multipage" : "no",
"comments_multipage_tag" : "",
"comments_multipage_tag_type" : "",
"comments_multipage_tag_string" : "",
"comments_threaded" : "yes",
"comments_threaded_query_variable" : "piddl_msgorder",
"comments_threaded_query_value" : "thrd#msgs",
"comments_main" : "yes",
"comments_main_tag" : "div",
"comments_main_tag_type" : "class",
"comments_main_tag_string" : "comments-main",
"user_name" : "yes",
"user_name_tag" : "span",
"user_name_tag_type" : "class",
"user_name_tag_string" : "smaller strong black",
"user_rank" : "yes",
"user_rank_tag" : "span",
"user_rank_tag_type" : "class",
"user_rank_tag_string" : "smaller black",
"comments_body" : "yes",
"comments_body_tag" : "div",
"comments_body_tag_type" : "class",
"comments_body_tag_string" : "comment-body"
}
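Then, in the Python code, the idea is: if a link from the Social Mention results belongs to a blog in my blog dictionary, check whether that blog has an article and comments; if it does, open the URL and read the required content. To do all this, I need to pass the tag and the search string dynamically: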
import urllib.request
from bs4 import BeautifulSoup

for i in db.social_mention.find({}, {"blog_name": 1, "_id": 0}):
    for j in db.blogs_dictionary.find({}, {"blog_name": 1, "_id": 0}):
        if i['blog_name'] == j['blog_name']:
            link = db.social_mention.find_one({"blog_name": i['blog_name']}, {"link": 1, "_id": 0})
            url = link['link']
            print(url)
            # find_one() returns a single document (a dict); find() returns a cursor,
            # and .pretty() exists only in the mongo shell, not in PyMongo
            article_variables = db.blogs_dictionary.find_one({"blog_name": j['blog_name']}, {"article": 1, "article_tag": 1, "article_tag_type": 1, "article_string": 1, "article_multipage": 1, "article_multipage_tag": 1, "article_multipage_tag_type": 1, "article_multipage_tag_string": 1, "article_multipage_query_variable": 1, "_id": 0})
            if article_variables['article'] == "yes":
                soup = BeautifulSoup(urllib.request.urlopen(url))
                # This call is what raises the SyntaxError: a keyword name cannot be an expression
                data = soup.find(article_variables['article_tag'], article_variables['article_tag_type']=article_variables['article_string'])
                print(data.text)
But I get an error like "keyword can't be an expression". Is there another way to do this, or should I change my design?
Answer 0 (score: 0)
I think you want to call find() with the attrs attribute dictionary:
data = soup.find(article_variables['article_tag'],
                 attrs={article_variables['article_tag_type']: article_variables['article_string']})
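Reason: a keyword argument name must be a plain identifier written literally in the call; it cannot be computed from a string. In article_variables['article_tag_type']=article_variables['article_string'], the left-hand side article_variables['article_tag_type'] is a subscript expression, not a valid identifier for a keyword argument. The general workaround is to build a dictionary and unpack it, as follows: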
kwargs = {article_variables['article_tag_type']: article_variables['article_string']}
data=soup.find(article_variables['article_tag'], **kwargs)
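The same behaviour can be reproduced without BeautifulSoup or MongoDB. The following is a minimal sketch, not part of the original answer: find_stub is a hypothetical stand-in for any function that accepts arbitrary keyword arguments, and the dictionary values are assumed from the blog dictionary shown in the question.

```python
def find_stub(name, **attrs):
    """Hypothetical stand-in for a function that accepts arbitrary keyword filters."""
    return name, attrs


# Assumed values, mirroring the blog dictionary document in the question
article_variables = {
    "article_tag": "div",
    "article_tag_type": "id",
    "article_string": "article-main",
}

# A subscript expression in the keyword-name position is rejected at parse time,
# so the call never runs; compile() lets us observe that without crashing.
bad_call = "find_stub('div', article_variables['article_tag_type']='article-main')"
try:
    compile(bad_call, "<demo>", "eval")
    rejected = False
except SyntaxError:
    rejected = True

# Building the dictionary first lets the key be computed at run time;
# ** then unpacks it into ordinary keyword arguments.
kwargs = {article_variables["article_tag_type"]: article_variables["article_string"]}
result = find_stub(article_variables["article_tag"], **kwargs)

print(rejected)
print(result)
```

Unpacking only works when every key is a string, because each key becomes a keyword argument name.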
However, since find() accepts an attrs dictionary, you can pass it directly.
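As an illustration of the attrs approach end to end, here is a minimal sketch that runs against an inline HTML string instead of a live URL. The helper name extract_text, the sample markup, and the choice of the "html.parser" backend are assumptions made for this example; only the tag, attribute type, and attribute value fields come from the blog dictionary in the question.

```python
from bs4 import BeautifulSoup


def extract_text(html, tag, attr_name, attr_value):
    """Return the text of the first matching element, or None if nothing matches.

    Hypothetical helper: tag, attr_name and attr_value are plain strings,
    for example values read from a blogs_dictionary document.
    """
    soup = BeautifulSoup(html, "html.parser")
    # attrs takes a dictionary, so the attribute name can be decided at run time
    node = soup.find(tag, attrs={attr_name: attr_value})
    return node.get_text(strip=True) if node is not None else None


# Made-up markup standing in for a downloaded blog page
sample_html = """
<html><body>
  <div id="article-main">Quarterly results were strong.</div>
  <div class="comment-body">Nice summary.</div>
</body></html>
"""

# Assumed values, mirroring the blog dictionary document in the question
article_variables = {
    "article_tag": "div",
    "article_tag_type": "id",
    "article_string": "article-main",
}

article_text = extract_text(
    sample_html,
    article_variables["article_tag"],
    article_variables["article_tag_type"],
    article_variables["article_string"],
)
comment_text = extract_text(sample_html, "div", "class", "comment-body")
missing = extract_text(sample_html, "span", "class", "no-such-class")

print(article_text)
print(comment_text)
print(missing)
```

Returning None when nothing matches avoids the AttributeError that data.text would raise if a page's markup differs from what the dictionary entry describes.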