There are many posts here asking how to automate a search on Google. I chose to use BeautifulSoup, and I have read many of the questions about it, but I could not find a direct answer to my problem, even though the specific task seems commonplace. My code below is self-explanatory; the bracketed sections are where I ran into trouble (edit: by "ran into trouble" I mean I could not figure out how to implement my pseudocode for this part, and after reading the documentation and searching online for similar code I still did not know how to do it). If it helps, I think my problem may be very similar to that of anyone doing automated searches on PubMed to find specific articles of interest. Thanks very much.
#Find Description
from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2

input_csv = "Company.csv"
output_csv = "output.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("Name",)
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("Name", "Description")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["Name"]
                url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)
                #STEP ONE: Enter "search term" into Google Search
                #req = urllib2.Request(url, None, {'User-Agent': 'Google Chrome'})
                #res = urllib2.urlopen(req)
                #dat = res.read()
                #res.close()
                #BeautifulSoup(dat)
                #STEP TWO: Find Description
                #if there is a wikipedia page for the entity:
                    #return first sentence of wikipedia page
                #if other site:
                    #return all sentences that have the keyword "keyword" in them
                #STEP THREE: Return Description as "google_search" variable
                row["Description"] = google_search
                writer.writerow(row)

if __name__ == "__main__":
    main()
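STEP TWO's pseudocode ("return first sentence", "return all sentences that have the keyword in them") can be prototyped as plain string handling, independent of the scraping itself. A rough sketch (Python 3 here for illustration; the function names and the naive punctuation-based splitting are my own assumptions, and abbreviations like "Inc." will confuse it):

```python
import re

def first_sentence(text):
    """Return the first sentence of a block of text.

    Naively treats '.', '!' or '?' followed by whitespace (or end of
    string) as a sentence boundary, so this is a starting point, not
    a robust sentence splitter.
    """
    text = ' '.join(text.split())  # collapse runs of whitespace
    match = re.search(r'(.+?[.!?])(?:\s|$)', text)
    return match.group(1) if match else text

def sentences_with_keyword(text, keyword):
    """Return every sentence containing the keyword (case-insensitive)."""
    parts = re.split(r'(?<=[.!?])\s+', ' '.join(text.split()))
    return [s for s in parts if keyword.lower() in s.lower()]
```

For example, `first_sentence("Acme Corp makes anvils. It was founded in 1947.")` gives `"Acme Corp makes anvils."`, which is the kind of one-line description the question is after.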
Addendum
For anyone working on this or researching it, I came up with a suboptimal solution that I am still finishing. But I thought I would post it in case it helps anyone else who lands on this page. Basically, rather than dealing with the problem of deciding which web page to pick, I just added a preliminary step that does all of the searches on Wikipedia. It is not what I want, but at least it makes it easy to get a subset of the entities. The code is in two files (Wikipedia.py and wiki_test.py):
#Wikipedia.py
from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import wiki_test

input_csv = "Name.csv"
output_csv = "WIKIPEDIA.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("A", "C", "E", "M", "O", "N", "P", "Y")
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("A", "C", "E", "M", "O", "N", "P", "Y", "Description")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["A"]
                print(search_term)
                row["Description"] = wiki_test.wiki(search_term)
                writer.writerow(row)

if __name__ == "__main__":
    main()
And the helper module, based on the post Extract the first paragraph from a Wikipedia article (Python):
#wiki_test.py
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')]  # Wikipedia rejects urllib2's default user agent
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    # Return the text of the first paragraph of the article body
    first_p = soup.find('div', id="bodyContent").p
    return ''.join(first_p.findAll(text=True))
I still need to fix it to handle HTTP 404 errors (i.e., page not found), but this code works for anyone who wants to look up basic company information that is available on Wikipedia. Again, I would rather have something that works on a Google search, finds the relevant site, and returns the sections of that site that mention "keyword", but at least this current program gets us something.
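The missing 404 handling mentioned above amounts to catching the HTTPError that urllib raises for non-2xx responses. A minimal sketch, written against Python 3's urllib.request rather than the post's Python 2 urllib2 (the `fetch_or_none` name, the injectable `opener` parameter, and the None fallback are my own choices):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def fetch_or_none(url, opener=urlopen):
    """Fetch url and return the response body, or None on a 404.

    `opener` is injectable so the 404 path can be exercised without a
    network connection; by default it is urllib.request.urlopen.
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        return opener(req).read()
    except HTTPError as e:
        if e.code == 404:
            return None  # no such article: skip this entity
        raise  # any other HTTP error is a real problem; surface it
```

The caller can then test for None and move on to the next company instead of crashing mid-run.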