I am trying to extract text from a web page. I can get the page's URL, but I am stuck on the next step because I do not know how to work with BeautifulSoup.
import urllib
from bs4 import BeautifulSoup
import xml.dom.minidom
keyWord = raw_input("Enter the key-word : ")
address = "http://openapi.naver.com/search?key=c1b406b32dbbbbeee5f2a36ddc14067f&query=" + keyWord + "&display=5&start=1&target=kin&sort=sim"
search_result = urllib.urlopen(address)
raw_data = search_result.read()
parsed_result = xml.dom.minidom.parseString(raw_data)
links = parsed_result.getElementsByTagName('link')
extracted_URL = links[0].firstChild.nodeValue
page = urllib.urlopen(extracted_URL).read()
Answer 0 (score: 3)
You need to initialize the BeautifulSoup object with xml as the markup type:
import urllib
from bs4 import BeautifulSoup
keyWord = raw_input("Enter the key-word : ")
address = "http://openapi.naver.com/search?key=c1b406b32dbbbbeee5f2a36ddc14067f&query=" + keyWord + "&display=5&start=1&target=kin&sort=sim"
soup = BeautifulSoup(urllib.urlopen(address), 'xml')
print [link.text for link in soup.find_all('link')]
This prints (for the keyword test):
[u'http://search.naver.com',
u'http://openapi.naver.com/l?AAAA3IOQ6AIBRF0dVIaQQUq1/YuA+GRzDECb8m7F5uTnXvF6US42HB9QLl7RAZlbx042CcVsG1AExRWW1C8LL9OYpUECkxX51eOrU2D2zxqT/sh9L7c/8BHpFL8lsAAAA=',
...
]
Also, it is worth going through the Quick Start section of the documentation.
Hope that helps.
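To get from the extracted link to the page text, something along these lines might work. This is only a sketch in Python 3 syntax: SAMPLE_XML and SAMPLE_HTML are made-up stand-ins for the API response and the linked page (the real ones may be structured differently), and extract_links and page_text are hypothetical helper names. The "xml" feature assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for the API response; the real feed may differ.
SAMPLE_XML = """<rss version="2.0">
  <channel>
    <link>http://search.naver.com</link>
    <item>
      <title>first result</title>
      <link>http://example.com/first</link>
    </item>
  </channel>
</rss>"""

# Made-up sample standing in for the page behind one of the links.
SAMPLE_HTML = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Some  text.</p></body></html>"
)


def extract_links(xml_text):
    """Return the text of every <link> element in an XML document."""
    soup = BeautifulSoup(xml_text, "xml")  # assumption: lxml is available
    return [link.get_text(strip=True) for link in soup.find_all("link")]


def page_text(html_text):
    """Return the visible text of an HTML page with whitespace collapsed."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Drop script and style elements so their contents are not returned as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    print(extract_links(SAMPLE_XML))
    print(page_text(SAMPLE_HTML))
```

With live data you would pass the downloaded response body to extract_links, pick one of the returned URLs, download that page (for example with urllib.request.urlopen in Python 3), and pass its contents to page_text.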