Web crawler not working on nested divs

Date: 2015-08-31 11:14:25

Tags: python web-crawler


I am trying to build a web crawler that fetches people's interests. Here is the code:

import requests
from bs4 import BeautifulSoup

def facebook_spider():
    url = 'https://www.facebook.com/abhas.mittal7'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text , "html.parser")
    # find_all searches every descendant, so nesting depth does not matter
    for div in soup.find_all('div', attrs={'class': 'mediaRowWrapper'}):
        print(div.text)

facebook_spider()

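It prints nothing. However, if I use a different div class (one for a div near the top of the page), the content is printed. I suspected a problem with nested divs, but I tried this code on a sample HTML page with many levels of nested divs and it worked. Please help.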
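The check described above can be reproduced with a minimal sketch. The sample markup and the helper names `find_texts` and `class_in_source` are hypothetical and not part of the original post. The sketch shows that `find_all` matches a class at any nesting depth. Under that assumption, an empty result would suggest the class is simply absent from the HTML that `requests` received, which a plain substring test can confirm.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the target divs are nested inside three wrapper divs.
SAMPLE_HTML = """
<div class="outer">
  <div class="middle">
    <div class="inner">
      <div class="mediaRowWrapper">Music</div>
      <div class="mediaRowWrapper">Books</div>
    </div>
  </div>
</div>
"""


def find_texts(html, css_class):
    """Return the stripped text of every div carrying css_class, at any depth."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all("div", attrs={"class": css_class})
    return [div.get_text(strip=True) for div in matches]


def class_in_source(html, css_class):
    """Report whether the class name appears anywhere in the raw HTML text."""
    return css_class in html


print(find_texts(SAMPLE_HTML, "mediaRowWrapper"))
print(class_in_source(SAMPLE_HTML, "mediaRowWrapper"))
```

Running `class_in_source(plain_text, 'mediaRowWrapper')` inside `facebook_spider()` would be one way to tell a parsing problem apart from markup that never arrived in the response.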

1 answer:

Answer 0 (score: 0)

See if this works:

from urllib.request import urlopen  # Python 3 equivalent of urllib.urlopen
from bs4 import BeautifulSoup

url = 'https://www.facebook.com/abhas.mittal7'
html = urlopen(url)
htmltext = html.read()

def gettext(htmltext):
    soup = BeautifulSoup(htmltext, "html.parser")
    # Remove script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()
    # Strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces so every phrase gets its own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with newlines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text  # or encode it, write it to a file, whatever you see fit


print(gettext(htmltext))
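
The whitespace cleanup at the end of `gettext` can be tried on its own, without any network access. The sketch below repeats the same three steps on a hypothetical sample string; the helper name `clean_text` and the sample text are assumptions for illustration only.

```python
def clean_text(text):
    """Strip each line, split it on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical text resembling what soup.get_text() returns for a messy page
sample = "  Interests  \n\n   Music    Books  \n  Movies\n"
print(clean_text(sample))
```

Each non-empty phrase ends up on its own line, so blank lines and runs of spaces left behind by the removed tags disappear.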