Returns a list with only 20 entries and never goes beyond that

Asked: 2018-06-01 18:47:49

Tags: python python-3.x web-scraping beautifulsoup

#importing the libraries
import urllib.request as urllib2

from bs4 import BeautifulSoup

#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"

page=urllib2.urlopen(quote_page)

#parsing the html
soup = BeautifulSoup(page,"html.parser")

# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})

#finding all  the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)

#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]

#extracting all the answers and putting into a list 
finalans=[]
l=0
for i in chunk:
    stri=chunk[l]
    finalans.append(stri.text)
    l+=1
    continue

final_string = '\n'.join(finalans)

#final output
print(final_string)

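I can't get more than 20 entries in this list. What is wrong with this code? (I am a beginner, and I used some references to write this program.) Edit: I have added the URL I want to scrape.

1 Answer:

Answer 0 (score: 0)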

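You try to break ans into smaller chunks, but note that each iteration of this loop discards the previous contents of chunk, so you lose everything except the last chunk of data.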

#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]    # overwrites previous chunk

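That is why you only get 20 items in the list: it is just the last chunk. Since you want final_string to hold all of the text nodes, there is no need to chunk at all, so I simply removed that step.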
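To see the overwrite in isolation, here is a minimal sketch that uses a plain list of strings in place of the scraped tags. The count of 120 is an assumption, chosen only so that the last slice has 20 items; the real page may return a different number.

```python
# Stand-in for the scraped tags: 120 placeholder strings (assumed count).
ans = [f"answer {n}" for n in range(120)]

# Original pattern: each iteration rebinds `chunk`,
# so only the last slice survives the loop.
for i in range(0, len(ans), 100):
    chunk = ans[i:i + 100]

print(len(chunk))  # 20

# If chunks were actually wanted, collect them instead of overwriting.
chunks = [ans[i:i + 100] for i in range(0, len(ans), 100)]
print([len(c) for c in chunks])  # [100, 20]
```

With 120 items and a step of 100, the loop runs twice, and the second pass replaces the first 100 items with the remaining 20.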

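Next, and this is just tightening up the code, you don't need to iterate over the list's values and track an index just to fetch the same value you are already iterating over. Working on ans directly, since we are no longer chunking,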

finalans=[]
l=0
for i in ans:
    stri=ans[l]
    finalans.append(stri.text)
    l+=1
    continue

becomes

finalans=[]
for item in ans:
    finalans.append(item.text)

or, more concisely,

finalans = [item.text for item in ans]
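As a quick check that the three forms are equivalent, the sketch below runs them side by side. The SimpleNamespace objects are hypothetical stand-ins for the BeautifulSoup tags; only their .text attribute matters here.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the Tag objects returned by find_all.
ans = [SimpleNamespace(text=t) for t in ("first", "second", "third")]

# Index-tracking version, as in the question.
by_index = []
l = 0
for _ in ans:
    by_index.append(ans[l].text)
    l += 1

# Direct iteration over the items.
by_loop = []
for item in ans:
    by_loop.append(item.text)

# List comprehension.
by_comprehension = [item.text for item in ans]

print(by_index == by_loop == by_comprehension)  # True
```

All three build the same list; the comprehension simply removes the bookkeeping.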

So the program is

#importing the libraries
import urllib.request as urllib2

from bs4 import BeautifulSoup

#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"

page=urllib2.urlopen(quote_page)

#parsing the html
soup = BeautifulSoup(page,"html.parser")

# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})

#finding all  the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)

#extracting all the answers and putting into a list 
finalans = [item.text for item in ans]

final_string = '\n'.join(finalans)

#final output
print(final_string)