#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]
#extracting all the answers and putting into a list
finalans=[]
l=0
for i in chunk:
    stri=chunk[l]
    finalans.append(stri.text)
    l+=1
    continue
final_string = '\n'.join(finalans)
#final output
print(final_string)
I can't get more than 20 entries into this list. What is wrong with this code? (I'm a beginner and used some references to write this program.) Edit: I have added the URL I want to scrape.
Answer (score: 0)
You are trying to break ans into smaller chunks, but notice that each iteration of this loop discards the previous contents of chunk, so you lose all but the last chunk of data.
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]  # overwrites previous chunk
This is why you only get 20 items in the list... it is just the last chunk. Since you want final_string to hold all of the text nodes, you don't need chunking at all, so I simply removed it.
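(As an aside: if you ever did want to keep every chunk, you would accumulate them into a list instead of overwriting a single variable. A minimal sketch, assuming the same chunk size of 100:)

# collect every chunk instead of overwriting one variable
chunks = [ans[i:i+100] for i in range(0, len(ans), 100)]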
Next, just to tighten up the code: you don't need to iterate over a list's values while also tracking an index, only to fetch the same value you are already iterating over. Working directly on ans, since we are no longer chunking,
finalans=[]
l=0
for i in ans:
    stri=ans[l]
    finalans.append(stri.text)
    l+=1
    continue
becomes
finalans=[]
for item in ans:
    finalans.append(item.text)
or, more succinctly,
finalans = [item.text for item in ans]
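(If you genuinely do need the index alongside each item, enumerate is the idiomatic replacement for a manual counter. A quick sketch, purely for illustration:)

# enumerate yields (index, item) pairs, so no manual counter is needed
for idx, item in enumerate(ans):
    print(idx, item.text)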
So the program is
#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https:abcdef.com"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#extracting all the answers and putting into a list
finalans = [item.text for item in ans]
final_string = '\n'.join(finalans)
#final output
print(final_string)
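One caveat: soup.find returns None when no matching tag exists, so the find_all call will raise an AttributeError if Quora's markup changes. A small defensive sketch, assuming the same class name:

# guard against the container div being missing or renamed
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
if name_box is None:
    raise SystemExit("AnswerListDiv not found; the page markup may have changed.")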