As an exercise, I'm trying to print the titles of all posts on reddit.com that have more than 200 comments.
What I tried:
import requests
from bs4 import BeautifulSoup

url1 = "https://www.reddit.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
res = requests.get(url1, headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.content, "html5lib")

g = soup.select('ul > li.first')                               # "NNN comments" links
j = soup.select('#siteTable div.entry.unvoted > p.title > a')  # post title links

list1 = []
for t in j:
    list1.append(t.text)

list2 = []
for s in g:
    for p in s.text.split(" "):
        if p.isdigit():
            p = int(p)
            if p > 100:
                list2.append(p)

for q, l in zip(list1, list2):
    if l > 200:
        print(q, l)
The problem:
It works for roughly the first half, then hiccups somewhere and the two lists no longer match up. As a result, I also end up printing posts that actually have fewer than 200 comments.
输出:
What the F David Blaine!! 789
So NYC MTA (subway) banned all dogs unless the owner carries them in a bag. I think this owner nailed it. 1075
Bad to the bone 307
TIL there is a "white man" café in Tokyo, where Japanese ladies ring a bell to summon tuxedo-wearing caucasians who respond with "yes, princess?" and serve them cake 2145
Earthquake Warning Issued in California 1410
Man impersonating officer busted for attempting to pull over unmarked cruiser 1022
Use of body-worn cameras sees complaints against police ‘virtually vanish’, study finds 2477
Amazing one handed interception 759
A purrfectly executed leap 518
"This bed has a fur pillow, I'll lay here." 792
Back in 'Nam, 1969. Guy on the left is a good friend of mine's dad. He's in hospice now and not doing well but he'll live on in photos. 264
Nintendo Entertainment System: NES Classic Edition - with 30 games - Available in US 11/11/16 290
A scenic view ruined by a drunk driver (Star Wars: Battlefront) 2737
Clouds battling a sunset over Olympic National Park, WA, USA (1334x750) [OC] 2222
What company is totally guilty of false advertising and why? 2746
South Korean President Park Geun-hye has called on North Koreans to abandon their country and defect, just a day after a soldier walked across the heavily fortified border into the South 410
TIFU by underestimating the stupidity of multiple people 334
Special Trump burger at a burger chain in South Africa 311
This Special Ed Teacher Had All of Her Students in Her Wedding 984
After the "A scenic view ruined by..." entry, the mismatch starts. Can someone point out the exact problem here, or suggest an alternative approach?
Answer 0 (score: 0)
Instead of saving everything into two separate lists first and hoping they stay aligned (list1[0] ~~ list2[0]), I looked for the common denominator (the parent element), applied the BeautifulSoup class selection a second time to dig deeper into the DOM (the children), and printed the result immediately. When scraping a heavily trafficked site like reddit, the page can change even between requests made seconds apart, which is a likely source of hiccups when you save data into separate lists and then compare them.
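To see the failure mode, here is a minimal sketch with made-up data (the titles and counts below are hypothetical, not real reddit output): because the question's loop only appends counts greater than 100, the second list can end up shorter than the first, and zip then silently pairs every later title with the wrong count:

# Hypothetical data illustrating the misalignment; not actual reddit output.
titles = ["post A", "post B", "post C"]
raw_counts = [789, 42, 307]  # "post B" actually has only 42 comments

# Filtering while collecting (as in the question) drops an entry...
counts = [c for c in raw_counts if c > 100]  # [789, 307]

# ...so "post B" is paired with post C's count,
# and the last title is silently dropped by zip.
for title, count in zip(titles, counts):
    print(title, count)  # prints: post A 789 / post B 307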
Solution:
import requests
from bs4 import BeautifulSoup

url1 = "https://www.reddit.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
res = requests.get(url1, headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.content, "html5lib")

k = soup.select('#siteTable div.entry.unvoted')  # parent: one node per post
for v in k:
    d = v.select('ul > li.first')  # comment link, e.g. "351 comments"
    o = v.select('p.title > a')    # title link
    for z, x in zip(d, o):
        for p in z.text.split(" "):  # pull the integer out of "351 comments"
            if p.isdigit():
                p = int(p)
                if p > 200:
                    print(z.text, x.text)  # print comment count first, then the title
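Since the asker also wanted alternatives: rather than depending on reddit's HTML structure at all, the front page is also available as a JSON listing at https://www.reddit.com/.json, where each post carries its title and comment count on the same object, so there is nothing to pair up afterwards. A minimal sketch, assuming the public .json endpoint and a User-Agent string of your own choosing:

import requests

# Sketch using reddit's JSON listing instead of scraping HTML.
headers = {'User-Agent': 'comment-count-exercise/0.1'}  # placeholder UA string
res = requests.get("https://www.reddit.com/.json", headers=headers)
res.raise_for_status()

for child in res.json()["data"]["children"]:
    post = child["data"]  # title and num_comments live on the same object
    if post["num_comments"] > 200:
        print(post["title"], post["num_comments"])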