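I am trying to build a web crawler that collects the subject lines from a forum. Once I have them, I want to display each subject on its own line, with [*] at the start of each line.
Using BeautifulSoup, I can fetch the page and extract the spans with class "subject". However, from there I am not sure how to pull out just the subject text and then order it the way I am trying to.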
import requests
from bs4 import BeautifulSoup

url = "https://boards.4channel.org/sci/"
# send the HTTP request
response = requests.get(url)
if response.status_code == 200:
    # pull the content
    html_content = response.content
    # send the page to BeautifulSoup
    html_doc = BeautifulSoup(html_content, "html.parser")
    # extract topic data
    topic_spider = html_doc.find_all("span", {"class": "subject"})
    print(topic_spider)
The crawler's current output looks like this:
[<span class="subject"></span>, <span class="subject"></span>, <span class="subject">Cigarettes vs. Cannabis</span>, <span class="subject">Cigarettes vs. Cannabis</span>, <span class="subject"></span>, <span class="subject"></span>, <span class="subject"></span>, <span class="subject"></span>, <span class="subject"></span>...
I am trying to order them like this:
[*] Topic 1
[*] Topic 2
[*] Topic 3
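Answer 0 (score: 1)
Check that each element's text is not empty, then remove the duplicates and sort the list, then loop over it and prepend [*] to each string.
I hope this helps. If not, let me know your expected output.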
import requests
from bs4 import BeautifulSoup

url = "https://boards.4channel.org/sci/"
# send the HTTP request
response = requests.get(url)
if response.status_code == 200:
    # pull the content
    html_content = response.content
    # send the page to BeautifulSoup
    html_doc = BeautifulSoup(html_content, "html.parser")
    # extract topic data
    topic_spider = html_doc.find_all("span", {"class": "subject"})
    data = []
    for topic in topic_spider:
        # skip spans that have no subject text
        if topic.text != '':
            data.append(topic.text)
    mylist = list(dict.fromkeys(data))  # remove the duplicates, keeping first occurrences
    mylist.sort()  # sort in ascending order
    for d in mylist:
        print('[*] ' + d)
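The filter, de-duplicate, sort, and prefix steps can also be tried without touching the network. The following is only a minimal sketch: the helper name format_topics, its prefix parameter, and the sample subjects are made up for illustration, and the case-insensitive sort is an assumption about the ordering you want.

```python
def format_topics(subjects, prefix="[*] "):
    """Return the unique, non-empty subjects, sorted and prefixed.

    `subjects` is any iterable of strings, for example the `.text`
    of each span returned by find_all().
    """
    # Strip surrounding whitespace so blank-looking subjects are dropped too.
    cleaned = (s.strip() for s in subjects)
    # dict.fromkeys keeps one entry per distinct subject.
    unique = dict.fromkeys(s for s in cleaned if s)
    # Sort case-insensitively (an assumption), then add the prefix.
    return [prefix + s for s in sorted(unique, key=str.casefold)]


# Hypothetical input, shaped like the text of the scraped spans.
sample = ["", "Cigarettes vs. Cannabis", "Cigarettes vs. Cannabis",
          "  ", "antimatter storage", "Black holes"]

for line in format_topics(sample):
    print(line)
```

Keeping the formatting in a small function like this makes it easy to check the ordering separately from the scraping itself.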
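Answer 1 (score: 0)
Use a set comprehension together with the :not(:empty) CSS pseudo-class. The set removes the duplicates and the selector skips the empty spans. When I ran it the output happened to come out in alphabetical order, but a set does not guarantee any order and has no sort method, so pass it to sorted() if you need the order to be reliable.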
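import requests
from bs4 import BeautifulSoup as bs

url = "https://boards.4channel.org/sci/"
r = requests.get(url)
soup = bs(r.content, "lxml")  # the "lxml" parser requires the lxml package
# the set comprehension drops duplicate subjects; :not(:empty) skips empty spans
data = {"[*] " + item.text for item in soup.select('.subject:not(:empty)')}
# a set has no .sort(); sorted() returns an ordered list
for line in sorted(data):
    print(line)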
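To see what the selector does without requesting the live page, here is a small sketch that runs the same idea against a fixed HTML string. The SAMPLE_HTML markup and the subject_lines helper are invented for illustration, and it uses the built-in "html.parser" rather than "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded board page.
SAMPLE_HTML = """
<div>
  <span class="subject"></span>
  <span class="subject">Cigarettes vs. Cannabis</span>
  <span class="subject">Cigarettes vs. Cannabis</span>
  <span class="subject">Antimatter storage</span>
  <span class="subject"></span>
</div>
"""


def subject_lines(html):
    """Return sorted, de-duplicated '[*] ' lines for the non-empty subjects."""
    soup = BeautifulSoup(html, "html.parser")
    # :not(:empty) skips the spans that have no text at all;
    # the set comprehension removes repeated subjects.
    unique = {item.get_text() for item in soup.select(".subject:not(:empty)")}
    # sorted() turns the unordered set into an ordered list.
    return ["[*] " + text for text in sorted(unique)]


for line in subject_lines(SAMPLE_HTML):
    print(line)
```

Sorting the bare subject text first and adding the prefix afterwards keeps the ordering independent of whatever prefix is chosen.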