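Python requests response is missing part of the content

Date: 2018-11-21 02:30:31

Tags: python web-scraping beautifulsoup request web-crawler

I am scraping job content from a website (https://www.104.com.tw/job/?jobno=66wee). When I send the request, only part of the content inside the "p" element is returned. I need the whole div class="content" section.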

My code:

  import requests
  from bs4 import BeautifulSoup

  payload = {'jobno': '66wee'}
  headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
  r = requests.get('https://www.104.com.tw/job/', params=payload, headers=headers)
  soup = BeautifulSoup(r.text, 'html.parser')
  contents = soup.findAll('div', {'class': 'content'})
  description = contents[0].findAll('p')[0].text.strip()
  print(description)

The result is missing the job description part, even though that part is present in the page's HTML.

3 answers:

Answer 0 (score: 0)

You are accessing only the first p element, because of the second [0] index:

description = contents[0].findAll('p')[0].text.strip()

You should iterate over all of the p elements:

description = ""
for p in contents[0].findAll('p'):
    description += p.text.strip()

print(description)
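
As a minimal offline sketch of the difference, the snippet below parses a small made-up HTML fragment instead of the live page. The markup and the `SAMPLE_HTML` name are assumptions for illustration only; the real 104.com.tw page is structured differently. It also joins the paragraphs with newlines so that they do not run together, which is a small variation on the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the job page; the real page differs.
SAMPLE_HTML = """
<div class="content">
  <p>Job description line one.</p>
  <p>Job description line two.</p>
</div>
<div class="content">
  <p>Unrelated block.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
contents = soup.find_all("div", {"class": "content"})

# Indexing the p list with [0] keeps only the first paragraph.
first_only = contents[0].find_all("p")[0].text.strip()

# Iterating collects every paragraph inside the first div.content.
all_paragraphs = "\n".join(p.text.strip() for p in contents[0].find_all("p"))

print(first_only)
print(all_paragraphs)
```

If the text is still missing after iterating, it may sit outside the p elements of the first div, in which case the choice of container is worth re-checking.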

Answer 1 (score: 0)

import requests
from bs4 import BeautifulSoup

payload = {'jobno': '66wee'}
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
r = requests.get('https://www.104.com.tw/job/',
                 params=payload, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
contents = soup.findAll('div', {'class': 'content'})
for content in contents[0].findAll('p')[0].text.splitlines():
    print(content)

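Answer 2 (score: 0)

The first element with class content contains more than this, but assuming you only want everything up to the end of point 4, i.e. the first child p tag, you can use a descendant combinator with a class selector for the parent element and a type (element) selector for the child. If you do want everything, remove the p from the selector.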

import requests
from bs4 import BeautifulSoup

url = 'https://www.104.com.tw/job/?jobno=66wee'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
s = soup.select_one('.content p').text
print(s)
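
The two selectors can be compared offline with a short sketch. The HTML fragment and the `SAMPLE_HTML` name here are hypothetical and only stand in for the real page; `html.parser` is used instead of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real page.
SAMPLE_HTML = (
    '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    '<div class="content"><p>Another block.</p></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# '.content p' matches p elements nested inside an element with
# class "content"; select_one returns only the first match.
first_p = soup.select_one(".content p").get_text(strip=True)

# Without the p, the whole first .content element is selected,
# so get_text returns the text of all of its paragraphs.
whole_block = soup.select_one(".content").get_text(separator="\n", strip=True)

print(first_p)
print(whole_block)
```

Note that `select_one` returns `None` when nothing matches, so a real scraper would probably check the result before calling `get_text`.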