Trying to scrape a page, but data is missing

Time: 2020-10-01 16:22:37

Tags: python web-scraping beautifulsoup

index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code' + index_cd + '&page=' + str(page_n)

from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1]  # this is XPath
source.find_all('table')[0].find_all('tr')[2].find_all('td')[0]

I expected the output to be: <td class="date">2020.09.29</td>

But the actual result is: <td class="date"> </td>

There is a '\xa0' between <td class="date"> and </td>.

I need to extract that date. How can I fix this?

1 Answer:

Answer 0: (score: 1)

The problem is with the URL you provided: you missed the = after code, so the index code never reaches the server as a query parameter.

Change the URL to:

naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)

Here is the working code:

index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)

from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1]  # this is XPath
print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0])

Output:

<td class="date">2020.09.29</td>

If you only want the date, change the last line to:

print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0].text)
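
When looping over many rows, a cell may contain only a non-breaking space (the '\xa0' seen in the question), so it can help to normalize the value returned by .text before using it. The following is a minimal sketch, not part of the original answer; the helper name cell_text_or_none is hypothetical.

```python
def cell_text_or_none(raw_text):
    """Return the stripped cell text, or None when the cell is effectively empty.

    raw_text is assumed to be the string taken from a tag's .text attribute.
    """
    # str.strip() with no arguments removes all Unicode whitespace,
    # including the non-breaking space '\xa0' that &nbsp; is parsed into.
    cleaned = raw_text.strip()
    return cleaned or None


# Usage with plain strings standing in for td.text values:
print(cell_text_or_none('\xa0'))            # an empty cell
print(cell_text_or_none(' 2020.09.29\n'))   # a cell holding a date
```

With the code above, one would call cell_text_or_none(td.text) for each cell and skip the rows where it returns None.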

Hope this helps!
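
As a side note, one way to avoid dropping the = when concatenating strings by hand is to let the standard library build the query string. This is only a sketch: the function name build_index_url and the constant BASE_URL are made up for illustration, while the URL, parameter names, and values come from the question.

```python
from urllib.parse import urlencode

# Base address taken from the question, without the query string.
BASE_URL = 'http://finance.naver.com/sise/sise_index_day.nhn'


def build_index_url(index_cd, page_n):
    """Build the daily index URL; urlencode inserts the '=' and '&' separators."""
    query = urlencode({'code': index_cd, 'page': page_n})
    return BASE_URL + '?' + query


naver_index = build_index_url('KPI200', 1)
print(naver_index)
# http://finance.naver.com/sise/sise_index_day.nhn?code=KPI200&page=1
```

The resulting naver_index can be passed to urlopen exactly as in the working code.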