Is the HTML on this site scrapable? http://www.customs.go.jp/toukei/srch/indexe.htm?M=03&P=1,2,,,,,,,,1,0,2018,0,5,0,2,271111,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50
In the Chrome DevTools Network tab, filtered to "Doc", seven .htm documents load, all via GET except one. That one POST request's response contains the HTML-formatted data I want to access (the file is called JCWSV03). Unfortunately, when I run the request myself, the HTML I get back is different from the HTML displayed on the page.
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests as rq
import urllib.request

url = 'http://www.customs.go.jp/toukei/srch/indexe.htm?M=01&P=1,2,,,,,,,,1,0,2017,0,3,0,2,271111000,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50'

# GET the page with urllib, then POST to the same URL with requests
sauce = urllib.request.urlopen(url).read().decode('utf-8')
soup = bs(sauce, 'lxml')
r2 = rq.post(url)
soup, r2.text  # both return the frameset shell, not the statistics data
This also does not work:
url2 = 'http://www.customs.go.jp/toukei/srch/jccht00p.htm'
headers = {'Referer': 'http://www.customs.go.jp/toukei/srch/jccht03e.htm?&P=1,2,,,,,,,,1,0,2018,0,5,0,2,271111,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50',
           'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
r3 = rq.post(url2, headers=headers)  # pass the dict as headers=, not params=
r3.text
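A side note on the attempt above: in requests, `params=` is appended to the URL's query string, while `headers=` sets HTTP request headers, so passing a header dict as `params=` sends the headers as query parameters instead. A minimal sketch (reusing the question's URL; no request is actually sent, we only inspect the prepared request):

```python
import requests

# Build but do not send a request, to show where params= and headers= end up.
req = requests.Request(
    "POST",
    "http://www.customs.go.jp/toukei/srch/jccht00p.htm",
    params={"M": "01"},
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = req.prepare()
print(prepared.url)                    # params land in the query string
print(prepared.headers["User-Agent"])  # headers land in the HTTP headers
```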
The rendered HTML is this:
(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Copyright (C) Ministry of Finance, The Japanese Government" name="copyright"/>
<meta content="NOINDEX,NOFOLLOW" name="robots"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<link href="jcc.css" rel="stylesheet" type="text/css"/>
<title>Trade Statistics ( Search ) :Trade Statistics of Japan Ministry of Finance</title>
</head>
<script language="JavaScript" src="display/jccjs00me.js"></script>
<script language="JavaScript">
<!--
window.onerror=null;
//-->
</script>
<body><noscript>
Unless it turns ON the Javascript function of a browser, search in a site cannot be performed.
</noscript>
<frameset cols="*">
<frame name="FR_M_INFO" src="tope.htm" title="TopPage"/>
</frameset>
</body></html>,
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\r\n<html lang="en">\r\n\t<head>\r\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\r\n\t\t<meta name="copyright" content="Copyright (C) Ministry of Finance, The Japanese Government">\r\n\t\t<meta name="robots" content="NOINDEX,NOFOLLOW">\r\n\t\t<meta http-equiv="Content-Style-Type" content="text/css">\r\n\t\t<link href="jcc.css" rel="stylesheet" type="text/css">\r\n\t\t<title>Trade Statistics ( Search ) :Trade Statistics of Japan Ministry of Finance</title>\t\t\r\n\t</head>\r\n\t\r\n\t<SCRIPT LANGUAGE="JavaScript" SRC="display/jccjs00me.js"></SCRIPT>\r\n\t<SCRIPT LANGUAGE="JavaScript">\r\n\t<!--\r\n\t\twindow.onerror=null;\r\n\t//-->\r\n\t</SCRIPT>\r\n\t\r\n\t<noscript>\r\n\t\tUnless it turns ON the Javascript function of a browser, search in a site cannot be performed.\r\n\t</noscript>\r\n\t\r\n\t<FRAMESET COLS="*">\r\n\t\t<FRAME NAME="FR_M_INFO" SRC="tope.htm" title="TopPage">\r\n\t</FRAMESET>\r\n</html>\r\n')
Please advise!! (My end goal is to parse the HTML with bs4 and load it into pandas, looping over time periods.)
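For the bs4/pandas step itself, once the real data HTML is in hand, `pandas.read_html` can do the table extraction directly. A minimal sketch with an invented table; the column names and values are placeholders, not the customs site's actual layout:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML table standing in for the statistics page's markup.
html = """
<table>
  <tr><th>HS code</th><th>Value</th></tr>
  <tr><td>271111</td><td>12345</td></tr>
  <tr><td>271111000</td><td>67890</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element in the document.
df = pd.read_html(StringIO(html))[0]
print(df)
```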
Answer 0 (score: 0)
There is a "CSV download" button that produces this POST request. Replicate the request with curl and parse the CSV data:
POST /JCWSV03/servlet/JCWSV03 HTTP/1.1
Host: www.customs.go.jp
Connection: keep-alive
Content-Length: 1327
Cache-Control: max-age=0
Origin: http://www.customs.go.jp
Upgrade-Insecure-Requests: 1
DNT: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://www.customs.go.jp/JCWSV03/servlet/JCWSV03
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cookie: JSESSIONID=UBNQ8NK54PAUN3QUGUN5R3H2IK3QJJ9H7Q8DQ3VJV805T740E70SKKJ4DLI02000A8000000.JCWSV03_001; visid_incap_763612=S8FIHQm2Tgap/mXyryhoy+7RPlsAAAAAQUIPAAAAAACi+fyzQ2Gk1dOZsySNYdbt; incap_ses_208_763612=IwFUalbIKRMxrKSSFPjiAu/RPlsAAAAAZNe3OqD0RhBl1jCtr3682w==
If you need help doing this from Python with curl, leave a comment and I'll put something together.
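The same POST can be sketched in plain requests instead of curl. The url-encoded form body is not shown in the header dump above, so the string below is a placeholder; copy the real payload from the "Form Data" view in DevTools. No request is sent here, only prepared:

```python
import requests

# Placeholder body -- replace with the actual url-encoded payload
# captured from DevTools for the CSV-download POST.
form_body = "M=03&P=placeholder"

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.customs.go.jp/JCWSV03/servlet/JCWSV03",
    "User-Agent": "Mozilla/5.0",
}

req = requests.Request(
    "POST",
    "http://www.customs.go.jp/JCWSV03/servlet/JCWSV03",
    headers=headers,
    data=form_body,
)
prepared = req.prepare()

# requests.Session().send(prepared) would perform the request; the
# response text should then be CSV, loadable with pandas.read_csv.
print(prepared.method, prepared.url)
```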