Is the HTML on this site scrapable? http://www.customs.go.jp/toukei/srch/indexe.htm?M=03&P=1,2,,,,,,,,1,0,2018,0,5,0,2,271111,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50
In the Chrome DevTools Network tab, filtered to "Doc", seven .htm documents load, all via GET except one. That one POST request's response contains the HTML-formatted data I want to access (the file is called JCWSV03). Unfortunately, when I run the request myself, the HTML I get back is different from the HTML displayed on the page.
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests as rq
import urllib.request

url = 'http://www.customs.go.jp/toukei/srch/indexe.htm?M=01&P=1,2,,,,,,,,1,0,2017,0,3,0,2,271111000,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50'

# GET the page with urllib, then POST to the same URL with requests
sauce = urllib.request.urlopen(url).read().decode('utf-8')
soup = bs(sauce, 'lxml')
r2 = rq.post(url)
soup, r2.text  # both return the frameset shell, not the statistics data
This also does not work:
url2 = 'http://www.customs.go.jp/toukei/srch/jccht00p.htm'
headers = {'Referer': 'http://www.customs.go.jp/toukei/srch/jccht03e.htm?&P=1,2,,,,,,,,1,0,2018,0,5,0,2,271111,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,50',
           'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
r3 = rq.post(url2, headers=headers)  # pass the dict as headers=, not params=
r3.text
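A side note on the attempt above: in requests, `params=` is appended to the URL's query string, while `headers=` sets HTTP request headers, so passing a header dict as `params=` sends the headers as query parameters instead. A minimal sketch (reusing the question's URL; no request is actually sent, we only inspect the prepared request):

```python
import requests

# Build but do not send a request, to show where params= and headers= end up.
req = requests.Request(
    "POST",
    "http://www.customs.go.jp/toukei/srch/jccht00p.htm",
    params={"M": "01"},
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = req.prepare()
print(prepared.url)                    # params land in the query string
print(prepared.headers["User-Agent"])  # headers land in the HTTP headers
```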
The rendered HTML is this:
(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Copyright (C) Ministry of Finance, The Japanese Government" name="copyright"/>
<meta content="NOINDEX,NOFOLLOW" name="robots"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<link href="jcc.css" rel="stylesheet" type="text/css"/>
<title>Trade Statistics ( Search ) :Trade Statistics of Japan Ministry of Finance</title>
</head>
<script language="JavaScript" src="display/jccjs00me.js"></script>
<script language="JavaScript">
<!--
window.onerror=null;
//-->
</script>
<body><noscript>
Unless it turns ON the Javascript function of a browser, search in a site cannot be performed.
</noscript>
<frameset cols="*">
<frame name="FR_M_INFO" src="tope.htm" title="TopPage"/>
</frameset>
</body></html>,
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\r\n<html lang="en">\r\n\t<head>\r\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\r\n\t\t<meta name="copyright" content="Copyright (C) Ministry of Finance, The Japanese Government">\r\n\t\t<meta name="robots" content="NOINDEX,NOFOLLOW">\r\n\t\t<meta http-equiv="Content-Style-Type" content="text/css">\r\n\t\t<link href="jcc.css" rel="stylesheet" type="text/css">\r\n\t\t<title>Trade Statistics ( Search ) :Trade Statistics of Japan Ministry of Finance</title>\t\t\r\n\t</head>\r\n\t\r\n\t<SCRIPT LANGUAGE="JavaScript" SRC="display/jccjs00me.js"></SCRIPT>\r\n\t<SCRIPT LANGUAGE="JavaScript">\r\n\t<!--\r\n\t\twindow.onerror=null;\r\n\t//-->\r\n\t</SCRIPT>\r\n\t\r\n\t<noscript>\r\n\t\tUnless it turns ON the Javascript function of a browser, search in a site cannot be performed.\r\n\t</noscript>\r\n\t\r\n\t<FRAMESET COLS="*">\r\n\t\t<FRAME NAME="FR_M_INFO" SRC="tope.htm" title="TopPage">\r\n\t</FRAMESET>\r\n</html>\r\n')
Please advise!! (My end goal is to parse the HTML with bs4 and load it into pandas, looping over time periods.)
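For the bs4/pandas step itself, once the real data HTML is in hand, `pandas.read_html` can do the table extraction directly. A minimal sketch with an invented table; the column names and values are placeholders, not the customs site's actual layout:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML table standing in for the statistics page's markup.
html = """
<table>
  <tr><th>HS code</th><th>Value</th></tr>
  <tr><td>271111</td><td>12345</td></tr>
  <tr><td>271111000</td><td>67890</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element in the document.
df = pd.read_html(StringIO(html))[0]
print(df)
```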
Answer 0 (score: 0)
There is a "CSV download" button that produces this POST request. Replicate the request with curl and parse the CSV data:
POST /JCWSV03/servlet/JCWSV03 HTTP/1.1
Host: www.customs.go.jp
Connection: keep-alive
Content-Length: 1327
Cache-Control: max-age=0
Origin: http://www.customs.go.jp
Upgrade-Insecure-Requests: 1
DNT: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://www.customs.go.jp/JCWSV03/servlet/JCWSV03
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Cookie: JSESSIONID=UBNQ8NK54PAUN3QUGUN5R3H2IK3QJJ9H7Q8DQ3VJV805T740E70SKKJ4DLI02000A8000000.JCWSV03_001; visid_incap_763612=S8FIHQm2Tgap/mXyryhoy+7RPlsAAAAAQUIPAAAAAACi+fyzQ2Gk1dOZsySNYdbt; incap_ses_208_763612=IwFUalbIKRMxrKSSFPjiAu/RPlsAAAAAZNe3OqD0RhBl1jCtr3682w==
If you need help doing this from Python with curl, leave a comment and I'll put something together.
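The same POST can be sketched in plain requests instead of curl. The url-encoded form body is not shown in the header dump above, so the string below is a placeholder; copy the real payload from the "Form Data" view in DevTools. No request is sent here, only prepared:

```python
import requests

# Placeholder body -- replace with the actual url-encoded payload
# captured from DevTools for the CSV-download POST.
form_body = "M=03&P=placeholder"

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.customs.go.jp/JCWSV03/servlet/JCWSV03",
    "User-Agent": "Mozilla/5.0",
}

req = requests.Request(
    "POST",
    "http://www.customs.go.jp/JCWSV03/servlet/JCWSV03",
    headers=headers,
    data=form_body,
)
prepared = req.prepare()

# requests.Session().send(prepared) would perform the request; the
# response text should then be CSV, loadable with pandas.read_csv.
print(prepared.method, prepared.url)
```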