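I am a beginner at web programming, so apologies if this is very basic, but I could not find anything on Stack Overflow as specific as my problem. I have a lot of text files (10k) that I need to upload to this website, https://rostlab.org/services/nlsdb/, and then click "Evaluate NES / NLS". This triggers an SQL query and returns some information in table form. I then need to click the "CSV" button to download the file to my computer. Of course I don't want to upload each file manually, so I tried to generate the request with Python but could not get it to work. I have not even reached the point of getting the table in the response from the initial site, so downloading the CSV is a challenge I have yet to face: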
import requests
url = 'https://rostlab.org/services/nlsdb/query'
files = {'file-upload': ('some.txt', open('C:\\some.txt', 'rb'), 'text/plain')}
data = {'_token':'', 'input-data':'', 'query-sig2':''}
r = requests.post('https://rostlab.org/services/nlsdb/query', files=files, data=data)
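In response I get a large amount of text, and from the HTML I can recover an error code 500, so I am surely doing something wrong here that I cannot see. The POST request the website sends when I submit the file looks like this: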
**General**
Request URL:https://rostlab.org/services/nlsdb/query
Request Method:POST
Status Code:200 OK
Remote Address:131.159.28.73:443
Referrer Policy:no-referrer-when-downgrade
Response Headers
Cache-Control:no-cache, private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:2231
Content-Type:text/html; charset=UTF-8
Date:Thu, 08 Feb 2018 12:39:30 GMT
Keep-Alive:timeout=5, max=100
Server:Apache
Set-Cookie:nlsdb_session=eyJpdiI6IjZMRk03ZjRCNjBmU1JcL3Y0Vko4ZHFRPT0iLCJ2YWx1ZSI6Ikh2bHcyZHBuN25nNmx1QnRoOFlPMWhWU0RYdUpEdnAwbGtySWgwbDlDVElHZmRyNlBMeEdXT3ROSERcLzRRNDB2ZnVUQ2oyTDlmOVRHa3JNUUZJTnBkUT09IiwibWFjIjoiZWM3ZjFjYmQ2ZThkNmRlM2JmOTY5OWZiYWMxOTA4ZmZiZjcxZjU1ODJjNjU1ODgzYjczMmUxMGY1NGMwMjNlMCJ9; expires=Thu, 08-Feb-2018 14:39:30 GMT; Max-Age=7200; path=/; httponly
Set-Cookie:XSRF-TOKEN=eyJpdiI6IjExMjBaRHNmWHVLZTBzSURYZFwvUmF3PT0iLCJ2YWx1ZSI6InQyWUE5QzZEd2xmZU5rMjlyekV1Z2JcL3lGNkNvbHl1TnBHMVh5eWtLeWtNb3JHcTJJSFpyR0lDVkxNV2h2cGsrTUhYMGl3ZDBET0hucHdpNzV0YkRpdz09IiwibWFjIjoiNzcxODBhYjIzYjEzNDU1OTNhNGRhNjI3OTAxNWY1MjFkYjI5MWQ5NjgwNGE4ZjVmMzQzZThkNWUzZWE0YTgwYSJ9; expires=Thu, 08-Feb-2018 14:39:30 GMT; Max-Age=7200; path=/
Vary:Accept-Encoding
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.9
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:1943
Content-Type:multipart/form-data; boundary=----WebKitFormBoundary1tOuJdyWl1bn7H4X
Cookie:XSRF-TOKEN=eyJpdiI6IjZaWHdTa3FPYmNHbkxsNVpoUlE3T0E9PSIsInZhbHVlIjoiQWMraGlLekd1akkrc0RDTzNMRGNIcVFkVGdBNjZFa2h4XC8xcUI0VmtIVG9CTnVPNW1IUW55NU9iNGlGY0NCWkFkd0hDZnJOaXBaT3J0VHZTSXl6b1FBPT0iLCJtYWMiOiJmMjE3N2JkZDIyMjRkNTY3ZGE4MDhlNGY5OWJiMDAwYjNiNzYyNGJjMTc2YzA4NTQwODcxZTM3YjI0YjQ5MWUyIn0%3D; nlsdb_session=eyJpdiI6IjByb2dtS0Q1ekFBU1F0WURJUk8rWnc9PSIsInZhbHVlIjoiM3lMNFU5Y2hBXC9BVU0xT0RUNnhVaUJ0ckJ0RnB5QlJqbk15alNSNkM4MjhNTGd6TFwvR0dwd0ZpWE9pU3piekhWb3ZzQjNZYVQ4ODdHeUxUMVJWM0pwUT09IiwibWFjIjoiYTE1Y2Q2NmRlN2M4Yjc1MzEyZTQxYjcwMzVmYjNiNjA1YjdiNjU4ODkxZWJhM2JmYTAwYTk1MWNhZWNkNTczMiJ9
DNT:1
Host:rostlab.org
Origin:https://rostlab.org
Referer:https://rostlab.org/services/nlsdb/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36
Request Payload
------WebKitFormBoundary1tOuJdyWl1bn7H4X
Content-Disposition: form-data; name="_token"
GnjGT2Ejrrpo4Nlf2EbwtmLtY29GNFnoTJpl5z5o
------WebKitFormBoundary1tOuJdyWl1bn7H4X
Content-Disposition: form-data; name="input-data"
------WebKitFormBoundary1tOuJdyWl1bn7H4X
Content-Disposition: form-data; name="file-upload"; filename="some.txt"
Content-Type: text/plain
------WebKitFormBoundary1tOuJdyWl1bn7H4X
Content-Disposition: form-data; name="query-sig2"
sF4MZkIaMc1K9TPZ6uYJuQ
------WebKitFormBoundary1tOuJdyWl1bn7H4X--
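I think the data object is incorrect, but I cannot get it right, and omitting it does not seem to work either. Any suggestions on how to retrieve the data properly and then download the corresponding CSV file?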
Answer 0 (score: 0)
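The site uses cross-site request forgery (CSRF) tokens (the _token field) to protect against a common class of attacks. In addition, it uses a generated token as the value of the submit button (the query-sig2 field).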
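To be able to post anything, you need to use a session so that the cookies are retained, load the form page first, and send the tokens found in that form back together with your file.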
I used BeautifulSoup to parse the form page and extract the tokens:
from bs4 import BeautifulSoup
import requests

form_url = 'https://rostlab.org/services/nlsdb/'

# A session keeps the cookies set by the form page for the later POST
with requests.session() as sess:
    # Load the form page first to obtain fresh tokens
    response = sess.get(form_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    csrf_token = soup.find('input', {'name': '_token'})['value']
    submit_token = soup.find('button', id='submit-sig2')['value']
    action_url = soup.find('form', id='input-form')['action']

    # Send the tokens back together with the uploaded file
    data = {'_token': csrf_token, 'query-sig2': submit_token, 'input-data': ''}
    with open('C:\\some.txt', 'rb') as some_text:
        files = {'file-upload': ('some.txt', some_text, 'text/plain')}
        response = sess.post(action_url, data=data, files=files)

Note that I also extracted the action attribute of the form tag; it is best to stick to what the server tells us to use.
The code above produces a 200 OK response, with an HTML page that lists the matches in a table.
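
Since the results arrive as an HTML table, one option is to convert that table to CSV yourself rather than driving the "CSV" button. I have not checked how that button works on the site, so what follows is only a minimal sketch under assumptions: it assumes the response contains a single, well-formed table with no nested tables, and the names TableParser and table_to_csv as well as the sample rows are made up for illustration. It uses only the standard library:

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of the first <table> in an HTML document.

    Hypothetical helper: assumes explicit closing tags and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being built
        self._cell = None       # text fragments of the current cell
        self._in_table = False  # True while inside the first table
        self._done = False      # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == 'table':
            self._in_table = True
        elif self._in_table and tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ('td', 'th') and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'table' and self._in_table:
            self._done = True


def table_to_csv(html, out):
    """Write the first table found in `html` to the file object `out`.

    Returns the number of rows written (header row included).
    """
    parser = TableParser()
    parser.feed(html)
    parser.close()
    csv.writer(out).writerows(parser.rows)
    return len(parser.rows)


# Invented sample markup, standing in for response.text
sample = """
<html><body><table>
<tr><th>Signal</th><th>Type</th></tr>
<tr><td>EXAMPLE1</td><td>NLS</td></tr>
<tr><td>EXAMPLE2</td><td>NES</td></tr>
</table></body></html>
"""

buffer = io.StringIO()
row_count = table_to_csv(sample, buffer)
print(buffer.getvalue())
```

With the real response you would pass response.text instead of the sample and a file opened with open('some.csv', 'w', newline='') instead of the StringIO buffer. For the 10k files, the GET, token extraction and POST could be repeated inside the same session for each path, for example while iterating over pathlib.Path(folder).glob('*.txt'); whether the tokens can be reused across several posts is something I have not verified, so fetching the form page again for each file is the safer assumption.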