Getting HTML from a server using urllib

Time: 2016-02-17 12:13:59

Tags: python web-scraping

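I am trying to fetch data from a website that has a form, and I am currently using urllib for this. The form has three fields, so I need to supply those three values. However, the response I get back is the HTML of the empty form, whereas I expected a table corresponding to my input values. Here is my code snippet: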

// Make sure we got a filename on the command line.
if (process.argv.length < 3) {
   console.log('Usage: node ' + process.argv[1] + ' FILENAME');
   process.exit(1);
}
// Read dependencies.txt and split its contents into an array of lines.
var fs    = require('fs') , filename = process.argv[2];
var array = fs.readFileSync('dependencies.txt').toString().split('\n');
//console.log(array[0]);

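if (process.argv[3]) {
    var test = process.argv[3];
    for (var i = 0; i < array.length; i++) {
       // Match the first run of non-whitespace characters on the line.
       var line = /([^\s]+)/.exec(array[i]);
       // Skip blank lines, where exec() returns null.
       if (line && test == line[0]) {
          console.log(array[i]);
       }
    }
}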


fs.readFile(filename, 'utf8', function(err, data) {
   if (err) throw err;
   console.log('OK: ' + filename);
   console.log(data)
});

The values are the inputs supplied to the form. What am I doing wrong? Note that I am not sending a session ID.

2 Answers:

Answer 0: (score: 0)

Try this: http://wwwsearch.sourceforge.net/mechanize/

import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print(br.title())
print(response1.geturl())
print(response1.info())  # headers
print(response1.read())  # body

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm.
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
# Submit current form.  Browser calls .close() on the current response on
# navigation, so this closes response1
response2 = br.submit()

# print currently selected form (don't call .submit() on this, use br.submit())
print(br.form)

response3 = br.back()  # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though it was .close()d
response3.get_data()  # like .seek(0) followed by .read()
response4 = br.reload()  # fetches from server

Answer 1: (score: 0)

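Actually, the problem was that I wasn't sending enough information in the headers, so I switched to "requests". I split the request into two stages: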

  • First, get the cookie from the server
  • Second, fetch the data, sending the cookie received from the first request

The server then responded with the correct data I wanted.
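
For illustration, here is a minimal sketch of that two-stage flow. It is written with the standard library's urllib and http.cookiejar (Python 3) rather than requests, so it is not the exact code I ended up with; a requests.Session would carry the cookie between the two calls in the same way. The function name fetch_form_result, the URLs, the field names, and the User-Agent value are all placeholders I made up, not details of the real site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def fetch_form_result(form_url, submit_url, fields, headers=None):
    """Fetch the form page to pick up cookies, then POST the form with them."""
    # The cookie jar stores whatever cookies the server sets in stage one.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Placeholder headers (an assumption); the real site may expect others.
    opener.addheaders = list((headers or {"User-Agent": "Mozilla/5.0"}).items())

    # Stage 1: a plain GET so the server can hand out its session cookie.
    with opener.open(form_url) as resp:
        resp.read()

    # Stage 2: POST the form values; the opener attaches the stored cookie.
    body = urllib.parse.urlencode(fields).encode("ascii")
    with opener.open(submit_url, data=body) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

It would be called along the lines of fetch_form_result("http://www.example.com/form", "http://www.example.com/search", {"field1": "a", "field2": "b", "field3": "c"}), with the three real field names substituted. Passing data to the second open() call is what turns it into a POST.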