Regex for HTTP requests fails in some cases

Time: 2016-09-10 21:57:05

Tags: java regex http

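I'm writing a Java server that handles HTTP requests using only the Socket class, because my professor said we can't use HTTP libraries (the goal is to learn HTTP...). So I decided to process the requests with regular expressions. The first thing the code does is take every line of the request and join them into a single string, which I then run the pattern against. I only need to implement the following methods: GET, POST, PUT, HEAD, DELETE. I'm using the Postman app, a Google Chrome extension, to test my program. Here are some sample requests from Postman after I turned the message into a single string: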

GET:

GET / HTTP/1.1 Host: 127.0.0.1:15000 Connection: keep-alive Cache-Control: no-cache User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 Postman-Token: dd87e652-2b21-3632-30ad-ace26581d369 Accept: */* Accept-Encoding: gzip, deflate, sdch Accept-Language: en-US,en;q=0.8

POST without a body:

POST / HTTP/1.1 Host: 127.0.0.1:15000 Connection: keep-alive Content-Length: 0 Cache-Control: no-cache Origin: chrome-extension://fhbjgbiflinjbdggehcddcbncdddomop User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 Postman-Token: 8094b5ce-4b3d-cee7-2d10-f5dd2bc6b7b2 Accept: */* Accept-Encoding: gzip, deflate Accept-Language: en-US,en;q=0.8

POST with a body:

POST / HTTP/1.1 Host: 127.0.0.1:15000 Connection: keep-alive Content-Length: 9 Postman-Token: 3fb2f5e0-2df1-5af4-7853-e9de84648dd5 Cache-Control: no-cache Origin: chrome-extension://fhbjgbiflinjbdggehcddcbncdddomop User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 Content-Type: text/plain;charset=UTF-8 Accept: */* Accept-Encoding: gzip, deflate Accept-Language: en-US,en;q=0.8

And so on...

The patterns I wrote are:

    String somethingPattern = "(.*)?";

    String ipPattern = "(((2[0-4][0-9])|(25[0-5])|(1?[0-9]?[0-9]))\\.((2[0-4][0-9])|(25[0-5])|(1?[0-9]?[0-9]))\\.((2[0-4][0-9])|(25[0-5])|(1?[0-9]?[0-9]))\\.((2[0-4][0-9])|(25[0-5])|(1?[0-9]?[0-9]))|"+somethingPattern+")((:)\\d{3,})?"; // regex for ip varying from 0.0.0.0 to 255.255.255.255 or some string, followed or no by : and a port number 
    String objetoPattern = "([/?a-zA-Z0-9\\.\\-_]+)"; // regex for a linux path to a file, including only letters, numbers and -_.

    String connectionPattern = "(connection:\\s*"+somethingPattern+")?";
    String contentLenPattern = "(content-length:\\s*([0-9]+))?";
    String postmanTokenPattern = "(postman-token:\\s*"+somethingPattern+")?";
    String cacheControlPattern = "(cache-control:\\s*"+somethingPattern+")?";
    String originPattern = "(origin:\\s*"+somethingPattern+")?";
    String userAgentPattern = "(user-agent:\\s*"+somethingPattern+")?";
    String charsetPattern = "(charset="+somethingPattern+")?";
    String contentTypePattern = "(content-type:\\s*"+somethingPattern+";"+charsetPattern+")?";
    String acceptPattern = "(accept:\\s*"+somethingPattern+")?";
    String acceptEncodingPattern = "(accept-encoding:\\s*"+somethingPattern+")?";
    String acceptLanguagePattern = "(accept-language:\\s*"+somethingPattern+")?";


    // (?i) is for the case of coming get, Get, GET... etc...
    String pattern = "^(?i)(get|put|head|post|delete)\\s+?" + objetoPattern + "\\s+?HTTP/1.1\\s+?host:\\s+?" + ipPattern + "\\s+?" + connectionPattern + "\\s+?" + contentLenPattern + "\\s+?" + postmanTokenPattern + "\\s+?" + cacheControlPattern + "\\s+?" + originPattern + "\\s+?" + userAgentPattern + "\\s+?" + contentTypePattern + "\\s+?" + acceptPattern + "\\s+?" + acceptEncodingPattern + "\\s+?" + acceptLanguagePattern + "\\s+?$";

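The regex matches and groups correctly for most requests, but not for GET, HEAD, or POST without a body. I don't know why this happens. I added ? at the end of each header pattern to cover cases where, for example, origin, content-length, or a similar header is not present in the request. But even so, it doesn't match in those cases. Part of the matching code is: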

    Pattern r = Pattern.compile(pattern);
    Matcher m = r.matcher(in); // this in is the input string that is the request all joined in a single line string

    if(m.find()){
        // ......
    } else {
        System.out.println("Input didn't match");
    }

Edit: the part of the code that handles the input from the Socket:

    bufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream()));

    String in = "";
    while((msgDoSocket = bufferedReader.readLine()) != null){
        try {
            in += msgDoSocket + " ";
            if(msgDoSocket.isEmpty()){
                processaInput(in); // this calls the part that process regex
            }
        } catch (Exception ex) {
            Logger.getLogger(ServerThread.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

1 answer:

Answer 0: (score: 2)

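Header lines are separated by line breaks, and the headers are separated from the body (if there is one) by two consecutive line breaks, that is, an empty line. You should use a Scanner object and read the request line by line with nextLine(), which is easier than using a Matcher. (Scanner's default token delimiter is any whitespace, so use nextLine() rather than next().) You can then simply iterate over the lines. Once you have the headers, you can split each one at the first ':' to build a Map, instead of keeping a million separate variables to cover every possible header key. Then you just check the map's keys and values against what you sent.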
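As a rough illustration of the idea above, here is a minimal sketch that reads a request head with a Scanner and stores the headers in a Map. The class name `HeaderMapSketch`, the helper type `RequestHead`, and the sample request are made up for this example; it also assumes header names should be lower-cased so lookups are case-insensitive, and it silently skips lines without a colon.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;

class HeaderMapSketch {

    /** Hypothetical holder for the request line and the header map. */
    static final class RequestHead {
        String method;
        String target;
        String version;
        final Map<String, String> headers = new LinkedHashMap<>();
    }

    /** Reads the request line and the headers, stopping at the empty line. */
    static RequestHead parse(Scanner scanner) {
        if (!scanner.hasNextLine()) {
            throw new IllegalArgumentException("empty request");
        }
        // Request line: METHOD SP TARGET SP VERSION
        String[] parts = scanner.nextLine().trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed request line");
        }
        RequestHead head = new RequestHead();
        head.method = parts[0].toUpperCase(Locale.ROOT);
        head.target = parts[1];
        head.version = parts[2];

        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.isEmpty()) {
                break; // the empty line ends the header section
            }
            // Split at the FIRST colon only: values such as
            // "127.0.0.1:15000" contain colons themselves.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // assumption: ignore malformed header lines
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            head.headers.put(name, value);
        }
        return head;
    }

    public static void main(String[] args) {
        String raw = "GET / HTTP/1.1\r\n"
                + "Host: 127.0.0.1:15000\r\n"
                + "Cache-Control: no-cache\r\n"
                + "Accept: */*\r\n"
                + "\r\n";
        RequestHead head = parse(new Scanner(raw));
        System.out.println(head.method + " " + head.target + " " + head.version);
        System.out.println(head.headers);
    }
}
```

With a map, it no longer matters which headers are present or in what order they arrive, which appears to be what trips up a single regex that expects the headers in one fixed sequence.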

You can also use Fiddler / Wireshark to look at the raw request that Postman sends.

This answer uses a Reader and does the same thing.
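For the Reader-based variant, a minimal sketch could look like the following. It is not the code from the linked answer; the class name `ReaderRequestSketch` and both helper methods are invented for illustration. It reads headers up to the empty line and then reads the body using `Content-Length`. Note that `Content-Length` counts bytes while a Reader counts characters, so this sketch assumes a single-byte body encoding (for example, wrapping the socket stream in an `InputStreamReader` with ISO-8859-1).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class ReaderRequestSketch {

    /** Reads "Name: value" lines until the empty line that ends the headers. */
    static Map<String, String> readHeaders(BufferedReader reader) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':'); // first colon separates name and value
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(Locale.ROOT),
                            line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /**
     * Reads exactly Content-Length characters as the body (empty if absent).
     * Assumption: one char per byte; a malformed length throws
     * NumberFormatException.
     */
    static String readBody(BufferedReader reader, Map<String, String> headers)
            throws IOException {
        int length = Integer.parseInt(headers.getOrDefault("content-length", "0"));
        char[] buffer = new char[length];
        int offset = 0;
        while (offset < length) {
            int n = reader.read(buffer, offset, length - offset);
            if (n == -1) {
                break; // peer closed the connection early
            }
            offset += n;
        }
        return new String(buffer, 0, offset);
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the socket's input stream here.
        String raw = "POST / HTTP/1.1\r\n"
                + "Host: 127.0.0.1:15000\r\n"
                + "Content-Length: 9\r\n"
                + "\r\n"
                + "some body";
        BufferedReader reader = new BufferedReader(new StringReader(raw));
        String requestLine = reader.readLine();
        Map<String, String> headers = readHeaders(reader);
        String body = readBody(reader, headers);
        System.out.println(requestLine);
        System.out.println(headers);
        System.out.println("body = " + body);
    }
}
```

Reading the body by length rather than with `readLine()` matters on a live socket: a body usually has no trailing line break, so a line-oriented read can block until the client closes the connection.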