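What does the ^ symbol mean in a URL?
I need to scrape some link data from web pages, and I'm using a simple hand-written PHP scraper. The crawler usually works fine, but then I came across a URL like this: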
http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^
This URL works fine when entered in a browser, but my scraper can't retrieve the page. I get an "HTTP request failed" error.
Answer 0 (score: 8)
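Regarding ^ characters, see RFC 1738, Uniform Resource Locators (URL):
Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL.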
You could try URL-encoding the ^ characters.
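As a rough illustration of that suggestion, here is a minimal sketch in Python (the question's scraper is PHP, so treat this as a language-neutral idea rather than a drop-in fix). The helper name `encode_unsafe` and the constant `UNSAFE_CHARS` are made up for this example. It percent-encodes only the characters listed in the RFC 1738 passage above and leaves existing escapes such as `%20` untouched.

```python
# Characters named as "unsafe" in the RFC 1738 passage quoted above.
# (Hypothetical constant name, chosen for this sketch.)
UNSAFE_CHARS = '{}|\\^~[]`'


def encode_unsafe(url: str) -> str:
    """Percent-encode only the listed unsafe characters.

    Everything else, including escapes already present in the URL,
    is copied through unchanged.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE_CHARS else ch
        for ch in url
    )


url = (
    "http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2"
    "%20from%20table%20where%20recordid%3E=20^^^^^"
)
encoded = encode_unsafe(url)
print(encoded)
```

Whether the server accepts the encoded form is something you would still have to check against the real site.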
Answer 1 (score: 7)
Given the context, I'd guess they are a naive attempt at URL-encoding quotation marks.
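Answer 2 (score: 6)
The caret (^) is not a reserved character in URLs, so it should be acceptable to use as-is. However, if you are having problems, just replace it with its hex encoding, %5E.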
And yes, putting raw SQL in a URL is like a flashing neon sign that reads "exploit me!".
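Answer 3 (score: 4)
The caret is neither reserved nor "unreserved", which makes it an "unsafe character" in URLs. Such characters should never appear unencoded in a URL. From RFC 2396: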
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts.
The set of characters actually reserved within any given URI
component is defined by that component. In general, a character is
reserved if the semantics of the URI changes if the character is
replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics
of the URI, but this should not be done unless the URI is being used
in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
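To make the two character classes above concrete, here is a small Python sketch that sorts a character into "reserved", "unreserved", or "neither" using the sets from the quoted grammar. The function name `classify` and the label strings are invented for this example and do not come from the RFC. Note that "%" also lands in "neither", since it only appears in a URI as the start of an escape sequence.

```python
import string

# Sets transcribed from the RFC 2396 grammar quoted above.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK


def classify(ch: str) -> str:
    """Return which RFC 2396 class a single character falls into."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Neither class: per section 2.4 such data has to be escaped.
    return "neither"


for ch in "^~?a":
    print(repr(ch), classify(ch))
```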
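Answer 4 (score: 0)
The scraper may be using a regular expression to parse the URL, in which case the caret (^) would be interpreted as the start-of-line anchor. I also think URLs like these are bad practice because they expose the underlying database structure; whoever wrote this might want to consider some serious refactoring!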
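If that guess is right, the effect would look something like the following Python sketch (PHP's preg functions treat an unescaped caret as an anchor in the same way, but this is only an illustration, not the asker's actual code). The variable names are made up for the example.

```python
import re

fragment = "x7=3^^^^^select"

# Unescaped, "^" is an anchor: the pattern demands "start of string"
# right after the "3", which can never be satisfied.
unescaped = re.search("3^", fragment)

# Escaped, "^" is matched as a literal character.
escaped = re.search(re.escape("3^"), fragment)

print(unescaped)
print(escaped.group(0))
```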
HTH!