Is the following sufficient to prevent XSS inside HTML elements?
function XSS_encode_html ( $str )
{
    // Replace "&" first so the entities produced below are not escaped twice.
    $str = str_replace ( '&', "&amp;", $str );
    $str = str_replace ( '<', "&lt;", $str );
    $str = str_replace ( '>', "&gt;", $str );
    $str = str_replace ( '"', "&quot;", $str );
    $str = str_replace ( '\'', "&#x27;", $str );
    $str = str_replace ( '/', "&#x2F;", $str );
    return $str;
}
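Edit:
I did not use htmlspecialchars() because:
- It does not convert / (slash).
- It converts ' (single quote) to &#039; (or &apos;). According to OWASP, ' (single quote) should become &#x27; (call me pedantic), and &apos; is not recommended because it is not in the HTML specification.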
Answer 0 (score: 3)
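In the content of elements, the only character that can be harmful is the start-tag delimiter <, because it may denote the beginning of some markup declaration, whether that is a start tag, an end tag, or a comment. So that character should always be escaped.
The other characters do not necessarily need to be escaped in element content.
Quotes only need to be escaped inside tags, specifically in attribute values that are either wrapped in the same kind of quote or not quoted at all. Similarly, the markup declaration close delimiter > only needs to be escaped inside tags, and there only when it is used in an unquoted attribute value. However, escaping plain ampersands as well is recommended, so that they are not mistaken for the start of a character reference.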
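The difference between element content and attribute values can be sketched in PHP as follows. This is a minimal illustration, not the asker's code: escape_html() is a hypothetical helper name, while htmlspecialchars() and its flags are standard PHP. It assumes every attribute value is written inside double quotes.

```php
<?php
// Hypothetical helper: escape text for element content or a quoted attribute value.
function escape_html(string $text): string
{
    // ENT_QUOTES converts both double and single quotes.
    // ENT_SUBSTITUTE replaces invalid UTF-8 sequences instead of returning ''.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$input = '"><script>alert(1)</script>';

// Element content: once "<" is escaped, the input cannot open a tag.
$paragraph = '<p>' . escape_html($input) . '</p>';

// Attribute value: the value is always quoted, so escaping the quote
// character is enough to keep the input inside the attribute.
$field = '<input type="text" value="' . escape_html($input) . '">';

echo $paragraph, "\n", $field, "\n";
```

Escaping all five characters in every context does more than element content strictly needs, but it lets one helper serve both element content and quoted attributes.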
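Now, as for the reason to replace / as well: it might be due to a feature of SGML, the markup language HTML is adapted from, which allows a so-called null end-tag:
To see how null end-tags work in practice, consider their use with an element that can be defined as:
<!ELEMENT ISBN - - CDATA --ISBN number-- >
Instead of entering an ISBN number as:
<ISBN>0 201 17535 5</ISBN>
we can use the null end-tag option to enter the element in abbreviated form:
<ISBN/0 201 17535 5/
However, I have never seen any browser implement this feature. HTML's syntax rules have always been stricter than SGML's.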
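Another, more probable reason is the content model of the so-called raw text elements (script and style), which is plain text with the following restriction:
The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F).
This says that inside a raw text element such as script, an occurrence of </script/ would denote an end tag: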
<script>
alert(0</script/.exec("script").index)
</script>
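Although this is perfectly valid JavaScript code, the end tag would be denoted by </script/. Other than that, the / does not do any harm. And if you allow arbitrary input in a JavaScript context with only HTML escaping applied, you are already doomed.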
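For the JavaScript context, one common alternative to HTML escaping is to serialize the value with json_encode() and its JSON_HEX_* flags. The sketch below is an assumption about how that could look, not something taken from the answer; js_literal() is a hypothetical helper name, and JSON_THROW_ON_ERROR requires PHP 7.3 or later.

```php
<?php
// Hypothetical helper: serialize a PHP value as a literal for use inside <script>.
function js_literal($value): string
{
    // The JSON_HEX_* flags emit <, >, &, ' and " inside strings as \u00XX
    // escapes, so the result cannot contain "</script" or "<!--".
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}

$input = '</script/><script>alert(1)</script>';

echo '<script>var userInput = ' . js_literal($input) . ';</script>', "\n";
```

The browser's JavaScript parser turns the \u00XX escapes back into the original characters, so the script sees the unmodified string while the HTML parser never sees an end tag.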
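By the way, it does not matter which kind of character reference is used to escape these characters: named character references (i.e. entity references) or numeric character references, in either decimal or hexadecimal notation. They all refer to the same characters.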
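A quick way to check this equivalence is to decode each form with PHP's html_entity_decode() and compare the results. The helper name decode_reference() below is hypothetical. &apos; is left out on purpose, because with the HTML 4.01 document type used here it is not a recognized entity.

```php
<?php
// Three spellings of "<" and three spellings of "'".
$lessThan = ['&lt;', '&#60;', '&#x3C;'];
$apostrophe = ['&#39;', '&#039;', '&#x27;'];

// Hypothetical helper: decode a single character reference.
function decode_reference(string $reference): string
{
    // ENT_QUOTES is needed so that references to the single quote are decoded too.
    return html_entity_decode($reference, ENT_QUOTES, 'UTF-8');
}

foreach (array_merge($lessThan, $apostrophe) as $reference) {
    echo $reference, ' => ', decode_reference($reference), "\n";
}
```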
Answer 1 (score: 2)
You should use htmlspecialchars:
$str = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');
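Here is the documentation. It basically does what your function does, but it is already implemented and it is cleaner. However, it does not convert slashes or backslashes.
If you want to convert every character that has a named HTML entity, you can use htmlentities: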
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
It is documented here. If you only want to prevent XSS attacks and JS injection, I recommend the former, because it has much less overhead.
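The difference between the two functions can be seen on a string that contains a non-ASCII character. This is an illustrative sketch with a made-up input string, assuming the source file is saved as UTF-8:

```php
<?php
$input = "Café <b>'menu'</b> / 50%";

// Converts only &, <, > and (with ENT_QUOTES) both kinds of quote.
$special = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Additionally converts every character that has a named entity, such as "é".
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $special, "\n";
echo $entities, "\n";
```

Both calls leave the slash untouched, which is the gap the question points out. The single quote comes out as &#039;, a decimal reference to the same character as &#x27;.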
Answer 2 (score: 0)
You can use the stripslashes() function.
$str = stripslashes($str);
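Answer 3 (score: 0)
This is a long one, but I would feel remiss if I did not share it. All of the code is taken directly from various parts of the source code of Drupal's latest stable release and compiled into one place (shown below). It is a very effective way to prevent XSS attacks.
Example usage: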
$html = file_get_contents('http://example.com');
$output = filter_xss($html);
print $output;
Or:
$html = file_get_contents('http://example.com');
// Allow only <ul></ul>, <li></li>, and <p></p> tags.
$allowed_tags = array('ul', 'li', 'p');
$output = filter_xss($html, $allowed_tags);
print $output;
Here is the code needed to run the examples above:
/**
* Filters HTML to prevent cross-site-scripting (XSS) vulnerabilities.
*
* Based on kses by Ulf Harnhammar, see http://sourceforge.net/projects/kses.
* For examples of various XSS attacks, see: http://ha.ckers.org/xss.html.
*
* This code does four things:
* - Removes characters and constructs that can trick browsers.
* - Makes sure all HTML entities are well-formed.
* - Makes sure all HTML tags and attributes are well-formed.
* - Makes sure no HTML tags contain URLs with a disallowed protocol (e.g.
* javascript:).
*
* @param $string
* The string with raw HTML in it. It will be stripped of everything that can
* cause an XSS attack.
* @param $allowed_tags
* An array of allowed tags.
*
* @return
* An XSS safe version of $string, or an empty string if $string is not
* valid UTF-8.
*
* @see validate_utf8()
* @ingroup sanitization
*/
function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'blockquote', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) {
// Only operate on valid UTF-8 strings. This is necessary to prevent cross
// site scripting issues on Internet Explorer 6.
if (!validate_utf8($string)) {
return '';
}
// Store the text format.
_filter_xss_split($allowed_tags, TRUE);
// Remove NULL characters (ignored by some browsers).
$string = str_replace(chr(0), '', $string);
// Remove Netscape 4 JS entities.
$string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string);
// Defuse all HTML entities.
$string = str_replace('&', '&amp;', $string);
// Change back only well-formed entities in our whitelist:
// Decimal numeric entities.
$string = preg_replace('/&amp;#([0-9]+;)/', '&#\1', $string);
// Hexadecimal numeric entities.
$string = preg_replace('/&amp;#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string);
// Named entities.
$string = preg_replace('/&amp;([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string);
return preg_replace_callback('%
(
<(?=[^a-zA-Z!/]) # a lone <
| # or
<!--.*?--> # a comment
| # or
<[^>]*(>|$) # a string that starts with a <, up until the > or the end of the string
| # or
> # just a >
)%x', '_filter_xss_split', $string);
}
/**
* Processes an HTML tag.
*
* @param $m
* An array with various meaning depending on the value of $store.
* If $store is TRUE then the array contains the allowed tags.
* If $store is FALSE then the array has one element, the HTML tag to process.
* @param $store
* Whether to store $m.
*
* @return
* If the element isn't allowed, an empty string. Otherwise, the cleaned up
* version of the HTML element.
*/
function _filter_xss_split($m, $store = FALSE) {
static $allowed_html;
if ($store) {
$allowed_html = array_flip($m);
return;
}
$string = $m[1];
if (substr($string, 0, 1) != '<') {
// We matched a lone ">" character.
return '&gt;';
}
elseif (strlen($string) == 1) {
// We matched a lone "<" character.
return '&lt;';
}
if (!preg_match('%^<\s*(/\s*)?([a-zA-Z0-9]+)([^>]*)>?|(<!--.*?-->)$%', $string, $matches)) {
// Seriously malformed.
return '';
}
$slash = trim($matches[1]);
$elem = &$matches[2];
$attrlist = &$matches[3];
$comment = &$matches[4];
if ($comment) {
$elem = '!--';
}
if (!isset($allowed_html[strtolower($elem)])) {
// Disallowed HTML element.
return '';
}
if ($comment) {
return $comment;
}
if ($slash != '') {
return "</$elem>";
}
// Is there a closing XHTML slash at the end of the attributes?
$attrlist = preg_replace('%(\s?)/\s*$%', '\1', $attrlist, -1, $count);
$xhtml_slash = $count ? ' /' : '';
// Clean up attributes.
$attr2 = implode(' ', _filter_xss_attributes($attrlist));
$attr2 = preg_replace('/[<>]/', '', $attr2);
$attr2 = strlen($attr2) ? ' ' . $attr2 : '';
return "<$elem$attr2$xhtml_slash>";
}
/**
* Processes a string of HTML attributes.
*
* @return
* Cleaned up version of the HTML attributes.
*/
function _filter_xss_attributes($attr) {
$attrarr = array();
$mode = 0;
$attrname = '';
while (strlen($attr) != 0) {
// Was the last operation successful?
$working = 0;
switch ($mode) {
case 0:
// Attribute name, href for instance.
if (preg_match('/^([-a-zA-Z]+)/', $attr, $match)) {
$attrname = strtolower($match[1]);
$skip = ($attrname == 'style' || substr($attrname, 0, 2) == 'on');
$working = $mode = 1;
$attr = preg_replace('/^[-a-zA-Z]+/', '', $attr);
}
break;
case 1:
// Equals sign or valueless ("selected").
if (preg_match('/^\s*=\s*/', $attr)) {
$working = 1; $mode = 2;
$attr = preg_replace('/^\s*=\s*/', '', $attr);
break;
}
if (preg_match('/^\s+/', $attr)) {
$working = 1; $mode = 0;
if (!$skip) {
$attrarr[] = $attrname;
}
$attr = preg_replace('/^\s+/', '', $attr);
}
break;
case 2:
// Attribute value, a URL after href= for instance.
if (preg_match('/^"([^"]*)"(\s+|$)/', $attr, $match)) {
$thisval = filter_xss_bad_protocol($match[1]);
if (!$skip) {
$attrarr[] = "$attrname=\"$thisval\"";
}
$working = 1;
$mode = 0;
$attr = preg_replace('/^"[^"]*"(\s+|$)/', '', $attr);
break;
}
if (preg_match("/^'([^']*)'(\s+|$)/", $attr, $match)) {
$thisval = filter_xss_bad_protocol($match[1]);
if (!$skip) {
$attrarr[] = "$attrname='$thisval'";
}
$working = 1; $mode = 0;
$attr = preg_replace("/^'[^']*'(\s+|$)/", '', $attr);
break;
}
if (preg_match("%^([^\s\"']+)(\s+|$)%", $attr, $match)) {
$thisval = filter_xss_bad_protocol($match[1]);
if (!$skip) {
$attrarr[] = "$attrname=\"$thisval\"";
}
$working = 1; $mode = 0;
$attr = preg_replace("%^[^\s\"']+(\s+|$)%", '', $attr);
}
break;
}
if ($working == 0) {
// Not well formed; remove and try again.
$attr = preg_replace('/
^
(
"[^"]*("|$) # - a string that starts with a double quote, up until the next double quote or the end of the string
| # or
\'[^\']*(\'|$)| # - a string that starts with a quote, up until the next quote or the end of the string
| # or
\S # - a non-whitespace character
)* # any number of the above three
\s* # any number of whitespaces
/x', '', $attr);
$mode = 0;
}
}
// The attribute list ends with a valueless attribute like "selected".
if ($mode == 1 && !$skip) {
$attrarr[] = $attrname;
}
return $attrarr;
}
/**
* Processes an HTML attribute value and strips dangerous protocols from URLs.
*
* @param $string
* The string with the attribute value.
* @param $decode
* (deprecated) Whether to decode entities in the $string. Set to FALSE if the
* $string is in plain text, TRUE otherwise. Defaults to TRUE.
*
* @return
* Cleaned up and HTML-escaped version of $string.
*/
function filter_xss_bad_protocol($string, $decode = TRUE) {
// Get the plain text representation of the attribute value (i.e. its meaning).
if ($decode) {
$string = decode_entities($string);
}
return check_plain(strip_dangerous_protocols($string));
}
/**
* Strips dangerous protocols (e.g. 'javascript:') from a URI.
*
* @param $uri
* A plain-text URI that might contain dangerous protocols.
*
* @return
* A plain-text URI stripped of dangerous protocols. As with all plain-text
* strings, this return value must not be output to an HTML page without
* check_plain() being called on it. However, it can be passed to functions
* expecting plain-text strings.
*
*/
function strip_dangerous_protocols($uri) {
static $allowed_protocols;
if (!isset($allowed_protocols)) {
$allowed_protocols = array_flip(array('ftp', 'http', 'https', 'irc', 'mailto', 'news', 'nntp', 'rtsp', 'sftp', 'ssh', 'tel', 'telnet', 'webcal'));
}
// Iteratively remove any invalid protocol found.
do {
$before = $uri;
$colonpos = strpos($uri, ':');
if ($colonpos > 0) {
// We found a colon, possibly a protocol. Verify.
$protocol = substr($uri, 0, $colonpos);
// If a colon is preceded by a slash, question mark or hash, it cannot
// possibly be part of the URL scheme. This must be a relative URL, which
// inherits the (safe) protocol of the base document.
if (preg_match('![/?#]!', $protocol)) {
break;
}
// Check if this is a disallowed protocol. Per RFC2616, section 3.2.3
// (URI Comparison) scheme comparison must be case-insensitive.
if (!isset($allowed_protocols[strtolower($protocol)])) {
$uri = substr($uri, $colonpos + 1);
}
}
} while ($before != $uri);
return $uri;
}
/**
* Encodes special characters in a plain-text string for display as HTML.
*
* Also validates strings as UTF-8 to prevent cross site scripting attacks on
* Internet Explorer 6.
*
* @param $text
* The text to be checked or processed.
*
* @return
* An HTML safe version of $text, or an empty string if $text is not
* valid UTF-8.
*
* @see validate_utf8()
* @ingroup sanitization
*/
function check_plain($text) {
return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
/**
* Decodes all HTML entities (including numerical ones) to regular UTF-8 bytes.
*
* Double-escaped entities will only be decoded once ("&amp;lt;" becomes "&lt;"
* , not "<"). Be careful when using this function, as decode_entities can
* revert previous sanitization efforts (&lt;script&gt; will become <script>).
*
* @param $text
* The text to decode entities in.
*
* @return
* The input $text, with all HTML entities decoded once.
*/
function decode_entities($text) {
return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}
/**
* Checks whether a string is valid UTF-8.
*
* All functions designed to filter input should use validate_utf8
* to ensure they operate on valid UTF-8 strings to prevent bypass of the
* filter.
*
* When text containing an invalid UTF-8 lead byte (0xC0 - 0xFF) is presented
* as UTF-8 to Internet Explorer 6, the program may misinterpret subsequent
* bytes. When these subsequent bytes are HTML control characters such as
* quotes or angle brackets, parts of the text that were deemed safe by filters
* end up in locations that are potentially unsafe; An onerror attribute that
* is outside of a tag, and thus deemed safe by a filter, can be interpreted
* by the browser as if it were inside the tag.
*
* The function does not return FALSE for strings containing character codes
* above U+10FFFF, even though these are prohibited by RFC 3629.
*
* @param $text
* The text to check.
*
* @return
* TRUE if the text is valid UTF-8, FALSE if not.
*/
function validate_utf8($text) {
if (strlen($text) == 0) {
return TRUE;
}
// With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings
// containing invalid UTF-8 byte sequences. It does not reject character
// codes above U+10FFFF (represented by 4 or more octets), though.
return (preg_match('/^./us', $text) == 1);
}
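The protocol check in strip_dangerous_protocols() is based on an allowlist of schemes. The same idea can be sketched on its own as a yes/no test. This is a simplified illustration, not part of the Drupal source: uri_scheme_is_allowed() and its default allowlist are hypothetical, and unlike the Drupal function it only reports whether a URI would pass instead of stripping the offending prefix.

```php
<?php
// Hypothetical helper: is the scheme of $uri on the allowlist?
function uri_scheme_is_allowed(string $uri, array $allowed = ['http', 'https', 'mailto']): bool
{
    $colon = strpos($uri, ':');
    if ($colon === false) {
        // No colon at all: a relative URL, which inherits the page's scheme.
        return true;
    }
    $prefix = substr($uri, 0, $colon);
    if (preg_match('![/?#]!', $prefix)) {
        // The colon sits in the path, query or fragment, so there is no scheme.
        return true;
    }
    // Scheme comparison is case-insensitive. Anything that is not exactly an
    // allowed scheme (including obfuscated forms with whitespace) is rejected.
    return in_array(strtolower($prefix), $allowed, true);
}

var_dump(uri_scheme_is_allowed('https://example.com/'));
var_dump(uri_scheme_is_allowed('JavaScript:alert(1)'));
```

Treating everything before the first colon as the candidate scheme, rather than matching a well-formed scheme pattern, means that variants such as a tab inside "javascript" fail the allowlist instead of being mistaken for a relative URL.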
Answer 4 (score: 0)
For Perl scripts or CGI, you can use HTML::Entities:
use HTML::Entities;
$str = encode_entities($str, '<>&"');