
Escaping HTML in Java


HTML uses some special characters to control how a page is displayed. These characters need to be escaped before being placed on a page if they are to be displayed as part of the page content (and not to control how the page appears). This is similar to the way double quote characters inside a C/C++ string literal have to be escaped for the code to compile. Therefore, a web application needs to escape all user input before rendering it back to the user as HTML.

There are two ways to deal with this, each with its own strengths and weaknesses:

  1. Filtering – throwing away all characters that are not in the set of acceptable input characters.
  2. Escaping special characters – escaping all special characters by turning them into their respective HTML entities.
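
To make the difference concrete, here is a rough sketch of both approaches in Java. The class name InputSanitizer, its method names, and the particular whitelist of allowed characters are assumptions picked for illustration; a real application would choose its own set of acceptable characters.

```java
import java.util.regex.Pattern;

final class InputSanitizer {
    // Hypothetical whitelist: letters, digits, space and some basic punctuation.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^A-Za-z0-9 .,!?-]");

    // Approach 1: filtering throws away every character outside the whitelist.
    static String filter(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    // Approach 2: escaping keeps every character but turns the special ones
    // into their HTML entities.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With the input `<b>Hi & bye</b>`, filter returns `bHi  byeb` (the markup characters are lost, and so is the ampersand), while escape returns `&lt;b&gt;Hi &amp; bye&lt;/b&gt;`, which a browser displays exactly as the user typed it. Filtering is lossy but simple to reason about; escaping preserves the input but has to be applied everywhere the text is rendered.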

In this short article we will cover the second way. Note, however, that effectively protecting against Cross-Site Scripting (XSS) vulnerabilities may require a combination of both approaches. Automated testing is key to ensuring that your application handles all input correctly and leaves no opening for malicious input.

In PHP there is a very useful function, htmlentities, which converts all the potentially risky characters to HTML entities. Java does not have a built-in library to do this, but Apache Commons Lang offers the StringEscapeUtils class as a separate download. It provides two methods for handling HTML, escapeHtml and unescapeHtml, which you can use to do the same thing.

However, if you look at the code, you will notice that it targets an older version of Java, from before support for regular expressions was added. This makes the code more complicated than it needs to be. Therefore, I created my own utility class to encode and decode HTML entities. It is easy to extend and modify.
Without further ado, here is the code:
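
What follows is a minimal sketch of what a regular-expression-based utility of this kind could look like. It is an illustration under stated assumptions, not the original listing: the class name HtmlEntities, the method names encode and decode, and the small entity table are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility class; names and entity table are assumptions.
final class HtmlEntities {
    private static final Map<String, String> ENCODE = new LinkedHashMap<>();
    private static final Map<String, String> DECODE = new LinkedHashMap<>();

    static {
        // To extend the class, add more character/entity pairs here.
        add("&", "&amp;");
        add("<", "&lt;");
        add(">", "&gt;");
        add("\"", "&quot;");
        add("'", "&#39;");
    }

    // Both patterns are derived from the tables above, so they stay in sync.
    private static final Pattern SPECIAL = alternation(ENCODE.keySet());
    private static final Pattern ENTITY = alternation(DECODE.keySet());

    private static void add(String character, String entity) {
        ENCODE.put(character, entity);
        DECODE.put(entity, character);
    }

    // Builds a pattern matching any one of the given literal strings.
    private static Pattern alternation(Iterable<String> literals) {
        StringBuilder regex = new StringBuilder();
        for (String literal : literals) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(literal));
        }
        return Pattern.compile(regex.toString());
    }

    // Replaces each special character with its HTML entity.
    static String encode(String text) {
        return replace(SPECIAL, ENCODE, text);
    }

    // Replaces each known entity with the character it stands for.
    static String decode(String html) {
        return replace(ENTITY, DECODE, html);
    }

    // Single left-to-right pass: every match is looked up in the table.
    private static String replace(Pattern pattern, Map<String, String> table, String input) {
        Matcher matcher = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = table.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

In this sketch, `HtmlEntities.encode("<a href=\"x\">Tom & Jerry's</a>")` yields `&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;`, and passing that result to decode gives back the original string. Because each replacement is made in a single pass over the input, already-replaced text is never scanned again: decode turns `&amp;lt;` into `&lt;` rather than into `<`.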
