CodeUtil

HTML Entity Encoding - Preventing XSS Attacks

Learn how HTML entity encoding protects your web applications from Cross-Site Scripting (XSS) attacks. Understand character encoding, when to encode user input, framework auto-escaping, and common vulnerabilities developers miss.

2025-06-12 · 14 min
Related tool: HTML Encoder/Decoder

Use the tool alongside this guide for hands-on practice.

The XSS attack that made me paranoid

I'll never forget the first time I saw XSS exploited on a site I built. A comment field. Someone posted what looked like innocent text, but hidden inside was a script that stole session cookies. The client called me at 11 PM because users were getting logged out randomly. Turned out someone was hijacking sessions.

Cross-Site Scripting lets attackers inject malicious JavaScript into pages viewed by other users. When the browser renders that page, the injected code runs with full privileges - reading cookies, stealing tokens, modifying the page. It's been in the OWASP Top 10 forever, and I still catch it in code reviews regularly.

The three flavors of XSS

Not all XSS is created equal. Understanding the types helps you know where to look and how to defend.

  • Stored XSS: The payload lives in your database. Comment fields, user profiles, forum posts. Every visitor who loads the page executes the script. This is the nightmare scenario.
  • Reflected XSS: The payload comes from the URL or form submission. Attacker sends victim a malicious link. When they click, boom. Requires social engineering but still dangerous.
  • DOM-based XSS: The payload never hits your server. Client-side JavaScript reads location.hash or similar and unsafely writes it to the DOM. Server-side protections are useless here.
  • Stored XSS is worst - no user action needed beyond normal browsing
  • Reflected XSS needs victims to click sketchy links
  • DOM XSS bypasses your server completely - you might never see it in logs

How HTML encoding saves you

HTML entity encoding is your first line of defense. It converts dangerous characters into their HTML entity equivalents. The browser displays them as text instead of interpreting them as code.

When someone tries to inject <script>alert('gotcha')</script>, encoding turns it into visible text. The browser shows the literal characters instead of executing the script.

  • < becomes &lt; - no more opening tags
  • > becomes &gt; - no more closing tags
  • & becomes &amp; - prevents entity injection
  • " becomes &quot; - can't break out of attributes
  • ' becomes &#x27; - handles single-quoted attributes
  • / becomes &#x2F; - extra safety for closing tags
  • That nasty script tag? Now it's just: &lt;script&gt;alert(&#x27;gotcha&#x27;)&lt;&#x2F;script&gt;
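The character table above can be sketched as a tiny helper. This is a minimal illustration with a function name of my own - in production, use your framework's built-in escaping or a vetted library rather than rolling your own:

```javascript
// Minimal HTML entity encoder covering the six characters listed above.
// A single-pass replace means nothing gets double-encoded.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#x27;",
  "/": "&#x2F;",
};

function escapeHtml(input) {
  return String(input).replace(/[&<>"'/]/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml("<script>alert('gotcha')</script>"));
// The browser renders the result as literal text, not executable code.
```

Note the single-pass `replace`: encoding `&` and `<` in separate sequential passes is a classic source of double-encoding bugs.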

Context matters - a lot

Here's where I see people mess up: HTML encoding isn't a silver bullet. WHERE you're inserting user data determines WHAT encoding you need.

  • HTML body (<p>USER_DATA</p>): HTML entity encoding works here
  • HTML attributes (<div title="USER_DATA">): Need attribute encoding plus ALWAYS quote your attributes
  • JavaScript strings (var name = "USER_DATA"): Need JavaScript encoding, not HTML
  • URL parameters (<a href="/search?q=USER_DATA">): Need URL encoding
  • CSS values (<div style="background: USER_DATA">): Need CSS encoding
  • NEVER put untrusted data directly in script tags, event handlers, or CSS
  • I've seen HTML encoding applied in JavaScript context - completely useless against XSS
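To make the context point concrete, here's what encoding for a JavaScript string context looks like, as opposed to HTML entity encoding. This is a sketch with a function name of my own - OWASP's encoder libraries implement this properly:

```javascript
// Encoding for a JavaScript string context: escape everything outside a
// conservative allowlist as \uXXXX, so the payload can't terminate the
// string or the surrounding <script> block. A sketch for illustration.
function encodeForJsString(input) {
  return String(input).replace(/[^a-zA-Z0-9,._ ]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

// A payload aimed at breaking out of: var name = "USER_DATA";
const payload = '";alert(1);//';
console.log(encodeForJsString(payload));
// The quote becomes \u0022, so it can no longer close the string literal.
```

HTML-encoding that payload instead would leave it corrupted or exploitable depending on where it lands - which is exactly why the context, not the data, dictates the encoder.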

Framework auto-escaping: trust but verify

Modern frameworks escape output by default, which is amazing. But I've seen developers assume they're protected when they're not. Know your framework's limits.

  • React: {userInput} is safe. dangerouslySetInnerHTML is exactly as dangerous as the name suggests.
  • Vue: {{ interpolation }} is safe. v-html bypasses everything.
  • Angular: {{ interpolation }} is safe. [innerHTML] is not.
  • Django: Template variables are escaped. |safe filter turns off protection.
  • Auto-escaping only protects HTML body context - not JS, URLs, or CSS
  • Those bypass functions (dangerouslySetInnerHTML, v-html, |safe)? Only use with DOMPurify.
  • Server-side escaping is great, but your client-side code needs to be safe too

The XSS patterns I keep finding

Even with frameworks, certain patterns keep causing XSS. These slip through code reviews because they don't look obviously dangerous.

  • innerHTML: document.getElementById("output").innerHTML = userInput; - I see this constantly
  • href attributes: <a href="javascript:alert(1)"> executes even with HTML encoding
  • Event handlers: <div onclick="handler(USER_DATA)"> needs JS encoding, not HTML
  • URL fragments: location.hash is attacker-controlled. DOM XSS central.
  • JSON in scripts: <script>var data = USER_JSON;</script> can break out with </script>
  • Template literals with innerHTML: elem.innerHTML = `${userInput}` - concatenation with extra steps
  • Third-party embeds without sandboxing
  • Markdown renderers allowing raw HTML

Input validation is not output encoding

I get asked all the time: "Can't I just filter out script tags?" No. These are different tools for different jobs.

  • Input validation: Restricts what enters your app (length, format, allowed characters)
  • Output encoding: Makes data safe when rendered in a specific context
  • Validation helps but can't prevent all XSS - legitimate text contains <, >, &
  • Encoding happens at output time - same data might appear in HTML, JS, URLs
  • Denylist filtering ("remove script tags") fails. Attackers have endless bypasses.
  • Allowlist validation ("only these characters allowed") is much stronger
  • I use both. Defense in depth.
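Here's what the allowlist half of that combination looks like in practice. The username rule below is an example policy of my own, not a recommendation:

```javascript
// Defense in depth, input side: allowlist validation for structured
// fields. The policy here (3-20 word characters) is illustrative only.
function isValidUsername(input) {
  // Rejects <, >, quotes, and everything else outside the allowlist.
  return /^\w{3,20}$/.test(input);
}

console.log(isValidUsername("martin_s")); // true - passes the allowlist
console.log(isValidUsername("<script>")); // false - rejected before storage
```

Validation alone can't cover free-text fields, where legitimate input like "AT&T" or "x < y" must be accepted - those still need output encoding.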

Encoding functions I actually use

Don't write your own encoding function. Seriously. Use what's battle-tested.

  • JavaScript: Use textContent instead of innerHTML when inserting text
  • For HTML you need to render: DOMPurify.sanitize(userHTML)
  • PHP: htmlspecialchars($string, ENT_QUOTES, "UTF-8") - always set charset!
  • Python: html.escape(string) or let Django handle it
  • Java: OWASP Java Encoder - Encode.forHtml(input)
  • C#: System.Web.HttpUtility.HtmlEncode() or let Razor do it
  • Ruby/Rails: ERB escapes by default; CGI.escapeHTML() for manual
  • Always specify the charset (UTF-8) explicitly - it prevents charset-sniffing XSS bypasses

Content Security Policy: your backup plan

CSP is like a seatbelt for XSS. Even if your encoding fails somewhere, CSP limits what attackers can do. I implement it on everything now.

  • script-src 'self' blocks inline scripts - that <script>alert(1)</script> won't run
  • Without 'unsafe-eval', no eval() attacks
  • Nonce-based CSP: Only scripts with your secret nonce execute
  • My strict CSP: default-src 'self'; script-src 'self' 'nonce-abc123';
  • Start with report-only mode to catch issues without breaking things
  • CSP is defense in depth - it's not a replacement for encoding
  • Test thoroughly - strict CSP can break legitimate features

How I test for XSS

Every project at Šikulovi s.r.o. gets XSS testing. Manual plus automated. Here's my process:

  • Manual: <script>alert(1)</script> in every input field. Boring but essential.
  • Test all contexts: Where does input appear? HTML, attributes, JS, URLs?
  • My favorite payloads: <img src=x onerror=alert(1)>, " onmouseover="alert(1)
  • Automated: OWASP ZAP, Burp Suite scan everything
  • Static analysis: Semgrep, ESLint security plugins catch dangerous patterns in code
  • DevTools: Inspect rendered HTML to verify encoding is actually applied
  • Bypass attempts: URL encoding, mixed case, unicode alternatives
  • DOM XSS needs client-side testing - server scanners miss it completely

Patterns that keep me safe

After enough XSS incidents, these patterns became muscle memory. I don't even think about them anymore.

  • textContent/innerText for text. innerHTML only with DOMPurify.
  • createElement + appendChild instead of building HTML strings
  • Framework bindings (React JSX, Vue templates) instead of manual DOM work
  • DOMPurify.sanitize() before dangerouslySetInnerHTML or v-html
  • new URL(path, baseURL) for building URLs safely
  • Template parameters instead of string concatenation
  • Everything from users is untrusted: URL params, cookies, headers - all of it
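The URL pattern is worth spelling out, since javascript: links slip right past HTML encoding. A sketch of the idea - the scheme allowlist is my own convention:

```javascript
// Building links safely: new URL() parses instead of concatenating,
// and an explicit scheme allowlist blocks javascript: and data: URLs.
// A sketch - the allowlist choice is an example policy.
function safeHref(userUrl, base = "https://example.com/") {
  try {
    const url = new URL(userUrl, base);
    if (url.protocol === "https:" || url.protocol === "http:") {
      return url.href;
    }
  } catch (e) {
    // unparseable input falls through to the inert fallback
  }
  return "#";
}

console.log(safeHref("/search?q=hello"));     // resolved against the base
console.log(safeHref("javascript:alert(1)")); // "#"
```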

The real defense is layers

No single technique stops all XSS. I use HTML encoding as the foundation, but it's part of a system: context-aware encoding, framework auto-escaping, Content Security Policy, and safe coding patterns.

Know where user data flows through your app. Encode appropriately at every output point. Don't bypass framework protections without sanitization. Add CSP. Test regularly. XSS is one of those vulnerabilities that never fully goes away - it just waits for the one place you forgot to encode.

FAQ

Is HTML encoding enough to prevent all XSS attacks?

HTML entity encoding prevents XSS in HTML body context but is insufficient alone. Different contexts (JavaScript, URLs, CSS, HTML attributes) require context-specific encoding. Additionally, some XSS vectors like javascript: URLs in href attributes execute despite HTML encoding. Use a combination of context-aware encoding, framework auto-escaping, Content Security Policy, and safe coding patterns for full protection.

Why does React have dangerouslySetInnerHTML if it is dangerous?

React escapes all values by default, but sometimes you genuinely need to render HTML—for example, content from a CMS, markdown rendering, or rich text editors. The deliberately alarming name reminds developers to sanitize content with a library like DOMPurify before using it. The name is a feature, not a bug—it forces conscious decisions about bypassing XSS protection.

What is the difference between encoding and sanitization?

Encoding converts special characters to safe representations (< becomes &lt;) and preserves all input—the user sees exactly what they typed. Sanitization removes or modifies dangerous content (strips script tags, removes event handlers) and may alter the input. Use encoding when displaying plain text; use sanitization when you need to allow some HTML formatting while blocking dangerous elements.

Can Content Security Policy replace HTML encoding?

No, CSP is defense in depth, not a replacement for encoding. CSP limits what scripts can execute if XSS occurs but does not prevent all XSS impacts—attackers can still modify page content, steal form data via CSS, or exfiltrate data through allowed endpoints. Some applications cannot use strict CSP due to legacy code. Always encode output properly and use CSP as an additional layer.

How do I handle user-generated HTML content safely?

Use a sanitization library like DOMPurify (JavaScript), Bleach (Python), or HTML Purifier (PHP). These libraries parse HTML and remove dangerous elements (script, iframe) and attributes (onclick, onerror) while preserving safe formatting. Configure the sanitizer with an allowlist of permitted tags and attributes. Never build your own HTML sanitizer—the edge cases are numerous and subtle.

Why do XSS attacks still occur in applications using modern frameworks?

Modern frameworks provide strong defaults, but developers bypass them: using dangerouslySetInnerHTML/v-html without sanitization, inserting user data into href or src attributes, building DOM with innerHTML, embedding user data in inline scripts, or misunderstanding which contexts the framework protects. Framework protection only works when developers understand and follow the secure patterns.

Should I encode user input on storage or on display?

Encode on display (output), not on storage (input). Store data in its original form and encode when rendering. This allows the same data to be safely rendered in different contexts (HTML, JSON, CSV) with appropriate encoding for each. Encoding on input causes double-encoding issues and makes it impossible to properly encode for contexts you did not anticipate at storage time.

What are the most commonly exploited XSS patterns?

The most exploited patterns include innerHTML assignments with user data, href attributes with javascript: URLs, event handler attributes containing user data, JSON embedded in script tags without encoding, URL fragments processed unsafely in JavaScript, and markdown or rich text rendering without sanitization. Code review should specifically check for these patterns.

Martin Šikula

Founder of CodeUtil. Web developer building tools I actually use. When I'm not coding, I experiment with productivity techniques (with mixed success).
