Before you start

To understand this article, you should be comfortable with using a browser. If you also know how to create and edit text files, you can test the examples shown in this article.

Most fundamentally, when you look at a webpage in a Web browser, you see words. But most of the time webpages contain styled text rather than plain text.  Nowadays, webpage designers have access to hundreds of different fonts, font sizes, colors, and even alphabets (e.g. Spanish, Japanese, Russian), and browsers can for the most part display them accurately. Webpages may also contain images, video clips, and background music. They may include drop-down menus, search boxes, or links you can follow to access other pages (whether on the same site or another site). Some websites even let visitors customize the page display to accommodate their preferences and challenges (e.g., sight challenges, deafness, or color blindness). Often a page contains content boxes that move (scroll) while the rest of the page remains static.

A typical webpage depends on several technologies (such as CSS, JavaScript, Flash, AJAX, JSON) to control what the end-user sees, but most fundamentally, developers write webpages in HTML, without which there can be no webpages. To display the page on the client-side device, a browser starts out by reading the HTML.

The W3C (World Wide Web Consortium) and the WHATWG (Web Hypertext Application Technology Working Group) maintains the HTML international standards and specifications. The WHATWG treats HTML as a constantly evolving "living standard", whereas the W3C works both on HTML evolution (HTML 5.1) and on "snapshots" of HTML (of which the most recent is HTML5).

The HTML specification defines a single language that can be written either with the relaxed HTML syntax or the stricter XML syntax (Extensible Markup Language). HTML also addresses the needs of Web apps. HTML only describes the meaning of the content, not style and formatting. To define appearance and layout, please use CSS, not explicit presentational HTML.

This article provides an introduction to HTML. If you've ever wondered what goes on behind the scenes in your web browser, this article is the place to start learning.

History of HTML

Tim Berners-Lee, physicist at CERN (the European Organization for Nuclear Research), devised a way in the late 1980s for scientists to share documents over the Internet. Before that, Internet communication had been limited to plain text, using technologies such as email, FTP (File Transfer Protocol), and Usenet-based discussion boards. HTML used a content model stored on a central server but transferrable to a local workstation and viewable in a browser, simplifying access to content and making "rich" content possible (such as sophisticated text formatting and images). HTML is derived from SGML, which is a complex syntax for marking up or binding of content (text or graphics) in documents; as of HTML5, HTML no longer attempts to adhere to SGML syntax.

What is HTML?

HTML is a markup language. The word markup was used by editors who marked up manuscripts (usually with a blue pencil) when giving instructions for revisions. "Markup" now means something slightly different: a language with specific syntax that instructs a Web browser how to display a page. Once again, HTML separates "content" (words, images, audio, video, and so on) from "presentation" (instructions for displaying each type of content). HTML uses a pre-defined set of elements to define content types. Elements contain one or more "tags" that contain or express content. Tags are enclosed by angle brackets, and the closing tag begins with a forward slash.

The basic HTML code structure is shown below:

<html>
<head>
    <title>Page title here</title>
</head>
<body>
    This is sample text...
    <!-- We use this syntax to write comments -->
    <!-- Page content and rest of the tage here... -->
    <!-- This is the actual area that gets shown in the browser -->
</body>
</html>

Most browsers allow the user to view the HTML of any webpage. In Firefox, for example, press Ctrl + U to view the page source. Beginners will find the code nearly unreadable for a complex page, but if you spend some time looking at the code for a simple page and comparing it to the page the code renders, you will soon develop a clear understanding of how the syntax works.

The paragraph element consists of the start tag "<p>" and the closing tag "</p>". The following example shows a paragraph contained within the HTML paragraph element. Remember that your browser will not display more than one space character in a row.

<p>You are beginning to learn HTML.</p>

When this content is displayed in a web browser, it looks like this:

The browser uses tags as an indicator of how to display the content in the tags.

Usually elements containing content can also contain other elements. For example, the emphasis element ("<em>") can be embedded within a paragraph element, to add emphasis to a word or phrase:

<p>You are <em>beginning</em> to learn HTML.</p>

When displayed, this looks like:

Some elements cannot contain other elements. For example, the image tag ("<img>") specifies the filename of the content (an image) as an attribute:

<img src="smileyface.jpg" alt="Smiley face" >

In XHTML (an XML schema that implements HTML elements), you'd often put a forward slash before the final angle bracket to indicate the ending of an empty element.

The rest of this article goes into greater detail regarding the concepts introduced in this section. However, if you want to see HTML in action, check out Mozilla Thimble, an interactive online editor that helps you learn how to write HTML. Also, see HTML Elements for a list of available elements and a description of their use.

Elements — the basic building blocks

HTML consists of a set of elements, which define the semantic meaning of their content. Elements include two matching tags and everything in between. For example, the "<p>" element indicates a paragraph; the "<img>" element indicates an image. See the HTML Elements page for a complete list. Note: Some tags are self-closing and do not contain any content.

Some elements have very precise meaning, as in "this is an image", "this is a heading", or "this is an ordered list." Others are less specific, such as "this is a section on the page" or "this is part of the text." Yet others are used for technical reasons, such as "this is identifying information for the page, so do not display it." Regardless, in one way or another all HTML elements have a semantic value.

Most elements may contain other elements, forming a hierarchical structure. A very simple but complete webpage looks like this:

<html>
  <head>
    <title>A minimal web page</title>
  </head>
  <body>
    <p>You are in the beginning stage of learning HTML.</p>
  </body>
</html>

As you can see, the <html> element surrounds the rest of the document, and the <body> element surrounds the page content. This structure is often thought of as a tree with branches (in this case, the <body> and <p> elements) growing from the trunk (<html>). This hierarchical structure is called the DOM (document object model).

Tags

HTML documents are written in plain text. You can write HTML in any text editor that lets you save content as plain text (e.g. Notepad, Notepad++, or Sublime Text),  but most HTML authors prefer to use a specialized editor that highlights syntax and shows the DOM. You may use uppercase for tag names, but the W3C (the global consortium that maintains the HTML standard) recommends using lower case (and XHTML requires lower case).

HTML attaches special meaning to anything that starts with the less-than sign ("<") and ends with the greater-than sign (">"). Such markup is called a tag. Make sure to close the tag, as some tags are closed by default, whereas others might produce unexpected errors if you forget the end tag. 

Here is a simple example:

<p>This is text within a paragraph.</p>

In this example there is a start tag and a closing tag. Closing tags are the same as the start tag except with a forward slash immediately after the leading less-than sign. Most elements in HTML are written using both start and closing tags. To write valid code, you must properly nest start and closing tags, that is, write close tags in the opposite order from the start tags.

This is an an example of valid code:

<em>I <strong>really</strong> mean that</em>.

This is an example of invalid code:

<!--Invalid:--> <em>I <strong>really</em> mean that</strong>.

In the valid example, the inner <strong> element is closed before the outer <em> element.

Some elements do not contain any text content or any other elements. These are empty elements and need no closing tag. This is an example:

<img src="smileyface.jpg" alt="Smiley face" >

Empty elements in XHTML mode are usually closed using a trailing forward slash.

<img src="smileyface.jpg" alt="Smiley face" />

In HTML this slash has a meaning that is not implemented in Firefox so it should not be used.  Do not close empty elements in HTML mode.

Attributes

The start tag may contain additional information, as in the preceding example. Such information is called an attribute. Attributes usually consist of 2 parts:

  • An attribute name
  • An attribute value

A few attributes can only have one value. They are Boolean attributes and may be shortened by only specifying the attribute name or leaving the attribute value empty. Thus, the following 3 examples have the same meaning:

<input required="required">

<input required="">

<input required>

Attribute values that consist of a single word or number may be written as they are, but you must enclose values containing spaces in quotation marks (either single (') or double (") quotes). Many developers always use quotes to avoid mistakes like this:

<p class=foo bar> <!--(Beware, this probably does not mean what you think it means.)-->

In this example the value was supposed to be "foo bar" but since there were no quotes the code is interpreted as if it had been written like this:

<p class="foo" bar="">

Named character references

Named character references (often casually called entities) are used to print characters that have a special meaning in HTML. For example, HTML interprets the less-than and greater-than symbols as tag delimiters. When you want to display a greater-than symbol in the text, you can use a named character reference. You should know these four common named character references:

  • &gt; denotes the greater-than sign (>)
  • &lt; denotes the less-than sign (<)
  • &amp; denotes the ampersand (&)
  • &quot; denotes double quote (")

There are many more entities, but these four are the most important because they represent characters that have a special meaning in HTML.

Comments and doctype

HTML has a mechanism for embedding comments that are not displayed when the page is rendered in a browser. This is useful for explaining a section of markup, leaving notes for other people who might work on the page, or for leaving reminders for yourself. HTML comments are enclosed in symbols as follows:

<!-- This is comment text -->

Besides tags, text content, and entities, an HTML document must contain a doctype declaration as the first line. The doctype declaration is not an HTML tag, but rather tells the browser which version of HTML the page is written in.

In HTML5, there is only one declaration and is written like this:

<!DOCTYPE html>

The doctype has a long and intricate history, but for now all you need to know is that this doctype tells the browser to interpret the HTML and CSS code according to W3C standards and not try to pretend that it is Internet Explorer from the 90s. (See quirks mode.)

In HTML 4.01, doctype referred to a DTD (Document Type Definition) as it was based on SGML. There were three different doctype declarations in HTML 4.01: strict, transitional, and frameset.

The strict DTD contained all HTML elements and attributes, but NOT presentational or deprecated elements (like font). Framesets were not allowed.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

The transitional DTD contained all HTML elements and attributes, INCLUDING presentational and deprecated elements (like font). Framesets were not allowed.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The frameset DTD allowed frameset content and otherwise was the same as HTML 4.01 Transitional.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">

 

A complete but small document

 

Putting this together, here is tiny example HTML document. You can copy this code to a text editor, save it as myfirstdoc.html, and load it in a browser. Make sure you are saving it using the character encoding UTF-8. Since this document uses no styling it will look very plain, but it's just a small start.

<!DOCTYPE html>
<html lang="en">
<head>
  <title>A tiny document</title>
</head>
<body>
  <h1>Main heading in my document</h1>
  <!-- Note that it is "h" + "1", not "h" + the letters "one" -->
  <p>Look Ma, I am coding <abbr title="Hyper Text Markup Language">HTML</abbr>.</p>
</body>
</html>