Web Development

Dr Derek Bridge
School of Computer Science & Information Technology
University College Cork

Lecture Objectives

learn that an HTML web page is a tree of elements
learn about the rules of HTML and how to validate your web page
learn how to choose the correct character set

Hierarchical Structure

Because elements are nested, web pages have a hierarchical structure.

Terminology:
tree
node
root    leaf
parent    child
ancestor    descendant
sibling

Hierarchical Structure


<!DOCTYPE html>
<html>
<head>
    <title>A simple document</title>
</head>
<body>
    <p>
        Some words.
    </p>
    <p>
        More words 
        <em>and emphasised words</em>
        and final words.
    </p>
</body>
</html>

The html node is the root of the tree and has head and body nodes as its children. The head node has a title node as its child, and the title node has a text node as its child. The body node here has two p nodes as children. The first p node has a text node as its child. The second p node has three children: a text node, an em node and another text node. The em node has a text node as its child.
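As a rough illustration of the tree described above, the following sketch uses Python's standard html.parser module to print an indented outline of the example document. The class name TreePrinter and the function outline are made up for this example, and the sketch assumes every start tag has a matching end tag, as is the case here.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect an indented outline of element and text nodes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An element node: record it, then go one level deeper for its children.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested input, so every end tag closes the current element.
        self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:  # ignore whitespace-only text between tags
            self.lines.append("  " * self.depth + '"' + text + '"')


DOCUMENT = """<!DOCTYPE html>
<html>
<head>
    <title>A simple document</title>
</head>
<body>
    <p>
        Some words.
    </p>
    <p>
        More words
        <em>and emphasised words</em>
        and final words.
    </p>
</body>
</html>
"""


def outline(source):
    printer = TreePrinter()
    printer.feed(source)
    printer.close()
    return printer.lines


print("\n".join(outline(DOCUMENT)))
```

Each level of indentation in the output corresponds to one parent-child step in the tree; text nodes are shown in quotes.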

HTML Syntax Rules

You should ensure your HTML is well-formed.

Only certain tags are permitted.
Only certain attributes and values are permitted.
- E.g. any element can have a lang attribute (it's a global attribute).
- But only a few elements can have an alt attribute (e.g. <img>).

More Syntax Rules

Often, only certain content is permitted.
- E.g. <p> elements cannot contain other <p> elements.
- E.g. <p> elements cannot contain <ul> or <ol> elements.
- E.g. <ul> and <ol> elements can only contain zero, one or more <li> elements.

Another HTML Syntax Rule

Nesting must be done properly.


<p>This is <em>correct</em></p>


<p>This is <em>incorrect</p></em>

Otherwise there is difficulty building the tree.
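To see why improper nesting makes tree-building difficult, here is a minimal sketch of a nesting check that keeps a stack of open elements. The names NestingChecker and is_properly_nested are invented for this example. It is deliberately simplistic: it ignores void elements written without a slash (such as <br>) and the optional end tags that HTML syntax allows.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open elements on a stack and flag end tags that close the wrong one."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # An end tag must close the most recently opened element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.ok = False


def is_properly_nested(source):
    checker = NestingChecker()
    checker.feed(source)
    checker.close()
    # Also require that nothing is left open at the end.
    return checker.ok and not checker.open_tags


print(is_properly_nested("<p>This is <em>correct</em></p>"))
print(is_properly_nested("<p>This is <em>incorrect</p></em>"))
```

In the second snippet, the </p> end tag arrives while em is still the innermost open element, so there is no single obvious parent for the text that follows.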

Syntax Rules

A web developer writes HTML to produce a nested list.

badgers
wombats
- common wombat
- hairy-nosed wombat
squirrels


<ul>
    <li>badgers</li>
    <li>wombats</li>    
        <ul>
            <li>common wombat</li>
            <li>hairy-nosed wombat</li>
        </ul>
    <li>squirrels</li>
</ul>

Which rule is broken?

Nested Lists Done Correctly

Incorrect


<ul>
    <li>badgers</li>
    <li>wombats</li>    
        <ul>
            <li>common wombat</li>
            <li>hairy-nosed wombat</li>
        </ul>
    <li>squirrels</li>
</ul>

Correct


<ul>
    <li>badgers</li>
    <li>wombats    
        <ul>
            <li>common wombat</li>
            <li>hairy-nosed wombat</li>
        </ul>
    </li>
    <li>squirrels</li>
</ul>
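The rule broken by the incorrect version is that a <ul> may only contain <li> elements as children. A small sketch of how that could be checked mechanically follows; ListContentChecker and list_problems are hypothetical names, and the sketch assumes the input is properly nested with explicit end tags. A real validator checks far more than this.

```python
from html.parser import HTMLParser


class ListContentChecker(HTMLParser):
    """Report elements that sit directly inside <ul> or <ol> but are not <li>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        if parent in ("ul", "ol") and tag != "li":
            self.problems.append(f"<{tag}> is a direct child of <{parent}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes properly nested input, so the end tag closes the top of the stack.
        if self.open_tags:
            self.open_tags.pop()


def list_problems(source):
    checker = ListContentChecker()
    checker.feed(source)
    checker.close()
    return checker.problems


INCORRECT = """<ul>
    <li>badgers</li>
    <li>wombats</li>
        <ul>
            <li>common wombat</li>
            <li>hairy-nosed wombat</li>
        </ul>
    <li>squirrels</li>
</ul>"""

CORRECT = """<ul>
    <li>badgers</li>
    <li>wombats
        <ul>
            <li>common wombat</li>
            <li>hairy-nosed wombat</li>
        </ul>
    </li>
    <li>squirrels</li>
</ul>"""

print(list_problems(INCORRECT))
print(list_problems(CORRECT))
```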

Two Sets of Syntax Rules!

In fact, HTML5 has two sets of syntax rules!

XML syntax uses a very strict set of rules.

HTML syntax allows you to break the rules… in certain cases.

XML Syntax

In XML syntax, e.g.:
tags must be in lowercase;
each start tag must have an end tag or, for void elements, an extra slash;
attribute values must be quoted;
…and so on.

Ironically, it's easier to use the strict XML syntax than the HTML syntax!
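One way to get a feel for the strictness of XML syntax is to hand small fragments to a generic XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the function name is_well_formed_xml is invented for this example. Note that a generic XML parser checks well-formedness only: it does not know that HTML's element names are lowercase, so the lowercase rule is not tested here.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if a generic XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Void element written with the extra slash: accepted.
print(is_well_formed_xml("<p>A line<br /></p>"))
# Void element without the slash: the <br> is never closed.
print(is_well_formed_xml("<p>A line<br></p>"))
# Quoted attribute value: accepted.
print(is_well_formed_xml('<p lang="en">Hello</p>'))
# Unquoted attribute value: rejected.
print(is_well_formed_xml("<p lang=en>Hello</p>"))
```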

Breaking the Rules

What does your browser do if your web page is not well-formed?

Browsers (almost) never give error messages.
They do their best to build the tree and display the page.

HTML Validation

If browsers don't give error messages, how do you know if your page is well-formed or not?

You can validate your page:
https://validator.nu/
https://validator.w3.org/

Character Sets

A character set is a collection of characters.
E.g. the ASCII character set is 128 characters, mostly from the modern Latin alphabet.
E.g. the Unicode character set is currently nearly 155,000 characters.

Coded Character Sets

A coded character set assigns a unique number to each distinct character.
E.g. in Unicode (and ASCII) 'A' is 65 and 'a' is 97 (decimal).
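The numbers assigned by a coded character set can be inspected directly. In Python, for instance, the built-in ord gives the Unicode number of a character and chr goes the other way; the variable name code_points is just for this sketch.

```python
# Map a few characters to their Unicode numbers (code points).
code_points = {ch: ord(ch) for ch in "Aa€"}

for ch, number in code_points.items():
    print(ch, number, hex(number))

# chr is the inverse of ord.
print(chr(65), chr(97))
```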

Character Encodings

A character encoding refers to the way the numbers are converted to bytes for storage and transmission.

ASCII	7 bits for every character
UTF-32	4 bytes for every character
UTF-8	1 byte for ASCII characters and 2, 3 or 4 for others
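The difference between the encodings can be checked by counting bytes. The following sketch encodes a few sample characters in UTF-8 and UTF-32; the function name byte_lengths and the choice of sample characters are assumptions of this example. It uses the "utf-32-be" codec so that no byte order mark is added to the count.

```python
def byte_lengths(ch):
    """Number of bytes needed for one character in UTF-8 and in UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-32": len(ch.encode("utf-32-be")),  # big-endian, no byte order mark
    }


for ch in "Aá€😀":
    print(ch, byte_lengths(ch))
```

UTF-32 always uses 4 bytes, while UTF-8 uses 1 byte for the ASCII character and 2, 3 and 4 bytes for the other three samples.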

Character Encoding of Web Pages

Browsers and the HTML validator need to know which character encoding was used to create your web page.

Find out which encoding your editor is using and specify that character encoding in a meta element in the <head> of your HTML, e.g.:


<meta charset="utf-8" />

Wrong Character Encoding

What happens if your editor uses one encoding but you specify a different one?
- Some characters may display as other characters.
- Some characters may display as �
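Both symptoms can be reproduced in a few lines. This sketch encodes the sample word "café" with one encoding and decodes the bytes with another; the sample word, the variable names and the choice of Latin-1 as the second encoding are assumptions of this example.

```python
text = "café"

# Saved as UTF-8 but read as Latin-1: é turns into two other characters.
utf8_read_as_latin1 = text.encode("utf-8").decode("latin-1")
print(utf8_read_as_latin1)

# Saved as Latin-1 but read as UTF-8: the byte for é is not valid UTF-8,
# so it is shown as the replacement character.
latin1_read_as_utf8 = text.encode("latin-1").decode("utf-8", errors="replace")
print(latin1_read_as_utf8)
```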

A Better Solution?

The Apache web server can be configured so that, when it serves a text file, it converts it, e.g., to UTF-8 — irrespective of its original character encoding.
And it specifies the new character encoding in the Content-Type HTTP header.
Browsers treat the HTTP header as more authoritative than the <meta> element.

`charset` in HTTP responses

An HTTP response can specify the charset in the Content-Type header, e.g.:

Content-Type: text/html; charset=utf-8
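If you want to pull the charset out of a Content-Type header value yourself, one possible approach in Python is to let the standard email.message.Message class parse the parameters. The function name charset_from_content_type and the sample header values are assumptions of this sketch.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```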

Chrome

Since version 55 of Chrome, Google doesn't even look at the charset tag anymore. It assumes utf-8, no matter what you have written.

Other browsers are following suit.

Reserved Characters

Some characters have a special meaning in HTML.
To display them in a web page, you may need to use their character references:

<	`&lt;`
>	`&gt;`

"	`&quot;`
&	`&amp;`
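When HTML is generated by a program, the reserved characters are usually replaced automatically. As an illustration, Python's standard html.escape does this; the sample string is made up for this sketch.

```python
import html

raw = '<a href="x">Fish & chips</a>'

# Replace the reserved characters with their character references.
escaped = html.escape(raw)
print(escaped)

# html.unescape reverses the replacement.
print(html.unescape(escaped) == raw)
```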

More Character References

Suppose you want to display a character that is not part of your character set or not easy to type on your keyboard.
E.g. You are using ASCII but you want to display á.

Then, you can use a character reference, e.g.:

Á	`&Aacute;`
á	`&aacute;`

€	`&euro;`
½	`&frac12;`

Of course, if you are using Unicode (e.g. your charset is UTF-8), then this is less relevant.
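For the ASCII case, the conversion to character references can be automated. This sketch relies on the "xmlcharrefreplace" error handler of Python's str.encode, which substitutes a decimal character reference for every character that ASCII cannot represent. The function name ascii_with_references and the sample text are invented for this example.

```python
def ascii_with_references(text):
    """Encode as ASCII, replacing other characters by decimal character references."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(ascii_with_references("á costs €1½"))
```

The result contains only ASCII characters, yet a browser would still display á, € and ½.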

Character References

You can include a character by name (if it has one), by hexadecimal number or by decimal number. For a list of names, see:

https://dev.w3.org/html5/html-author/charref
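As a quick check that the three forms are equivalent, the sketch below decodes a named, a hexadecimal and a decimal reference for the euro sign with Python's standard html.unescape. The choice of the euro sign and the variable names are assumptions of this example.

```python
import html

# The same character written by name, by hexadecimal number and by decimal number.
forms = ["&euro;", "&#x20AC;", "&#8364;"]
decoded = [html.unescape(form) for form in forms]

print(decoded)
```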
