Web Development

Dr Derek Bridge
School of Computer Science & Information Technology
University College Cork

Lecture Objectives

  • learn about the rules of HTML
  • learn how to validate your web page
  • learn how to choose the correct character set

HTML Syntax Rules

You should ensure your HTML is well-formed.

  • Only certain tags are permitted.
  • Only certain attributes and values are permitted.
    • E.g. any element can have a lang attribute (it's a global attribute).
    • But only a few elements can have an alt attribute (e.g. <img>).

More Syntax Rules

  • Often, only certain content is permitted.
    • E.g. <p> elements cannot contain other <p> elements.
    • E.g. <p> elements cannot contain <ul> or <ol> elements.
    • E.g. <ul> and <ol> elements can contain only <li> elements (zero or more).

Another HTML Syntax Rule

Nesting must be done properly.

Otherwise there is difficulty building the tree.
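
To see why improper nesting makes the tree hard to build, here is a minimal Python sketch that tracks open elements on a stack. The names NestingChecker and nesting_problems are hypothetical helpers invented for this illustration. The check is stricter than a browser's HTML parser, which accepts some omitted end tags.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not match the open element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, innermost last
        self.problems = []   # descriptions of mismatched end tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()   # properly nested: closes the innermost element
        else:
            expected = self.stack[-1] if self.stack else "nothing"
            self.problems.append(f"</{tag}> closes while <{expected}> is open")


def nesting_problems(markup):
    """Return a list of nesting problems found in the markup (empty if none)."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems + [f"<{tag}> is never closed" for tag in checker.stack]


print(nesting_problems("<p><em>hi</em></p>"))   # properly nested
print(nesting_problems("<p><em>hi</p></em>"))   # end tags in the wrong order
```

The first call reports no problems. The second reports that </p> arrives while <em> is still the innermost open element, which is exactly the point at which a tree builder no longer knows which node to close.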

Syntax Rules

A web developer writes HTML to produce a nested list.

Which rule is broken?

Nested Lists Done Correctly

Two Sets of Syntax Rules!

In fact, HTML5 has two sets of syntax rules!

XML syntax uses a very strict set of rules.

HTML syntax allows you to break the rules…
in certain cases.

XML Syntax

In XML syntax, e.g.:

  • tags must be in lowercase;
  • each start tag must have an end tag or, for void elements, an extra slash;
  • attribute values must be quoted;
  • …and so on.

Ironically, it's easier to use the strict XML syntax than the HTML syntax!
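
The strictness of XML syntax can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The function name is_well_formed_xml is an assumption made for this example, and it only tests XML well-formedness (end tags, quoted attributes), not HTML-specific rules such as lowercase tag names.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Void element written with the extra slash: accepted.
print(is_well_formed_xml('<p>line one<br />line two</p>'))
# <br> has no end tag and no slash: rejected.
print(is_well_formed_xml('<p>line one<br>line two</p>'))
# Attribute value is not quoted: rejected.
print(is_well_formed_xml('<p class=intro>hello</p>'))
```

An XML parser stops at the first error instead of guessing what was meant.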

Breaking the Rules

What does your browser do if your web page is not well-formed?

Browsers (almost) never give error messages.
They do their best to build the tree and display the page.

HTML Validation

If browsers don't give error messages,
how do you know if your page is well-formed or not?

You can validate your page:
https://html5.validator.nu/
https://validator.w3.org/

Character Sets

  • A character set is a collection of characters.
  • E.g. the ASCII character set has 128 characters, mostly from the modern Latin alphabet.
  • E.g. the Unicode character set currently has roughly 150,000 characters.

Coded Character Sets

  • A coded character set assigns a unique number to each distinct character.
  • E.g. in Unicode (and ASCII) 'A' is 65 and 'a' is 97 (decimal).
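
These numbers can be checked in Python, whose strings are sequences of Unicode code points. This is only an illustration; the helper name code_point is made up for this example.

```python
def code_point(character):
    """Return the number that Unicode assigns to a single character."""
    return ord(character)


print(code_point("A"))        # 65
print(code_point("a"))        # 97
print(chr(65))                # A  (the reverse mapping, number to character)
print(code_point("á"))        # 225, outside ASCII's range of 0 to 127
print(hex(code_point("á")))   # 0xe1, the same number in hexadecimal
```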

Character Encodings

A character encoding is the way the numbers are converted to bytes for storage and transmission.

  • ASCII: 7 bits for every character
  • UTF-32: 4 bytes for every character
  • UTF-8: 1 byte for ASCII characters and 2, 3 or 4 bytes for others
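
The byte counts can be confirmed with a short Python sketch. The helper encoded_length is a name invented for this example. The codec "utf-32-be" is used rather than "utf-32" because the latter prepends a 4-byte byte order mark, which would distort the count.

```python
def encoded_length(text, encoding):
    """Number of bytes needed to store the text in the given encoding."""
    return len(text.encode(encoding))


# One ASCII character, then characters needing 2, 3 and 4 bytes in UTF-8.
for character in ["A", "á", "€", "😀"]:
    print(character,
          encoded_length(character, "utf-8"),
          encoded_length(character, "utf-32-be"))
```

UTF-8 needs 1, 2, 3 and 4 bytes respectively for these characters, while UTF-32 needs 4 bytes for each.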

Character Encoding of Web Pages

Browsers need to know which character encoding was used to create your web page.

Find out which encoding your editor is using and specify that character encoding in a meta element in the <head> of your HTML, e.g.:


<meta charset="utf-8" />
    

Wrong Character Encoding

  • What happens if your editor uses one encoding but you specify a different one?
    • Some characters may display as other characters.
    • Some characters may display as �
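
Both symptoms can be reproduced in Python by encoding text one way and decoding it another. This is a sketch: the function name misread is hypothetical, and "latin-1" (ISO-8859-1) stands in for whichever mismatched encoding is involved.

```python
def misread(text, saved_as, declared_as):
    """Encode text as the editor would, then decode it as the browser is told to."""
    return text.encode(saved_as).decode(declared_as, errors="replace")


# Saved as UTF-8 but declared as ISO-8859-1: é displays as two other characters.
print(misread("café", "utf-8", "latin-1"))

# Saved as ISO-8859-1 but declared as UTF-8: é displays as the replacement character.
print(misread("café", "latin-1", "utf-8"))
```

The first call prints cafÃ© and the second prints caf�. The replacement character appears because the single byte for é is not a valid UTF-8 sequence.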

A Better Solution?

  • The Apache web server can be configured so that, when it serves a text file, it converts it, e.g., to UTF-8 — irrespective of its original character encoding.
  • And it specifies the new character encoding in the Content-Type HTTP header.
  • Browsers treat the HTTP header as more authoritative than the <meta> element.

charset in HTTP responses

An HTTP response can specify the charset in the Content-Type header, e.g.:

    Content-Type: text/html; charset=utf-8
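
As an illustration of how a client might read the charset out of that header, here is a small Python sketch using the standard email.message module, which understands this header format. The function name charset_from_content_type is an assumption made for the example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()   # lower-cased charset name, or None


print(charset_from_content_type("text/html; charset=UTF-8"))   # utf-8
print(charset_from_content_type("text/html"))                  # None
```

When the header carries no charset, the client has to fall back on other clues, such as the <meta> element.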

Chrome

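Since version 55, Chrome no longer offers a menu for overriding a page's character encoding by hand. It relies on the encoding that the page declares and, if none is declared, it tries to detect one. The HTML standard now requires documents to use utf-8, so declare utf-8 and save your files as utf-8.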

Other browsers are following suit.

Reserved Characters

  • Some characters have a special meaning in HTML.
  • To display them in a web page, you may need to use their character references:
    • &lt; for <
    • &gt; for >
    • &amp; for &
    • &quot; for "
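
When a program generates HTML, the reserved characters are usually replaced automatically. A minimal Python sketch using the standard html module follows; the sample text is made up for this example.

```python
import html

text = 'if a < b && b > c, say "ok"'

# html.escape replaces &, < and > (and, by default, quotation marks)
# with their character references.
escaped = html.escape(text)
print(escaped)
```

This prints if a &lt; b &amp;&amp; b &gt; c, say &quot;ok&quot; which a browser then displays as the original text.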

More Character References

  • Suppose you want to display a character that is not part of your character set or not easy to type on your keyboard.
  • E.g. you are using ASCII but you want to display á.
  • Then, you can use a character reference, e.g. &aacute;
  • Of course, if you are using Unicode (e.g. your charset is UTF-8), then this is less relevant.

Character References

You can include them by name (if they have one), by hexadecimal number or by decimal number, e.g.:

&aacute; &#x000E1; &#225;

https://dev.w3.org/html5/html-author/charref
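
The three forms can be decoded, and generated, with Python's standard html module. This is a sketch: references_for is a hypothetical helper, and html.entities.codepoint2name covers only the older HTML 4 entity names.

```python
import html
from html.entities import codepoint2name


def references_for(character):
    """Return the character references for one character: named (if any), hex, decimal."""
    number = ord(character)
    references = []
    if number in codepoint2name:
        references.append(f"&{codepoint2name[number]};")
    references.append(f"&#x{number:X};")   # hexadecimal form
    references.append(f"&#{number};")      # decimal form
    return references


print(references_for("á"))
# All three forms decode to the same character.
print(html.unescape("&aacute; &#x000E1; &#225;"))
```

Leading zeros in the hexadecimal form, as in &#x000E1;, do not change the character referred to.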

G'luck!