I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows to use null-terminated strings, this introduces a great deal of backwards compatibility too.
Main UTF-8 cons:
Many common characters have different length, which slows indexing and calculating a string length terribly.
Main UTF-16 pros:
Most reasonable characters, like Latin, Cyrillic, Chinese, Japanese can be represented with 2 bytes. Unless really exotic characters are needed, this means that the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds indexing.
Main UTF-16 cons:
Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.
Partager