ASCII
ASCII stands for American Standard Code for Information Interchange. It is a character encoding standard that assigns numerical values (codes) to represent characters, including letters, numbers, punctuation marks, and control characters.
ASCII was developed in the early 1960s by the X3.2.4 working group of the X3 committee of the American Standards Association (ASA). The ASA later became the United States of America Standards Institute (USASI) and ultimately the American National Standards Institute (ANSI).
ASCII is a 7-bit character set containing 2^7 = 128 characters. It contains the digits 0-9, the upper and lower case English letters from A to Z, punctuation and other special characters, and a set of non-printing control characters.
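As a quick illustration, Python's built-in ord and chr convert between a character and its numeric code, and a simple range check confirms that a string fits in 7 bits. This is only a minimal sketch; the helper name is_ascii_text is an illustrative assumption, not part of any standard.

```python
# Map characters to their ASCII codes and back with Python built-ins.
code_of_A = ord("A")        # 65
char_of_97 = chr(97)        # "a"


def is_ascii_text(text: str) -> bool:
    """Return True if every character has a code in the 7-bit range 0-127."""
    return all(ord(ch) < 2 ** 7 for ch in text)


# 7-bit binary form of each character in a short string.
bits = [format(ord(ch), "07b") for ch in "Hi"]

print(code_of_A, char_of_97)       # 65 a
print(is_ascii_text("Hello"))      # True
print(is_ascii_text("caf\u00e9"))  # False, because of the accented e
print(bits)                        # ['1001000', '1101001']
```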
ASCII has been widely used in computers and communication equipment for encoding text data. However, with the need to represent characters beyond the original 128, larger ASCII-compatible encodings have been developed, such as the 8-bit ISO 8859 family and the variable-width UTF-8, which support a wider range of characters including those from non-English languages and special symbols.
ASCII Printable Characters
Char | Number | Description |
 | 0 - 31 | Control characters (see below) |
 | 32 | space |
! | 33 | |
" | 34 | |
# | 35 | |
$ | 36 | |
% | 37 | |
& | 38 | |
' | 39 | |
( | 40 | |
) | 41 | |
* | 42 | |
+ | 43 | |
, | 44 | |
- | 45 | |
. | 46 | |
/ | 47 | |
0 | 48 | |
1 | 49 | |
2 | 50 | |
3 | 51 | |
4 | 52 | |
5 | 53 | |
6 | 54 | |
7 | 55 | |
8 | 56 | |
9 | 57 | |
: | 58 | |
; | 59 | |
< | 60 | |
= | 61 | |
> | 62 | |
? | 63 | |
@ | 64 | |
A | 65 | |
B | 66 | |
C | 67 | |
D | 68 | |
E | 69 | |
F | 70 | |
G | 71 | |
H | 72 | |
I | 73 | |
J | 74 | |
K | 75 | |
L | 76 | |
M | 77 | |
N | 78 | |
O | 79 | |
P | 80 | |
Q | 81 | |
R | 82 | |
S | 83 | |
T | 84 | |
U | 85 | |
V | 86 | |
W | 87 | |
X | 88 | |
Y | 89 | |
Z | 90 | |
[ | 91 | |
\ | 92 | |
] | 93 | |
^ | 94 | |
_ | 95 | |
` | 96 | |
a | 97 | |
b | 98 | |
c | 99 | |
d | 100 | |
e | 101 | |
f | 102 | |
g | 103 | |
h | 104 | |
i | 105 | |
j | 106 | |
k | 107 | |
l | 108 | |
m | 109 | |
n | 110 | |
o | 111 | |
p | 112 | |
q | 113 | |
r | 114 | |
s | 115 | |
t | 116 | |
u | 117 | |
v | 118 | lowercase v |
w | 119 | lowercase w |
x | 120 | |
y | 121 | |
z | 122 | |
{ | 123 | |
| | 124 | vertical bar |
} | 125 | |
~ | 126 | |
ASCII Control Characters
Char | Number | Description |
NUL | 00 | null character |
SOH | 01 | start of header |
STX | 02 | start of text |
ETX | 03 | end of text |
EOT | 04 | end of transmission |
ENQ | 05 | enquiry |
ACK | 06 | acknowledge |
BEL | 07 | bell (ring) |
BS | 08 | backspace |
HT | 09 | horizontal tab |
LF | 10 | line feed |
VT | 11 | vertical tab |
FF | 12 | form feed |
CR | 13 | carriage return |
SO | 14 | shift out |
SI | 15 | shift in |
DLE | 16 | data link escape |
DC1 | 17 | device control 1 |
DC2 | 18 | device control 2 |
DC3 | 19 | device control 3 |
DC4 | 20 | device control 4 |
NAK | 21 | negative acknowledge |
SYN | 22 | synchronize |
ETB | 23 | end transmission block |
CAN | 24 | cancel |
EM | 25 | end of medium |
SUB | 26 | substitute |
ESC | 27 | escape |
FS | 28 | file separator |
GS | 29 | group separator |
RS | 30 | record separator |
US | 31 | unit separator |
DEL | 127 | delete (rubout) |
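Several of these control characters are still in everyday use and have escape sequences in most programming languages. The sketch below, using Python string literals as an example, maps a few of the abbreviations from the table to their escape sequences and prints the resulting codes. The dictionary name escapes is just an illustrative choice.

```python
# Escape sequences in Python string literals and the control codes they produce.
escapes = {
    "NUL": "\0",
    "BEL": "\a",
    "BS": "\b",
    "HT": "\t",
    "LF": "\n",
    "VT": "\v",
    "FF": "\f",
    "CR": "\r",
    "ESC": "\x1b",   # no dedicated letter escape, so written in hexadecimal
    "DEL": "\x7f",
}

codes = {name: ord(ch) for name, ch in escapes.items()}

for name, code in codes.items():
    print(f"{name:>3} {code:3d}")
```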
URL Encodings
URL encoding, also known as percent-encoding, is a mechanism used to convert certain characters in a URL (Uniform Resource Locator) into a format that can be safely transmitted over the internet.
In URLs, certain characters have special meanings or functions. For example, the characters ?, &, /, and = are used to separate different parts of the URL or to denote query parameters. If you want to include these characters in the URL as part of data rather than as delimiters or special characters, they need to be encoded.
URL encoding works by replacing each character that may not appear literally with a '%' followed by the two-digit hexadecimal value of its byte. Letters, digits, and the characters '-', '.', '_', and '~' never need to be encoded. For instance, a space character ' ' is represented as '%20', since its ASCII value is 32, which is 20 in hexadecimal.
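A short sketch of percent-encoding using quote and unquote from Python's standard urllib.parse module. The sample string is made up for illustration.

```python
from urllib.parse import quote, unquote

# safe="" asks quote() to encode "/" as well, which it leaves alone by default.
encoded = quote("a b&c=d/e?f", safe="")
decoded = unquote(encoded)

print(encoded)   # a%20b%26c%3Dd%2Fe%3Ff
print(decoded)   # a b&c=d/e?f

# Non-ASCII characters are first encoded as UTF-8 bytes, then each byte is escaped.
accented = quote("\u00e9")
print(accented)  # %C3%A9
```

Which characters should be left unescaped depends on the part of the URL being built, which is why quote exposes the safe parameter.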
ANSI
The American National Standards Institute (ANSI) is a private, non-profit organization that administers and coordinates the U.S. voluntary standards and conformity assessment system. The ANSI character set was the standard set of characters used in Windows operating systems through Windows 95 and Windows NT, after which Unicode was adopted.
ISO-8859
ISO 8859 is a series of 8-bit character encoding standards published by the International Organization for Standardization (ISO), starting in 1987. Its first part, ISO 8859-1, is also known as Latin alphabet No. 1. These standards extend the ASCII character set with characters needed by languages written in Latin scripts, such as French, German, and Spanish, and by several other scripts such as Cyrillic, Arabic, Greek, and Hebrew. Each part defines an 8-bit character encoding, allowing for a total of 256 characters. The parts are numbered ISO 8859-1, ISO 8859-2, ISO 8859-3, and so on, up to ISO 8859-16, and each is tailored to specific languages or language groups. Every part keeps the first 128 characters identical to ASCII and adds its own characters above them.
In ISO-8859-1, the characters from 128 to 159 are not defined. The next part of ISO-8859-1 (codes from 160-191) contains commonly used special characters.
If you use the less than (<) or greater than (>) signs in your HTML text, the browser might mix them up with tags. Entity names or entity numbers can be used to display reserved HTML characters. Entity names are written as &entity_name; and entity numbers as &#entity_number;. For example, the less than sign can be written as &lt; or &#60;.
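As a small sketch of how entities work in practice, Python's standard html module can escape reserved characters and turn entity names or numbers back into characters. The sample strings are made up for illustration.

```python
import html

# Reserved characters are replaced by entity names so they are not read as markup.
escaped = html.escape("if a < b & b > c")
print(escaped)     # if a &lt; b &amp; b &gt; c

# An entity name and an entity number decode to the same character.
unescaped = html.unescape("&copy; &#169; &lt; &#60;")
print(unescaped)   # © © < <
```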
ISO-8859-1 Symbols (160-191)
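Character | Entity Number | Entity Name | Description |
 | &#160; | &nbsp; | non-breaking space |
¡ | &#161; | &iexcl; | |
¢ | &#162; | &cent; | |
£ | &#163; | &pound; | |
¤ | &#164; | &curren; | |
¥ | &#165; | &yen; | |
¦ | &#166; | &brvbar; | |
§ | &#167; | &sect; | |
¨ | &#168; | &uml; | |
© | &#169; | &copy; | |
ª | &#170; | &ordf; | |
« | &#171; | &laquo; | |
¬ | &#172; | &not; | |
 | &#173; | &shy; | soft hyphen |
® | &#174; | &reg; | |
¯ | &#175; | &macr; | |
° | &#176; | &deg; | |
± | &#177; | &plusmn; | |
² | &#178; | &sup2; | |
³ | &#179; | &sup3; | |
´ | &#180; | &acute; | |
µ | &#181; | &micro; | |
¶ | &#182; | &para; | |
· | &#183; | &middot; | |
¸ | &#184; | &cedil; | |
¹ | &#185; | &sup1; | |
º | &#186; | &ordm; | |
» | &#187; | &raquo; | |
¼ | &#188; | &frac14; | |
½ | &#189; | &frac12; | |
¾ | &#190; | &frac34; | |
¿ | &#191; | &iquest; | inverted question mark |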
ISO-8859-1 Characters (192-255)
Character | Entity Number | Entity Name | Description |
À | &#192; | &Agrave; | capital a, grave accent |
Á | &#193; | &Aacute; | capital a, acute accent |
Â | &#194; | &Acirc; | capital a, circumflex accent |
Ã | &#195; | &Atilde; | capital a, tilde |
Ä | &#196; | &Auml; | capital a, umlaut mark |
Å | &#197; | &Aring; | capital a, ring |
Æ | &#198; | &AElig; | capital ae |
Ç | &#199; | &Ccedil; | capital c, cedilla |
È | &#200; | &Egrave; | capital e, grave accent |
É | &#201; | &Eacute; | capital e, acute accent |
Ê | &#202; | &Ecirc; | capital e, circumflex accent |
Ë | &#203; | &Euml; | capital e, umlaut mark |
Ì | &#204; | &Igrave; | capital i, grave accent |
Í | &#205; | &Iacute; | capital i, acute accent |
Î | &#206; | &Icirc; | capital i, circumflex accent |
Ï | &#207; | &Iuml; | capital i, umlaut mark |
Ð | &#208; | &ETH; | capital eth, Icelandic |
Ñ | &#209; | &Ntilde; | capital n, tilde |
Ò | &#210; | &Ograve; | capital o, grave accent |
Ó | &#211; | &Oacute; | capital o, acute accent |
Ô | &#212; | &Ocirc; | capital o, circumflex accent |
Õ | &#213; | &Otilde; | capital o, tilde |
Ö | &#214; | &Ouml; | capital o, umlaut mark |
× | &#215; | &times; | multiplication |
Ø | &#216; | &Oslash; | capital o, slash |
Ù | &#217; | &Ugrave; | capital u, grave accent |
Ú | &#218; | &Uacute; | capital u, acute accent |
Û | &#219; | &Ucirc; | capital u, circumflex accent |
Ü | &#220; | &Uuml; | capital u, umlaut mark |
Ý | &#221; | &Yacute; | capital y, acute accent |
Þ | &#222; | &THORN; | capital THORN, Icelandic |
ß | &#223; | &szlig; | small sharp s, German |
à | &#224; | &agrave; | small a, grave accent |
á | &#225; | &aacute; | small a, acute accent |
â | &#226; | &acirc; | small a, circumflex accent |
ã | &#227; | &atilde; | small a, tilde |
ä | &#228; | &auml; | small a, umlaut mark |
å | &#229; | &aring; | small a, ring |
æ | &#230; | &aelig; | small ae |
ç | &#231; | &ccedil; | small c, cedilla |
è | &#232; | &egrave; | small e, grave accent |
é | &#233; | &eacute; | small e, acute accent |
ê | &#234; | &ecirc; | small e, circumflex accent |
ë | &#235; | &euml; | small e, umlaut mark |
ì | &#236; | &igrave; | small i, grave accent |
í | &#237; | &iacute; | small i, acute accent |
î | &#238; | &icirc; | small i, circumflex accent |
ï | &#239; | &iuml; | small i, umlaut mark |
ð | &#240; | &eth; | small eth, Icelandic |
ñ | &#241; | &ntilde; | small n, tilde |
ò | &#242; | &ograve; | small o, grave accent |
ó | &#243; | &oacute; | small o, acute accent |
ô | &#244; | &ocirc; | small o, circumflex accent |
õ | &#245; | &otilde; | small o, tilde |
ö | &#246; | &ouml; | small o, umlaut mark |
÷ | &#247; | &divide; | division |
ø | &#248; | &oslash; | small o, slash |
ù | &#249; | &ugrave; | small u, grave accent |
ú | &#250; | &uacute; | small u, acute accent |
û | &#251; | &ucirc; | small u, circumflex accent |
ü | &#252; | &uuml; | small u, umlaut mark |
ý | &#253; | &yacute; | small y, acute accent |
þ | &#254; | &thorn; | small thorn, Icelandic |
ÿ | &#255; | &yuml; | small y, umlaut mark |
Variants of ISO-8859
Number | Description | Covers |
8859-1 | Latin 1 | North America, Western Europe, Latin America, the Caribbean, Canada, Africa. |
8859-2 | Latin 2 | Eastern Europe. |
8859-3 | Latin 3 | SE Europe, Esperanto, miscellaneous others. |
8859-4 | Latin 4 | Scandinavia/Baltics (and others not in ISO-8859-1). |
8859-5 | Latin/Cyrillic | The Cyrillic alphabet. Bulgarian, Belarusian, Russian and Macedonian. |
8859-6 | Latin/Arabic | The Arabic alphabet. |
8859-7 | Latin/Greek | The modern Greek alphabet and mathematical symbols derived from the Greek. |
8859-8 | Latin/Hebrew | The Hebrew alphabet. |
8859-9 | Latin/Turkish | The Turkish alphabet. Same as ISO-8859-1 except Turkish characters replace Icelandic. |
8859-10 | Latin/Nordic | Nordic alphabets. Lappish, Nordic, Eskimo. |
8859-15 | Latin 9 (Latin 0) | Similar to ISO-8859-1 but replaces some less common symbols with the euro sign and some other missing characters. |
2022-JP | Latin/Japanese 1 | The Japanese alphabet part 1. |
2022-JP-2 | Latin/Japanese 2 | The Japanese alphabet part 2. |
2022-KR | Latin/Korean 1 | The Korean alphabet. |
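However, ISO 8859 has limitations. Each part is restricted to 256 characters, so a single document cannot mix scripts from different parts, and languages with very large character sets, such as Chinese and Japanese, cannot be covered at all. This led to the development of more comprehensive encoding standards like Unicode.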
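A small sketch of how the same byte is read differently under different ISO 8859 parts, using the codecs that ship with Python. The byte value 233 (0xE9) is an arbitrary choice for illustration.

```python
raw = bytes([0xE9])   # a single byte with the value 233

# The same byte decodes to a different character in each part.
decoded = {
    codec: raw.decode(codec)
    for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}
for codec, ch in decoded.items():
    print(codec, ch, f"U+{ord(ch):04X}")

# Bytes below 128 decode identically everywhere because they are plain ASCII.
same_everywhere = {b"A".decode(codec) for codec in decoded}
print(same_everywhere)   # {'A'}
```

This is why a byte-oriented text file is only meaningful together with the name of the encoding it was written in.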
ANSI Code Page & Windows-1252
ANSI code pages, officially called "Windows code pages" after Microsoft accepted that the former term is a misnomer, are used for native non-Unicode (byte-oriented) applications using a graphical user interface on Windows systems. The term "ANSI" is a misnomer because these Windows code pages did not comply with any ANSI standard. Code page 1252 was based on an early ANSI draft that later became the international standard ISO 8859-1.
Windows-1252 was the first default character set in Microsoft Windows. Undeclared charsets in HTML are also commonly assumed to be Windows-1252. Windows-1252 is identical to ISO-8859-1 except for the code points 128-159 (0x80-0x9F). In ISO-8859-1, the characters from 128 to 159 are not defined. Windows-1252 assigns letters, punctuation, arithmetic and business symbols to most of these code points.
Character | Number | Entity Name | Description |
€ | 128 | &euro; | |
 | 129 | | not used |
‚ | 130 | &sbquo; | |
ƒ | 131 | &fnof; | |
„ | 132 | &bdquo; | |
… | 133 | &hellip; | |
† | 134 | &dagger; | |
‡ | 135 | &Dagger; | |
ˆ | 136 | &circ; | |
‰ | 137 | &permil; | |
Š | 138 | &Scaron; | |
‹ | 139 | &lsaquo; | |
Œ | 140 | &OElig; | |
 | 141 | | not used |
Ž | 142 | &Zcaron; | |
 | 143 | | not used |
 | 144 | | not used |
‘ | 145 | &lsquo; | |
’ | 146 | &rsquo; | |
“ | 147 | &ldquo; | |
” | 148 | &rdquo; | |
• | 149 | &bull; | |
– | 150 | &ndash; | |
— | 151 | &mdash; | |
˜ | 152 | &tilde; | |
™ | 153 | &trade; | |
š | 154 | &scaron; | |
› | 155 | &rsaquo; | |
œ | 156 | &oelig; | |
 | 157 | | not used |
ž | 158 | &zcaron; | |
Ÿ | 159 | &Yuml; | |
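The difference between the two encodings in the 128-159 range can be seen with a rough sketch using Python's cp1252 and latin-1 codecs. The three sample bytes are arbitrary picks from the table above.

```python
raw = bytes([0x80, 0x93, 0x94])   # 128, 147, 148

# Windows-1252 assigns printable characters to these positions.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)                  # euro sign followed by curly double quotes

# Python's "latin-1" codec maps 0x80-0x9F to invisible control characters
# with the same numeric values.
as_latin1 = raw.decode("latin-1")
print([hex(ord(ch)) for ch in as_latin1])   # ['0x80', '0x93', '0x94']

# 129 (0x81) is one of the positions Windows-1252 leaves unassigned.
try:
    bytes([0x81]).decode("cp1252")
    unassigned = False
except UnicodeDecodeError:
    unassigned = True
print(unassigned)                 # True
```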
Hexadecimal
Hexadecimal, often abbreviated as "hex," is a base-16 numeral system used in mathematics and computer science. In hexadecimal, numbers are represented using 16 symbols: the digits 0-9 and the letters A-F (where A represents 10, B represents 11, and so on up to F representing 15). Hexadecimal is commonly used in computing because it provides a more concise way to represent binary data, as each hexadecimal digit corresponds to four binary digits (bits).
Decimal | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Hexadecimal | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
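For reference, most languages have built-in support for hexadecimal. The following minimal Python sketch converts in both directions and shows the correspondence between one hexadecimal digit and four bits. The sample values 255 and 0xA5 are arbitrary.

```python
n = 255
print(hex(n))           # 0xff
print(format(n, "X"))   # FF
print(int("FF", 16))    # 255

# Each hexadecimal digit corresponds to exactly four bits.
value = 0xA5
nibbles = [format(int(digit, 16), "04b") for digit in format(value, "X")]
print(nibbles)                 # ['1010', '0101']
print(format(value, "08b"))    # 10100101
```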
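After reaching 9, hexadecimal uses the letters A-F to represent the values 10 to 15. Let's convert the decimal value 269 to hexadecimal by repeatedly dividing by 16 and noting the remainder at each step:
269 ÷ 16 = 16, remainder 13 (D)
16 ÷ 16 = 1, remainder 0
1 ÷ 16 = 0, remainder 1
The hexadecimal number is the remainders from each step read in reverse order: 10D.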
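The repeated-division method can be sketched in a few lines of Python. The function name to_hex and the DIGITS constant are illustrative choices; in practice the built-in format(n, "X") does the same job.

```python
DIGITS = "0123456789ABCDEF"


def to_hex(n: int) -> str:
    """Convert a non-negative integer to hexadecimal by repeated division by 16."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 16)          # quotient carries on, remainder is a digit
        remainders.append(DIGITS[r])
    # The remainders come out least significant first, so reverse them.
    return "".join(reversed(remainders))


print(to_hex(269))   # 10D
```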
Base64
Base64 is a binary-to-text encoding scheme that represents binary data as an ASCII string by translating it into a radix-64 representation. It is used to encode binary data, such as images, audio files, or any other binary content, into a form that can be transmitted over text-based channels, such as email or URLs, without corruption due to special characters or encoding issues.
The binary data is divided into groups of 6 bits. Each group of 6 bits is then represented by a character from a predefined set of 64 printable ASCII characters. These characters typically include uppercase and lowercase letters (A-Z, a-z), digits (0-9), and two additional symbols (usually '+' and '/'). Three input bytes (24 bits) therefore become four output characters. If the input length is not a multiple of 3 bytes, the last group is filled with zero bits and padding characters ('=') are added so that the output length is a multiple of 4.
For example, consider the sentence Hi\n, where the \n represents a newline. The first step in the encoding process is to obtain the binary representation of each ASCII character: H = 01001000, i = 01101001, \n = 00001010. Regrouping these 24 bits into 6-bit groups gives 010010 000110 100100 001010, that is, the values 18, 6, 36, and 10, which map to the characters S, G, k, and K. The Base64 encoding of Hi\n is therefore SGkK, with no padding needed.
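The grouping into 6-bit values can be sketched by hand and compared with Python's standard base64 module. The function name b64_manual is an illustrative assumption, and the sketch favors readability over speed.

```python
import base64

ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def b64_manual(data: bytes) -> str:
    """Encode bytes as Base64 by regrouping the bit stream into 6-bit values."""
    bits = "".join(format(byte, "08b") for byte in data)
    # Fill the last group with zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Add "=" so the output length is a multiple of 4.
    return "".join(chars) + "=" * (-len(chars) % 4)


print(b64_manual(b"Hi\n"))                   # SGkK
print(base64.b64encode(b"Hi\n").decode())    # SGkK
print(b64_manual(b"Hi"))                     # SGk=
```

The last line shows padding: two input bytes give three characters, so one '=' is appended.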
UTF-8
UTF-8, which stands for Unicode Transformation Format 8-bit, is a variable-width character encoding capable of encoding all possible Unicode code points. It's the most commonly used encoding on the internet and in computing systems worldwide because it efficiently represents a wide range of characters while maintaining backward compatibility with ASCII.
UTF-8 converts a code point (which represents a single character in Unicode) into a sequence of one to four bytes. UTF-8 is compact and efficient, especially for languages that use mostly ASCII characters. For example, English text encoded in UTF-8 takes the same space as the same text in ASCII. A code point is a number assigned to represent an abstract character in Unicode. The code point for a character is typically written in hexadecimal notation. For example, the code point for the letter "A" is U+0041, where "U+" indicates that the following digits represent a Unicode code point, and "0041" is the hexadecimal representation of the code point.
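To make the relationship between code points and byte counts concrete, here is a small Python sketch that prints the code point and UTF-8 length of a few sample characters. The sample characters are arbitrary and written as escapes so the snippet itself stays plain ASCII.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = {}
for ch in samples:
    cp = ord(ch)                    # the code point as an integer
    encoded = ch.encode("utf-8")    # the UTF-8 byte sequence
    lengths[cp] = len(encoded)
    print(f"U+{cp:04X}  {len(encoded)} byte(s)  {encoded.hex(' ')}")

# Pure ASCII text has identical bytes in ASCII and UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))   # True
```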
In Unicode, a "plane" refers to a contiguous group of 65,536 (2^16) code points. The Unicode Standard assigns code points to different planes to accommodate a vast number of characters from various scripts and symbol sets. Unicode divides its code space into 17 planes, labeled from 0 to 16 (0x0 to 0x10 in hexadecimal). Each plane contains 65,536 code points, providing a total of 1,114,112 (17 * 65,536) code points.
The first plane, Plane 0 (U+0000 to U+FFFF), known as the Basic Multilingual Plane (BMP), contains the most commonly used characters for modern text processing, covering scripts such as Latin, Cyrillic, Greek, Hebrew, Arabic, Chinese, Japanese, and Korean, as well as many symbols, punctuation marks, and control characters. Planes 1 through 16 are referred to as "supplementary planes." They contain additional characters and symbols, including historical scripts, rare characters, emoji, mathematical symbols, musical symbols, and more. Plane 1, also known as the Supplementary Multilingual Plane (SMP), consists of code points ranging from U+10000 to U+1FFFF. Likewise, Plane 2 consists of code points ranging from U+20000 to U+2FFFF.
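Because each plane holds exactly 2^16 code points, the plane number is simply the code point with its low 16 bits dropped. A minimal Python sketch, where plane_of is an illustrative helper name:

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16   # same as code_point // 65536


for cp in (0x0041, 0xFFFF, 0x10348, 0x20000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")

print(17 * 65536)   # 1114112
```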
In UTF-8 encoding, code points in Plane 0 are represented using sequences of one to three bytes, depending on the code point's value:
- Code points in the range U+0000 to U+007F (0 to 127) are represented using a single byte. The byte's value directly corresponds to the code point's value.
- Code points in the range U+0080 to U+07FF are represented using two bytes. Such a code point needs at most 11 bits: the high-order 5 bits are stored in the first byte, and the low-order 6 bits are stored in the second byte.
- Code points in the range U+0800 to U+FFFF are represented using three bytes. Such a code point needs at most 16 bits: the high-order 4 bits are stored in the first byte, the next 6 bits in the second byte, and the low-order 6 bits in the third byte.
Here's a general pattern for representing code points in Plane 0 using UTF-8:
- One-byte sequence: 0xxxxxxx
- Two-byte sequence: 110xxxxx 10xxxxxx
- Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
Where 'x' represents bits from the code point.
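The three patterns translate almost directly into bit shifts and masks. The sketch below is a hand-written encoder for Plane 0 only. The function name utf8_encode_bmp is an illustrative assumption, and the sketch does not reject the surrogate range U+D800 to U+DFFF, which a real encoder must refuse.

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode a Plane 0 code point with the one-, two-, or three-byte pattern."""
    if cp < 0 or cp > 0xFFFF:
        raise ValueError("this sketch only handles Plane 0")
    if cp <= 0x7F:
        return bytes([cp])                                # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([
            0b11000000 | (cp >> 6),                       # 110xxxxx
            0b10000000 | (cp & 0b111111),                 # 10xxxxxx
        ])
    return bytes([
        0b11100000 | (cp >> 12),                          # 1110xxxx
        0b10000000 | ((cp >> 6) & 0b111111),              # 10xxxxxx
        0b10000000 | (cp & 0b111111),                     # 10xxxxxx
    ])


for cp in (0x0041, 0x00E9, 0x20AC):
    encoded = utf8_encode_bmp(cp)
    print(f"U+{cp:04X}", " ".join(format(b, "08b") for b in encoded))
```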
For example, let's say we have the code point U+0041. Its binary representation is:
U+0041 = 0000 0000 0100 0001
To represent this code point in UTF-8:
Since it falls in the range U+0000 to U+007F, it is represented using a single byte.
Therefore, the UTF-8 representation of U+0041 would be: 01000001
Let's take another example, U+0081.
Its binary representation is: 0000 0000 1000 0001
It falls in the range U+0080 to U+07FF, so only its low 11 bits (000 1000 0001) are used.
High-order 5 bits = 00010, low-order 6 bits = 000001
U+0081 = 11000010 10000001
Let's take another example, U+FFFF.
Its binary representation is: 1111 1111 1111 1111
High-order 4 bits = 1111
Next 6 bits = 111111
Low-order 6 bits = 111111
U+FFFF = 11101111 10111111 10111111
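The three worked examples can be checked against Python's built-in UTF-8 encoder. This is just a verification sketch; utf8_bits is an illustrative helper name.

```python
def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point as space-separated binary strings."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(format(byte, "08b") for byte in encoded)


for cp in (0x0041, 0x0081, 0xFFFF):
    print(f"U+{cp:04X} = {utf8_bits(cp)}")
```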
Let's take another example: the UTF-8 representation of U+10348, which is a Plane 1 code point.
Its binary representation is: 0001 0000 0011 0100 1000
Code points outside Plane 0 (U+10000 to U+10FFFF) use a four-byte sequence, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, which carries 21 bits. The value above is written with 20 bits, so it is padded on the left with one 0 to make 21 bits: 0 0001 0000 0011 0100 1000
Splitting these 21 bits into groups of 3, 6, 6, and 6 bits gives: 000 010000 001101 001000
U+10348 = 11110000 10010000 10001101 10001000
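For code points beyond Plane 0, the four-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx spreads 21 bits over groups of 3, 6, 6, and 6 bits. The sketch below encodes U+10348 by hand and compares the result with Python's built-in encoder. The function name utf8_encode_supplementary is an illustrative assumption.

```python
def utf8_encode_supplementary(cp: int) -> bytes:
    """Encode a code point from Planes 1-16 with the four-byte UTF-8 pattern."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("this sketch only handles Planes 1 to 16")
    return bytes([
        0b11110000 | (cp >> 18),                 # 11110xxx: top 3 of the 21 bits
        0b10000000 | ((cp >> 12) & 0b111111),    # 10xxxxxx: next 6 bits
        0b10000000 | ((cp >> 6) & 0b111111),     # 10xxxxxx: next 6 bits
        0b10000000 | (cp & 0b111111),            # 10xxxxxx: lowest 6 bits
    ])


encoded = utf8_encode_supplementary(0x10348)
print(" ".join(format(b, "08b") for b in encoded))
# 11110000 10010000 10001101 10001000
print(encoded == chr(0x10348).encode("utf-8"))   # True
```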