Unicode

Unicode is supposed to make text handling better, yet it is still not as widely used as it should be. The Linux world has started to get to grips with it, and Oracle has also handled it well.

English in Unicode

The "English", or more precisely ASCII, character set can be represented in Unicode. ASCII defines only 128 code points (0 to 127), of which 95 are printable:

  • 26 Letters, in two cases (52 in total)
    • Upper
    • Lower
  • 10 Arabic numerals (0 to 9)
  • 32 punctuation and special chars i.e. , ; : & etc
  • 1 space

The letter A in different representations:

Representation          Value
Decimal (base 10)       65
Hex (base 16)           0x41
16-bit, two bytes       0x00 0x41
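
The values in the table can be reproduced with a few lines of Python. This is only a minimal sketch; it assumes the two-byte form is meant as UTF-16 in big-endian byte order, which the original table does not state explicitly.

```python
# Sketch: the letter "A" as a code point and as a 16-bit, big-endian byte pair.
letter = "A"

code_point = ord(letter)                    # decimal code point
hex_form = "0x%02X" % code_point            # same value in hex
utf16_bytes = letter.encode("utf-16-be")    # assumption: big-endian UTF-16
byte_pair = " ".join("0x%02X" % b for b in utf16_bytes)

print(code_point)   # 65
print(hex_form)     # 0x41
print(byte_pair)    # 0x00 0x41
```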

Arabic in Unicode

The Arabic alphabet is much more complex, not only because each letter can take different forms at the start, in the middle, and at the end of a word, but also because of the diacritics that are applied to the letters.
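
Both points can be observed from Python's standard unicodedata module. The sketch below uses the letter Beh purely as an illustration (it is not singled out in the text above): Unicode carries legacy "presentation form" code points for the positional shapes, and compatibility normalisation (NFKC) folds each of them back to the one base letter. The diacritic checked is Kasratan, U+064D.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally stored in text

# Legacy presentation forms of Beh: isolated, final, initial, medial.
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
for ch in forms:
    base = unicodedata.normalize("NFKC", ch)   # folds back to the base letter
    print("U+%04X %s -> U+%04X" % (ord(ch), unicodedata.name(ch), ord(base)))

# A diacritic is a separate, combining code point attached to the letter before it.
KASRATAN = "\u064D"
print(unicodedata.name(KASRATAN), unicodedata.category(KASRATAN))
is_combining = unicodedata.combining(KASRATAN) != 0
```

In normal text only the base letter is stored, and the renderer picks the positional shape, so a program like the one further down never sees the presentation forms unless the input already contains them.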

The alphabet is 28 letters long (Farsi has 32; the four extra letters are Peh, Cheh, Zheh and Gaf).

Aliph

The letter Aliph (also spelled Alif), or A. It is said to represent 27% of all letters in Arabic text, which would make Arabic statistically weak against frequency analysis from a cryptanalysis point of view.
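
A frequency figure like this can be checked against any body of text by counting letters. The following is a rough sketch only: the function name letter_frequencies and the short sample string are made up for illustration, and a meaningful measurement would need a large corpus rather than a few words.

```python
import unicodedata
from collections import Counter

def letter_frequencies(text):
    """Return {letter: share of all letters}, ignoring marks, spaces and punctuation."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}

# Hypothetical sample: Alif, Beh, Alif, space, Dal, Alif, Reh
sample = "\u0627\u0628\u0627 \u062F\u0627\u0631"
freqs = letter_frequencies(sample)
for ch, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print("U+%04X %.1f%%" % (ord(ch), share * 100))
```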

Representation          Value
Decimal (base 10)       1575
Hex (base 16)           0x0627
16-bit, two bytes       0x06 0x27
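
The same values can be derived in Python. As a sketch, and again assuming the two-byte form is big-endian UTF-16, the snippet also prints the UTF-8 bytes for comparison, since that is the encoding most files and terminals use today.

```python
alif = "\u0627"  # ARABIC LETTER ALEF

decimal_value = ord(alif)
hex_value = "0x%04X" % decimal_value
utf16_pair = " ".join("%02X" % b for b in alif.encode("utf-16-be"))
utf8_pair = " ".join("%02X" % b for b in alif.encode("utf-8"))

print(decimal_value)  # 1575
print(hex_value)      # 0x0627
print(utf16_pair)     # 06 27
print(utf8_pair)      # D8 A7
```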

Unicode Data Decoding

The following is a small program that takes an Arabic sentence (a pangram, one with all the letters in it), converts it to a string of hex digits, and then converts that string back to Unicode.

IT IS WRITTEN FOR EASE OF UNDERSTANDING - NOT FOR SPEED OR PRODUCTION USE

#Tim
#July 7

#Some Poetry I think
#This is UNICODE
phrase="هلا سكنت بذي ضغثٍ فقد زعموا — شخصت تطلب ظبياً راح مجتازا"

def uni_chop(ucode):
    #Format as 4 Digit Hex (only valid for code points up to U+FFFF)
    out="%04X" % ord(ucode)
    #Return a Tuple - of MSB LSB
    return (out[:2],out[2:])


fake_hex=""
for letter in phrase:
    #Read 1 unicode letter at a time
    tup_hex=uni_chop(letter)
    print("%s %s "%(tup_hex[0],tup_hex[1]))
    #Build up HEX DATA String
    fake_hex=fake_hex+tup_hex[0]+tup_hex[1]

print(str.format("Fake Hex is ... <{}>",fake_hex))

#
#Let's reassemble
#
out_str=""
for loop in range(0,len(fake_hex),4):
    #Take 4 hex digits (one code point) at a time
    chop=fake_hex[loop:loop+4]
    front=chr(int(chop,16))
    print(front)
    out_str=out_str+front

print(str.format("Back again is ... <{}>",out_str))
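
For comparison, the same round trip can be written more compactly with Python's built-in codecs. This is a sketch rather than a replacement for the program above: the helper names to_hex and from_hex are made up here, and it assumes the "fake hex" is intended to be big-endian UTF-16. For the characters in the phrase it gives the same digits as uni_chop, and unlike the 4-digit approach it also survives characters above U+FFFF, which UTF-16 stores as two 16-bit units.

```python
def to_hex(text):
    """Encode text as upper-case hex digits of its UTF-16 big-endian bytes."""
    return text.encode("utf-16-be").hex().upper()

def from_hex(hex_data):
    """Decode a hex string produced by to_hex back into text."""
    return bytes.fromhex(hex_data).decode("utf-16-be")

# First word of the phrase: Heh, Lam, Alef
sample = "\u0647\u0644\u0627"
encoded = to_hex(sample)
decoded = from_hex(encoded)

print(encoded)            # 064706440627
print(decoded == sample)  # True
```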

Running it

This is what the output should be:

06 47 
06 44 
06 27 
00 20 
06 33 
06 43 
06 46 
06 2A 
00 20 
06 28 
06 30 
06 4A 
00 20 
06 36 
06 3A 
06 2B 
06 4D 
00 20 
06 41 
06 42 
06 2F 
00 20 
06 32 
06 39 
06 45 
06 48 
06 27 
00 20 
20 14 
00 20 
06 34 
06 2E 
06 35 
06 2A 
00 20 
06 2A 
06 37 
06 44 
06 28 
00 20 
06 38 
06 28 
06 4A 
06 27 
06 4B 
00 20 
06 31 
06 27 
06 2D 
00 20 
06 45 
06 2C 
06 2A 
06 27 
06 32 
06 27 
Fake Hex is ... <0647064406270020063306430646062A002006280630064A00200636063A062B064D002006410642062F0020063206390645064806270020201400200634062E0635062A0020062A063706440628002006380628064A0627064B002006310627062D00200645062C062A062706320627>
ه
ل
ا

س
ك
ن
ت

ب
ذ
ي

ض
غ
ث
ٍ

ف
ق
د

ز
ع
م
و
ا

—

ش
خ
ص
ت

ت
ط
ل
ب

ظ
ب
ي
ا
ً

ر
ا
ح

م
ج
ت
ا
ز
ا
Back again is ... <هلا سكنت بذي ضغثٍ فقد زعموا — شخصت تطلب ظبياً راح مجتازا>
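
In the per-character output above, the two tanween marks (06 4D and 06 4B) land on lines of their own, because each diacritic is a separate code point and the loop prints one code point per line. A rough way to keep a mark together with its letter is to group by combining class, as sketched below. The function name group_marks is made up for illustration, and this is a simplification, not full grapheme cluster segmentation.

```python
import unicodedata

def group_marks(text):
    """Attach each combining mark to the character that precedes it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # diacritic: glue onto the previous letter
        else:
            clusters.append(ch)     # ordinary character: start a new cluster
    return clusters

# Fourth word of the phrase: Dad, Ghain, Theh + Kasratan
word = "\u0636\u063A\u062B\u064D"
clusters = group_marks(word)
for cluster in clusters:
    print(cluster, " ".join("%04X" % ord(c) for c in cluster))
```

With this grouping the word yields three lines instead of four, the last one carrying both 062B and 064D.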