The extended internal code specification of Chinese characters—GBK
The extended internal code specification of Chinese characters, GBK, addresses the bottlenecks in Chinese information
exchange: an insufficient number of Chinese characters, the need for simplified and traditional Chinese to coexist on the same
plane, and the need to simplify conversion between code systems. It is a step toward the eventual international unified
double-byte character set standard ISO 10646.1, taken while keeping compatibility with existing application software.
1.Principles of the extended Chinese character internal code specification
- Compatible with the internal code system of the national standard GB2312-80, "Chinese Character Coding Set for Information Exchange: Base Set".
- Supports, at the character-repertoire level, all CJK Chinese characters in ISO 10646.1, i.e. the national standard GB 13000.1 ("CJK Unified Chinese Character Coding Set").
- The non-Chinese character symbols cover most of the common non-Chinese character symbols in "BIG5".
2.Name and abbreviation of the specification
Chinese name: 汉字内码扩展规范
English name: Chinese Internal Code Specification
Abbreviation: GBK (K is the first letter of "kuozhan", the pinyin spelling of the Chinese word for "extended")
3.Contents of the specification
As the codepage of a non-UCS (ISO 10646) system, it applies to the processing, exchange, storage, display, input and output of
Chinese information.
Character repertoire:
- All of the Chinese characters and non-Chinese character symbols in GB2312-80
- The other CJK Chinese characters in GB 13000.1.
The above items total 20902 GB-ized Chinese characters.
- 52 Chinese characters from the "General Table for Simplified Characters" that are not included in GB 13000.1. GBK can
therefore cover not only all 7000 Chinese characters of the "Universal Character Table for Contemporary Chinese" but also
all of the simplified characters and their corresponding traditional characters in the "General Table for Simplified
Characters".
- 28 radicals and important components from the "KangXi Dictionary" and "Ci Hai" that are not included in GB 13000.1.
- 13 Chinese character structure symbols.
- 139 graphic symbols in "BIG5" that are not included in GB2312-80 but are present in ISO 10646.1.
- 30 phonetic letters with tone marks, plus ɑ and ɡ, formally included according to the printed edition of GB 12345-90.
- The Chinese character "〇" (GB 13000.1 code 0x3007, "zero").
- The 19 vertical-writing punctuation marks encoded in GB 12345-90; those that have no encoding in UCS are not included for now.
- 21 Chinese characters selected from the CJK compatibility zone of ISO 10646.1/GB 13000.1, so that some BIG5 (TCA-CNS 11643) files, JIS files and IBM files lose no information in two-way conversion.
- 31 IBM OS/2 special symbols; all of those already encoded in ISO 10646.1/GB 13000.1 are included and unified with their encoded counterparts.
Chinese character order
- The Chinese characters of GB2312-80 keep their original Level I and Level II arrangement: by phonetic alphabet (pinyin) and by radical/stroke respectively.
- The other CJK Chinese characters of GB 13000.1 are arranged in UCS code order.
- The 80 supplemental Chinese characters and radicals/components, which belong to neither of the two repertoires above, are arranged separately by their page number and character position in the KangXi Dictionary.
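The UCS-order rule for the additional CJK characters can be spot-checked with a short Python sketch. It relies on Python's built-in `gbk` codec, which is an implementation assumed here to follow the specification's mapping in this zone, and it inspects only one row (lead byte 0x81, inside the 8140-A0FE zone). The helper name `decode_row` is made up for this illustration.

```python
def decode_row(lead: int) -> list:
    """Return the UCS code points of one GBK row.

    Trail bytes run from 0x40 to 0xFE, skipping 0x7F, which GBK never uses.
    """
    points = []
    for trail in range(0x40, 0xFF):
        if trail == 0x7F:
            continue
        ch = bytes([lead, trail]).decode("gbk")
        points.append(ord(ch))
    return points


# Row 0x81 is the first row of the 8140-A0FE expanded Chinese character zone.
row_81 = decode_row(0x81)

# In UCS code order, every code point is larger than the one before it.
is_ucs_ordered = all(a < b for a, b in zip(row_81, row_81[1:]))

print(f"first: U+{row_81[0]:04X}, count: {len(row_81)}, ordered: {is_ucs_ordered}")
```

The same check can be run over other lead bytes of the expanded zones; rows of the GB2312-80 zone are not expected to pass, since they are ordered by pinyin or by radical/stroke instead.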
Code position allocation
The rectangular zone 8140-FEFE is adopted as a whole, with the xx7F column removed, for a total of 23940 code positions.
- Chinese character zone: 21008 code positions. GB2312-80 Chinese character zone B0A1-F7FE: 6768 code positions, 6763 Chinese characters. GB 13000.1 expanded Chinese character zone, rectangle 8140-A0FE excluding xx7F: 6080 code positions; rectangle AA40-FEA0 excluding xx7F: 8160 code positions, of which FD9C-FE4F holds the 21 CJK compatibility Chinese characters and FE50-FEA0 holds the 80 supplemental Chinese characters/radicals/components.
- Graphic symbol zone: 1038 code positions. GB2312-80 non-Chinese character zone A1A1-A9FE: 846 code positions; besides the characters of the original standard, it also holds 10 lowercase Roman numerals added at A2A1-A2AA, the 30 phonetic letters with tone marks plus ɑ and ɡ at A8A1-A8C0, and the 19 vertical-writing symbols at A6E0-A6F5. GB 13000.1 expanded non-Chinese character zone A840-A9A0, excluding xx7F: 192 code positions, holding the BIG5 non-Chinese characters, the structure symbols and "〇".
- Custom zone: 1894 code positions. Rectangle AAA1-AFFE: 564 code positions; rectangle F8A1-FEFE: 658 code positions; rectangle A140-A7A0, excluding xx7F: 672 code positions.
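The zone arithmetic above can be reproduced with a small Python sketch. The rectangles are taken directly from the list; the names `ZONES`, `count_positions` and `classify`, and the zone labels "hanzi", "symbol" and "custom", are illustrative choices and not part of the specification.

```python
# Each rectangle is (lead_lo, lead_hi, trail_lo, trail_hi), bounds inclusive.
ZONES = {
    "hanzi": [
        (0xB0, 0xF7, 0xA1, 0xFE),  # GB2312-80 Chinese character zone
        (0x81, 0xA0, 0x40, 0xFE),  # expanded zone 8140-A0FE
        (0xAA, 0xFE, 0x40, 0xA0),  # expanded zone AA40-FEA0
    ],
    "symbol": [
        (0xA1, 0xA9, 0xA1, 0xFE),  # GB2312-80 non-Chinese character zone
        (0xA8, 0xA9, 0x40, 0xA0),  # expanded non-Chinese character zone
    ],
    "custom": [
        (0xAA, 0xAF, 0xA1, 0xFE),
        (0xF8, 0xFE, 0xA1, 0xFE),
        (0xA1, 0xA7, 0x40, 0xA0),
    ],
}


def count_positions(lead_lo, lead_hi, trail_lo, trail_hi):
    """Count the code positions in one rectangle, leaving out the xx7F column."""
    leads = lead_hi - lead_lo + 1
    trails = sum(1 for t in range(trail_lo, trail_hi + 1) if t != 0x7F)
    return leads * trails


def classify(code):
    """Return the zone label of a two-byte code, or None if it is not a GBK position."""
    lead, trail = code >> 8, code & 0xFF
    if trail == 0x7F:
        return None
    for name, rects in ZONES.items():
        for lead_lo, lead_hi, trail_lo, trail_hi in rects:
            if lead_lo <= lead <= lead_hi and trail_lo <= trail <= trail_hi:
                return name
    return None


totals = {name: sum(count_positions(*r) for r in rects) for name, rects in ZONES.items()}
grand_total = sum(totals.values())

print(totals, grand_total)
print(classify(0xBABA), classify(0xA996), classify(0xAAA1), classify(0x817F))
```

The three zone totals add up to the size of the whole 8140-FEFE rectangle, so the rectangles partition it without gaps or overlaps.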
Correspondence between GBK and GB 13000.1
- All of the characters in the Chinese character zone and the graphic symbol zone correspond one-to-one with characters encoded in GB 13000.1.
- The 52 supplemental Chinese characters, 28 radicals/components and 13 structure symbols are temporarily mapped to the private zone of GB 13000.1 (Private Use Area, E000-F8FF). If these characters are formally included in ISO 10646/GB 13000 in the future, this specification will be revised accordingly.
- The phonetic letters with tone marks correspond to the Latin characters encoded in the A-zone of GB 13000.1. For the two letters that have no counterpart in GB 13000.1, code positions are still to be requested from SC2/WG2.
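The one-to-one correspondence can be illustrated with Python's built-in `gbk` codec. This is a sketch only: the codec is an implementation built on later mapping tables, so its treatment of the temporarily mapped private-zone characters may differ from the original specification. The helper names `gbk_to_ucs` and `ucs_to_gbk` and the sample codes are chosen for this illustration.

```python
def gbk_to_ucs(code: int) -> int:
    """Decode one two-byte GBK code and return its UCS code point."""
    raw = code.to_bytes(2, "big")
    return ord(raw.decode("gbk"))


def ucs_to_gbk(code_point: int) -> int:
    """Encode one UCS code point and return its GBK code as an integer."""
    return int.from_bytes(chr(code_point).encode("gbk"), "big")


# One GB2312-80 Chinese character and one phonetic letter with a tone mark.
for code in (0xBABA, 0xA8A1):
    cp = gbk_to_ucs(code)
    print(f"GBK {code:04X} -> U+{cp:04X} {chr(cp)}")

# The character "zero" (UCS 0x3007) sits in the expanded non-Chinese character zone.
zero_code = ucs_to_gbk(0x3007)
print(f"U+3007 -> GBK {zero_code:04X}")
```

Because the correspondence is one-to-one, converting a code to UCS and back returns the original code.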
GBK font
- GBK fonts must correspond to ISO 10646.1/GB 13000.1.
- Within the overall framework of the CJK Chinese character unification rules, the (GB-ized) Chinese character glyph forms are selected after "glyph normalization without duplicate codes".