[Oracle] talk about the character set used in Oracle database, not just random code

Posted by ljzxtww on Mon, 24 Jan 2022 09:12:23 +0100

1, Foreword

Let's look at an interesting case first

The above sql queries two fields a and b, both of which are "Zhang San", and uses the length function to check that the length is 2.
However, when you see the following sql output results, your first reaction is likely to be:

"How is this possible?"




In fact, as like as two peas, you can see that the two "Zhang three" you see is exactly the same.
But why are a and b not equal?
This is because they are composed of different components, which is the character set

2, What is a character set?

Baidu Encyclopedia
In short,

Character is the general name of various characters and symbols, including national characters, punctuation marks, graphic symbols, numbers, etc.
A Character set is a collection of multiple characters

As we all know, the bottom layer of the computer system is composed of binary 1 and 0, and these 1 and 0 are just "on" and "off" of the electrical signal, and there are no characters at all. In fact, the character set can be understood as an indexed dictionary. According to the specified binary data, look up the character at the corresponding position of the dictionary, and then display the word in the selected font from the character set.

For example, the binary data "1011000010100001" (hexadecimal "B0A1") corresponds to the Chinese character "ah" in the GB2312 character set.

Even English letters and numbers must also rely on character sets. For example, binary data "1100001" (hexadecimal "61") represents lowercase letter "a". Of course, in almost all character sets, this binary data represents a.

Therefore, for all our current computer systems, if we want to recognize characters, we must use the binary data of this character to check a specific character set and find the corresponding characters. If we find the wrong character set, what we find may be garbled.

3, Character set of ORACLE Database

During oracle database installation, there will be an option to select the character set of the database, which most DBA s know and needless to say.
However, many people are often confused when they learn about the oracle character set through articles on the Internet:

What is AL32UTF8? Why not utf8?
Simple machine set in environment variable_ CHINA. What is ZHS16GBK? Why not set GBK directly?
Is the character set found in each view of the database client or database?

1. Character set code

The character set code in ORACLE is not equal to the current common character set name. Let's take a look at the official information
Comparison table between ORACLE character set and general character set
In fact, according to the "Description" in this table, the names of these common character sets are too long and irregular to be used as code; There are also some special cases, such as "JA16EUCTILDE" and "JA16EUC", which are actually "EUC 24 bit Japanese". Only the wavy lines are different, they are divided into two character sets.
Therefore, oracle must name these character sets uniformly again, so there are character set codes such as "AL32UTF8" and "ZHS16GBK"

II.NLS_LANG

In the system environment variable (or registry) "NLS_LANG", values like the following are often configured

SIMPLIFIED CHINESE_CHINA.ZHS16GBK
AMERICAN_AMERICA.AL32UTF8

In fact, this environment variable consists of three parts

<Language>_<territory>.<character>

Namely

Language_ Area character set

Only the last section is the true representation of the character set

Therefore, "simple chinese_china.zhs16gbk" means: "simplified Chinese"_ "China region" "16 bit GBK simplified Chinese"

Similarly, "american_american. AL32UTF8" means "American English"_ "U.S. region." "Unicode 12.1 universal character set UTF-8 encoding scheme".

Of course, not all of these three are required in the environment variables. You can even use ". AL32UTF8" to indicate that the default language and region of the database are used, but the client uses the character set you specify; Use "_american." Only specify the client region. The language uses the default language of the region, and the character set uses the character set corresponding to the default language of the region.

---Query the language of Belgium
DECLARE
   retval UTL_I18N.STRING_ARRAY;
   cnt    INTEGER;
BEGIN
   retval   :=  UTL_I18N.GET_LOCAL_LANGUAGES('BELGIUM');
   for i in  0..retval.count-1 loop
   DBMS_OUTPUT.PUT_LINE(retval(i));
   END LOOP;
END;
/
---Query the default character set of French
select UTL_I18N.GET_DEFAULT_CHARSET('French') from dual;

---from ISO Standard language code conversion ORACLE Languages and regions in
select UTL_I18N.MAP_LANGUAGE_FROM_ISO('en_US') ,
       UTL_I18N.MAP_TERRITORY_FROM_ISO('en_US') from dual;

---take ORACLE Language and region conversion to standard ISO Standard language code
select UTL_I18N.MAP_LOCALE_TO_ISO('American','America') from dual;

In the actual user scenario, a region can have more than one language, and a character set can also represent multiple languages. By flexibly configuring this environment variable, you can make the database more compatible with the time zone, language, currency, text, etc. of each region.

However, you may ask:
Don't they all say that the environment variables of the database should be consistent with those of the client?

Yes, yes, it is recommended to be consistent, but it is not absolute. At present, there are many companies in many countries and languages. When using the same database, there must be different needs.

It is divided into three parts

A. Language

First, for the language, let's make a test like this:

  1. When installing the database, the language is "AMERICAN", and then the "NLS_LANG" of the client environment variable is set to "simple machine_china.al32utf8". Open the client software and execute an incorrect sql after connecting to the database
  2. Then change the "NLS_LANG" of the client environment variable to "AMERICAN_CHINA.AL32UTF8", reopen the client software, and execute the same sql after connecting to the database

Yes, the language of the client environment variable is used to do this. It is used to display error reports or prompt messages in different languages on the client. You can even set them to Japanese or French. These system prompt messages are recorded in multiple languages in the database. In addition, you can also convert the corresponding system prompt information into a language not set by the client through the plsql package "UTL_LMS" provided in the database.

DECLARE
   s varchar2(200);
   i  pls_integer;
BEGIN
   i:= utl_lms.get_message(26052, 'rdbms', 'ora', 'french', s);
   dbms_output.put_line('French : '||s);
   i:= utl_lms.get_message(26052, 'rdbms', 'ora', 'AMERICAN', s);
   dbms_output.put_line('American English  is: '||s);
   i:= utl_lms.get_message(26052, 'rdbms', 'ora', 'SIMPLIFIED CHINESE', s);
   dbms_output.put_line('Simplified Chinese is: '||s);
END;
/

Output is:

French: type% D non pris en charge pour l'expression SQL sur la Colonne% s
American English: unsupported type% d for SQL expression on column% s
English: SQL expression type% d at column% s is not supported.

Of course, we also have to see whether the current character set supports the current language. Please refer to the official documents
Comparison table of ORACLE language and supported character sets

B. Region

For regions, in addition to the function of taking the default language of the region when no language is set, it also affects the time zone, currency and date format. This has little to do with this article, so it is omitted.

C. Character set

Next is the last parameter, the character set

Through the articles on the network and my own exploration in the database, I get the following sql, which can query language, region and character set

select * from nls_database_parameters where parameter in ('NLS_TERRITORY','NLS_LANGUAGE','NLS_CHARACTERSET');
select * from nls_instance_parameters where parameter in ('NLS_TERRITORY','NLS_LANGUAGE','NLS_CHARACTERSET');
select * from nls_session_parameters where parameter in ('NLS_TERRITORY','NLS_LANGUAGE','NLS_CHARACTERSET');
select * from v$parameter;
select * from v$nls_parameters;
select * from v$system_parameter where name like 'nls%';
select name,substr(value$, 1, 64) from x$props where name like 'NLS%';
select userenv('language') from dual;
select sys_context('USERENV','LANGUAGE') from dual;

However, there is a problem. These sql can let us know the language and region used in the database and the client respectively, but for the character set, it is either not available or displayed as the database character set, and the character set set set in the client environment variable cannot be found through any sql!
This is also described in the official Oracle documentation
FAQ-NLS-LANG

SELECT USERENV ('language') FROM DUAL; gives the session's _ but the DATABASE character set not the client, so the value returned is not the client's complete NLS_LANG setting!

SELECT USERENV (‘language’) FROM DUAL; Get the language of the current session_ The region and database character set are not the client character set, so the returned value is not the complete client NLS_LANG setting!

Let's not discuss why ORACLE does not provide the query method of client character sets in the database. Let's focus on what problems will occur if the character sets are different.

I created a table with two fields: row number and value. Then, four text files are written, all of which insert into a record into this table. The line numbers are 1, 2, 3 and 4 respectively, and the values are "Zhang San". The test scenarios and results are summarized as follows

insert data

Line numberDatabase languageDatabase character setClient languageClient character setInsert the file code of sqlHexadecimal data in the databaseVisible characters in plsql developer
1AMERICANAL32UTF8SIMPLIFIED CHINESEZHS16GBKUTF-8E5AFAEE78AB1E7AC81For three years
2AMERICANAL32UTF8SIMPLIFIED CHINESEZHS16GBKANSIE5BCA0E4B889Zhang San
3AMERICANAL32UTF8SIMPLIFIED CHINESEAL32UTF8UTF-8E5BCA0E4B889Zhang San
4AMERICANAL32UTF8SIMPLIFIED CHINESEAL32UTF8ANSID5C5C8FDZhang San

(actually, when unicode is enabled in the plsql developer option, the client character set does not affect its data display)

Query data with SQL plus

Client character setLine numberVisible characters (chcp 936)Visible characters (chcp 65001)
AL32UTF81IgnitionFor three years
AL32UTF82For three yearsZhang San
AL32UTF83For three yearsZhang San
AL32UTF84Zhang SanZhang San
Client character setLine numberVisible characters (chcp 936)Visible characters (chcp 65001)
ZHS16GBK1For three yearsZhang San
ZHS16GBK2Zhang San
ZHS16GBK3Zhang San
ZHS16GBK4??

According to the above test, the following points can be obtained

  1. When inserting data, the character set of the data file shall be consistent with the client character set to avoid garbled code (the corresponding text needs to exist in both character sets), even if it is inconsistent with the database character set (for example, lines 2 and 3 are correct and consistent)
  2. When inserting data, when the client character set is inconsistent with the data file character set, it may cause garbled code (line 1) or no garbled code (line 4), but whether the garbled code is just "display effect", its value is wrong in a sense
  3. When querying, the client character set is consistent with the database character set, which can improve the probability of correctly displaying text
  4. When querying, the code page (chcp) is consistent with the client character set, which can improve the probability of displaying text correctly
  5. Code page (chcp) and data file coding are one thing, that is, only three variables are involved in this test, database character set, client character set and data coding (affected by code page or file coding)
  6. To minimize the occurrence of "garbled code", the database character set, client character set and data coding should be consistent
  7. If the three cannot be consistent, for example, when the database character set is uncontrollable, try to keep the client character set and data coding consistent
  8. If the database character set is inconsistent with the client character set, for example, if the database character set is ZHS16GBK, but the client character set is AL32UTF8, some words may not be converted when the client inserts data (UTF-8 has more words than GBK), but the reverse problem is not great, as long as the character sets of all clients accessing the same data are unified

3, Restoring garbled code in Oracle

It is necessary to ensure the consistency of the character set in advance to avoid garbled code. However, if the consistency of the character set is not guaranteed in advance, resulting in garbled code, how to save this batch of garbled data afterwards?

Some garbled codes can be restored in the following ways

select utl_i18n.raw_to_char(UTL_I18N.string_to_raw('For three years', 'ZHS16GBK'),
                            'AL32UTF8')
  from dual;


However, the above only takes ZHS16GBK and AL32UTF8 as examples. There are many actual scenes, such as Japanese, Korean, French, etc. it is difficult to judge which character set of a garbled code turns into which character set.

Therefore, I developed a plsql package to perform exhaustive conversion and output a table. Users can visually check how the garbled code is generated (in order to reduce the number of data output, only for the garbled code participated by one party with the Asian language character set)

create or replace package charset_util_pkg is
  type convert_any_charset_type is RECORD(
    original_str varchar2(4000),
    original_HEX RAW(32767),
    to_str       varchar2(4000),
    to_hex       raw(32767),
    from_charset varchar2(50),
    to_charset   varchar2(50),
    sql_text     varchar2(200));
  type convert_any_charset_table is table of convert_any_charset_type;

  function convert_any_charset(original_str varchar2)
    return convert_any_charset_table
    PIPELINED;
end charset_util_pkg;
/
create or replace package body charset_util_pkg is
  charsetlist ora_mining_varchar2_nt := ora_mining_varchar2_nt('JA16EUC',
                                                               --'JA16EUCTILDE',
                                                               'JA16SJIS',
                                                               --'JA16SJISTILDE',
                                                               'KO16MSWIN949',
                                                               'TH8TISASCII',
                                                               'VN8MSWIN1258',
                                                               'ZHS16GBK',
                                                               'ZHT16HKSCS',
                                                               --'ZHT16MSWIN950',
                                                               'ZHT32EUC',
                                                               'BLT8ISO8859P13',
                                                               'BLT8MSWIN1257',
                                                               'CL8ISO8859P5',
                                                               'CL8MSWIN1251',
                                                               'EE8ISO8859P2',
                                                               'EL8ISO8859P7',
                                                               'EL8MSWIN1253',
                                                               'EE8MSWIN1250',
                                                               'NE8ISO8859P10',
                                                               'NEE8ISO8859P4',
                                                               'WE8ISO8859P15',
                                                               'WE8MSWIN1252',
                                                               'AR8ISO8859P6',
                                                               'AR8MSWIN1256',
                                                               'IW8ISO8859P8',
                                                               'IW8MSWIN1255',
                                                               'TR8MSWIN1254',
                                                               'WE8ISO8859P9',
                                                               'AL32UTF8');

  ASIA_charsetlist ora_mining_varchar2_nt := ora_mining_varchar2_nt('JA16EUC',
                                                                   -- 'JA16EUCTILDE',
                                                                    'JA16SJIS',
                                                                   -- 'JA16SJISTILDE',
                                                                    'KO16MSWIN949',
                                                                    'TH8TISASCII',
                                                                    'VN8MSWIN1258',
                                                                    'ZHS16GBK',
                                                                    'ZHT16HKSCS',
                                                                    --'ZHT16MSWIN950',
                                                                    'ZHT32EUC');

  function convert_any_charset(original_str varchar2)
    return convert_any_charset_table
    PIPELINED is
    to_str varchar2(4000);
  begin
    for rec in (select f.column_value from_charset,
                       t.column_value to_charset
                  from table(charsetlist) f, table(charsetlist) t
                 where f.column_value <> t.column_value
                   AND (F.column_value IN
                       (SELECT column_value FROM TABLE(ASIA_charsetlist)) OR
                       T.column_value IN
                       (SELECT column_value FROM TABLE(ASIA_charsetlist)))) loop
    
      begin
        to_str := utl_i18n.raw_to_char(UTL_I18N.string_to_raw(original_str,
                                                              rec.from_charset),
                                       rec.to_charset);
      exception
        when others then
          begin
            to_str := utl_i18n.raw_to_char(UTL_I18N.string_to_raw(original_str ||
                                                                  chr(0),
                                                                  rec.from_charset),
                                           rec.to_charset);
          exception
            when others then
              to_str := 'convert error!';
          end;
      end;
      if to_str in (rpad(chr(63), length(original_str), chr(63)),
                    rpad(UTL_RAW.cast_to_varchar2('C2BF'),
                         length(original_str),
                         UTL_RAW.cast_to_varchar2('C2BF'))) then
        CONTINUE;
      end if;
      PIPE ROW(convert_any_charset_type(original_str,
                                        UTL_RAW.cast_to_raw(original_str),
                                        to_str,
                                        UTL_RAW.cast_to_raw(to_str),
                                        rec.to_charset,
                                        rec.from_charset,
                                        'utl_i18n.raw_to_char(UTL_I18N.string_to_raw(%s,''' ||
                                        rec.from_charset || '''),''' ||
                                        rec.to_charset || ''')'));
    end loop;
  end;

end charset_util_pkg;
/
select A.* from charset_util_pkg.convert_any_charset('For three years') A;

Or limit the current database character set to reduce entries

select A.*
  from charset_util_pkg.convert_any_charset('For three years') A
 where from_charset =
       (select value
          from nls_database_parameters
         where parameter = 'NLS_CHARACTERSET');

Or use columns as incoming parameters

select A.*
  from test_20220122_1 b,charset_util_pkg.convert_any_charset(b.b) A
 where b.b='For three years' and  from_charset =
       (select value
          from nls_database_parameters
         where parameter = 'NLS_CHARACTERSET');

In addition, you can copy its corresponding SQL from the results_ Textto get the correct conversion code quickly.

summary

Through the above research, it can be seen that if the developed application project is garbled, the application developer should take the primary responsibility, because the client character set and data code are specified by the developer. In a database with only application input data, the garbled code has little relationship with the database character set (unless a partial character set is specified during database installation).

If you don't pay attention to the unity of character sets during the development of application projects, various strange problems will appear, such as finding data without conditions, copying that value as a condition and not finding it; And all kinds of strange garbled code.

Back to the case at the beginning, in fact, the values of a and b are the values of lines 3 and 4 in the above test. Although the display is the same, in fact, one is UTF8 coding and the other is GBK coding.

Test data and sql attached at the beginning:

create table test_20220122_1 as          
select utl_raw.cast_to_varchar2('E5BCA0E4B889') a,
       utl_raw.cast_to_varchar2('D5C5C8FD') b 
from dual;

select a,
       b,
       length(a),
       length(b),
       lengthb(a),
       lengthb(b),
       UTL_RAW.cast_to_raw(a),
       UTL_RAW.cast_to_raw(B)
  from test_20220122_1;

Note: the above is only for ORACLE database, and the situation of other databases will be different

Topics: Database Oracle