Example: LC_COLLATE Locale Category Definition

The following lines are a sample definition of an LC_COLLATE locale category, as they might appear in a locale source file.

Each group of category source lines is followed by its explanation. The definition is followed by example string comparisons in the resulting, deliberately artificial, locale.

LC_COLLATE
# ARTIFICIAL COLLATE CATEGORY
 
# collating elements
collating-element <ch> from "<c><h>"
collating-element <Ch> from "<C><h>"
collating-element <eszet> from "<s><z>"
Collating elements:
  • the character <c> followed by <h> collates as one entity named <ch>
  • the character <C> followed by <h> collates as one entity named <Ch>
  • the character <s> followed by <z> collates as one entity named <eszet>
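The effect of these collating-element statements can be sketched as a longest-match split of the input string. This is only an illustration, not how any particular C library implements it; the table name COLLATING_ELEMENTS and the function split_collating_elements are hypothetical.

```python
# Hypothetical table mirroring the three collating-element statements above.
COLLATING_ELEMENTS = {"ch": "<ch>", "Ch": "<Ch>", "sz": "<eszet>"}


def split_collating_elements(text):
    """Split text into collating elements, preferring the longest match."""
    longest = max(len(sequence) for sequence in COLLATING_ELEMENTS)
    elements = []
    i = 0
    while i < len(text):
        # Try multi-character sequences first, longest to shortest.
        for size in range(longest, 1, -1):
            candidate = text[i:i + size]
            if candidate in COLLATING_ELEMENTS:
                elements.append(COLLATING_ELEMENTS[candidate])
                i += size
                break
        else:
            # No multi-character element starts here: keep the single character.
            elements.append(text[i])
            i += 1
    return elements


print(split_collating_elements("aAch"))  # ['a', 'A', '<ch>']
print(split_collating_elements("AaCh"))  # ['A', 'a', '<Ch>']
print(split_collating_elements("1sz"))   # ['1', '<eszet>']
```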
# collating symbols for relative order definition
collating-symbol <LOW>
collating-symbol <UPPER-CASE>
collating-symbol <LOWER-CASE>
collating-symbol <NONE>
The collating symbols <LOW>, <UPPER-CASE>, <LOWER-CASE> and <NONE> are defined for use in the relative order definition.
order_start forward;backward;forward
<NONE>
Up to three comparison passes are defined:
  • the first pass starts from the beginning of the strings,
  • the second pass starts from the end of the strings, and
  • the third pass starts from the beginning of the strings
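The pass directions can be illustrated with a minimal sketch. It assumes each string has already been reduced to one list of numeric weights per level; the numbers and the function names compare_pass and compare are made up for illustration and do not come from the locale definition.

```python
def compare_pass(weights1, weights2, direction):
    """Compare two weight lists in one pass; return -1, 0, or 1."""
    if direction == "backward":
        # A backward pass starts from the end of the strings.
        weights1, weights2 = weights1[::-1], weights2[::-1]
    return (weights1 > weights2) - (weights1 < weights2)


def compare(levels1, levels2, directions=("forward", "backward", "forward")):
    """Run the passes in order; the first pass that differs decides."""
    for weights1, weights2, direction in zip(levels1, levels2, directions):
        result = compare_pass(weights1, weights2, direction)
        if result:
            return result
    return 0


# Made-up weights: equal at the first level, different at the second.
s1 = ([5, 5], [1, 0], [0, 0])
s2 = ([5, 5], [0, 1], [0, 0])

print(compare(s1, s2))                                     # -1
print(compare(s1, s2, ("forward", "forward", "forward")))  # 1
```

With the second pass running backward, s1 compares lower; if that pass ran forward instead, the result would flip.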
<LOW>
<UPPER-CASE>
<LOWER-CASE>
The collating weights are defined such that:
  • <LOW> collates before <UPPER-CASE>,
  • <UPPER-CASE> collates before <LOWER-CASE>,
  • <LOWER-CASE> collates before <NONE>
UNDEFINED IGNORE;IGNORE;IGNORE
All characters for which collation is not specified here are ordered after <NONE> and before <space>, in ascending order according to their encoded values. Because all three of their weights are IGNORE, they are skipped in every comparison pass.
<space>
...
<quotation-mark>
All characters whose encoded value is larger than that of <space> and lower than that of <quotation-mark> in the current encoded character set collate in ascending order according to their encoded values.
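As a small illustration of the ellipsis, the sketch below lists the characters that fall strictly between two endpoints. It assumes an ASCII-compatible encoding, where Python code points match the encoded values; the function name expand_ellipsis is hypothetical.

```python
def expand_ellipsis(first, last):
    """Characters strictly between first and last, in ascending encoded value."""
    return [chr(code) for code in range(ord(first) + 1, ord(last))]


# In ASCII, <space> is 0x20 and <quotation-mark> is 0x22,
# so only the exclamation mark (0x21) lies between them.
print(expand_ellipsis(" ", '"'))  # ['!']
```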
<a> <a>;<NONE>;<LOWER-CASE>
<a> has a:
  • primary weight of <a>
  • secondary weight of <NONE>
  • tertiary weight of <LOWER-CASE>
<a-acute> <a>;<a-acute>;<LOWER-CASE>
<a-acute> has a:
  • primary weight of <a>
  • secondary weight of <a-acute> itself
  • tertiary weight of <LOWER-CASE>
<a-grave> <a>;<a-grave>;<LOWER-CASE>
<a-grave> has a:
  • primary weight of <a>
  • secondary weight of <a-grave> itself
  • tertiary weight of <LOWER-CASE>
<A> <a>;<NONE>;<UPPER-CASE>
<A> has a:
  • primary weight of <a>
  • secondary weight of <NONE>
  • tertiary weight of <UPPER-CASE>
<A-acute> <a>;<a-acute>;<UPPER-CASE>
<A-acute> has a:
  • primary weight of <a>
  • secondary weight of <a-acute>
  • tertiary weight of <UPPER-CASE>
<A-grave> <a>;<a-grave>;<UPPER-CASE>
<A-grave> has a:
  • primary weight of <a>
  • secondary weight of <a-grave>
  • tertiary weight of <UPPER-CASE>
<ch> <ch>;<NONE>;<LOWER-CASE>
<ch> has a:
  • primary weight of <ch>
  • secondary weight of <NONE>
  • tertiary weight of <LOWER-CASE>
<Ch> <ch>;<NONE>;<UPPER-CASE>
<Ch> has a:
  • primary weight of <ch>
  • secondary weight of <NONE>
  • tertiary weight of <UPPER-CASE>
<s> <s>;<s>;<LOWER-CASE>
<s> has a:
  • primary weight of <s> itself
  • secondary weight of <s>
  • tertiary weight of <LOWER-CASE>
<eszet> "<s><s>";"<eszet><s>";<LOWER-CASE>
<eszet> has a:
  • primary weight determined by replacing each occurrence of <eszet> with the sequence of two <s>'s and using the weight of <s>,
  • secondary weight determined by replacing each occurrence of <eszet> with the sequence of <eszet> and <s> and using their weights,
  • tertiary weight that is the relative position of <LOWER-CASE>.
<z> <z>;<NONE>;<LOWER-CASE>
<z> has a:
  • primary weight of <z> itself
  • secondary weight of <NONE>
  • tertiary weight of <LOWER-CASE>
order_end  
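The weight substitution described by these entries can be sketched as a lookup table. The table below copies a subset of the entries above; the name WEIGHTS and the function level_weights are hypothetical, and as a simplification every element missing from the table is treated as UNDEFINED, so its weights are ignored (the <space> to <quotation-mark> range is left out).

```python
# (primary, secondary, tertiary) weights for a subset of the entries above.
# Each weight is a list because <eszet> expands to two weights at some levels.
WEIGHTS = {
    "a": (["<a>"], ["<NONE>"], ["<LOWER-CASE>"]),
    "á": (["<a>"], ["<a-acute>"], ["<LOWER-CASE>"]),
    "A": (["<a>"], ["<NONE>"], ["<UPPER-CASE>"]),
    "<ch>": (["<ch>"], ["<NONE>"], ["<LOWER-CASE>"]),
    "<Ch>": (["<ch>"], ["<NONE>"], ["<UPPER-CASE>"]),
    "s": (["<s>"], ["<s>"], ["<LOWER-CASE>"]),
    "<eszet>": (["<s>", "<s>"], ["<eszet>", "<s>"], ["<LOWER-CASE>"]),
    "z": (["<z>"], ["<NONE>"], ["<LOWER-CASE>"]),
}


def level_weights(elements, level):
    """Weights of already-split collating elements at level 0, 1, or 2."""
    result = []
    for element in elements:
        weights = WEIGHTS.get(element)
        # Missing entries fall under UNDEFINED IGNORE;IGNORE;IGNORE: skip them.
        if weights is not None:
            result.extend(weights[level])
    return result


elements = ["á", "1", "<eszet>"]
print(level_weights(elements, 0))  # ['<a>', '<s>', '<s>']
print(level_weights(elements, 1))  # ['<a-acute>', '<eszet>', '<s>']
print(level_weights(elements, 2))  # ['<LOWER-CASE>', '<LOWER-CASE>']
```

The digit contributes nothing at any level, and <eszet> contributes two weights at the primary and secondary levels but only one at the tertiary level.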

Comparison of Strings

  • Compare "aAch" and "AaCh"
  • Compare "á1sz" and "à2ss"

    Compare "aAch" and "AaCh"
    In a locale built from the above LC_COLLATE definition, the comparison of the strings s1="aAch" and s2="AaCh" is processed as follows:

    1. Identify the collating elements: s1=> "aA<ch>", and s2=> "Aa<Ch>"
    2. First pass:
      1. Substitute the elements of the strings with their primary weights:
        s1=> "<a><a><ch>", s2=> "<a><a><ch>"
      2. Compare the two strings starting with the first element.
        They are equal.
    3. Second pass:
      1. Substitute the elements of the strings with their secondary weights:
        s1=> "<NONE><NONE><NONE>", s2=>"<NONE><NONE><NONE>"
      2. Compare the two strings from the last element to the first.
        They are equal.
    4. Third pass:
      1. Substitute the elements of the strings with their tertiary weights:
        s1=> "<LOWER-CASE><UPPER-CASE><LOWER-CASE>"
        s2=> "<UPPER-CASE><LOWER-CASE><UPPER-CASE>"
      2. Compare the two strings starting from the beginning of the strings:
        s2 compares lower than s1, because <UPPER-CASE> is before <LOWER-CASE>.

    Compare "á1sz" and "à2ss"
    In a locale built from the above LC_COLLATE definition, the comparison of the strings s1="á1sz" and s2="à2ss" is processed as follows:

    1. Identify the collating elements: s1=> "á1<eszet>", and s2=> "à2ss"
    2. First pass:
      1. Substitute the elements of the strings with their primary weights:
        s1=> "<a><s><s>", s2=> "<a><s><s>"
      2. Compare the two strings starting with the first element.
        They are equal.
    3. Second pass:
      1. Substitute the elements of the strings with their secondary weights:
        s1=> "<a-acute><eszet><s>", s2=>"<a-grave><s><s>"
      2. Compare the two strings from the last element to the first.
        s2 compares lower than s1, because <s> is before <eszet>.
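Both walkthroughs can be checked with a sort-key sketch, similar in spirit to what strxfrm() produces. The weight lists are copied from the steps above; the tertiary weights of the second pair are derived from the definition, and that pair is assumed to begin with a-acute and a-grave, as its secondary weights imply. The names ORDER, RANK, and sort_key are hypothetical, and the relative order of the symbols is an assumption based on the explanations above (the position of <NONE> does not affect these results).

```python
# Assumed relative order of the symbols used below (earlier means lower).
ORDER = ["<UPPER-CASE>", "<LOWER-CASE>", "<NONE>",
         "<a>", "<a-acute>", "<a-grave>", "<ch>", "<s>", "<eszet>"]
RANK = {symbol: position for position, symbol in enumerate(ORDER)}
DIRECTIONS = ("forward", "backward", "forward")


def sort_key(levels):
    """Build one comparable key: a list of rank lists, one per pass."""
    key = []
    for weights, direction in zip(levels, DIRECTIONS):
        ranks = [RANK[weight] for weight in weights]
        if direction == "backward":
            ranks.reverse()  # the backward pass reads from the end
        key.append(ranks)
    return key


LOWER, UPPER = "<LOWER-CASE>", "<UPPER-CASE>"
LEVELS = {
    "aAch": (["<a>", "<a>", "<ch>"], ["<NONE>"] * 3, [LOWER, UPPER, LOWER]),
    "AaCh": (["<a>", "<a>", "<ch>"], ["<NONE>"] * 3, [UPPER, LOWER, UPPER]),
    "á1sz": (["<a>", "<s>", "<s>"], ["<a-acute>", "<eszet>", "<s>"], [LOWER, LOWER]),
    "à2ss": (["<a>", "<s>", "<s>"], ["<a-grave>", "<s>", "<s>"], [LOWER] * 3),
}

print(sort_key(LEVELS["AaCh"]) < sort_key(LEVELS["aAch"]))  # True
print(sort_key(LEVELS["à2ss"]) < sort_key(LEVELS["á1sz"]))  # True
print(sorted(LEVELS, key=lambda s: sort_key(LEVELS[s])))
# ['AaCh', 'aAch', 'à2ss', 'á1sz']
```

Reversing the rank list for the backward pass lets an ordinary left-to-right comparison of the keys give the same result as the three passes.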


