GitHub - noahdiewald/icu4r: ICU4R - ICU 4.8 bindings for Ruby with collation tailoring. Runs under Ruby 1.8.7 and 1.9.2

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
docs		docs
ext/icu4r/c		ext/icu4r/c
lib		lib
samples		samples
test		test
tools		tools
.gitignore		.gitignore
Gemfile		Gemfile
MIT-LICENSE		MIT-LICENSE
README		README
Rakefile		Rakefile
icu4r.gemspec		icu4r.gemspec

Repository files navigation

== ICU4R - ICU Unicode bindings for Ruby

=== A Note On This Fork

WARNING: This code is not properly tested. It works for me but is likely 
to have serious problems.

NOTE: The UCalendar stuff looks broken and the tests are commented
out. I don't use this stuff so I haven't looked at it yet.

Changes I've made:

* Repackaged the gem
* Did some minor tweaking to get it running under Ruby 1.9.2
* Fixed most tests that were already written and added a few more
* Add tailoring rule support to UCollator

The biggest change is that it is now possible to specify tailoring rules
when initializing a collator. Below is an example abreviated rule for a
Native American language that does not have a locale defined.
  
    c = UCollator.new("&9 < a, A < \304\201, a\314\204, \304\200, A\314\204 < c, C < ae, AE < a\315\236e, A\315\236E".u)

=== Original README content

ICU4R is an attempt to provide better Unicode support for Ruby,
where it lacks for a long time.

Current code is mostly rewritten  string.c from Ruby 1.8.3.

ICU4R is Ruby C-extension binding for ICU library[1] 
and provides following classes and functionality:

* UString:
    - String-like class with internal UTF16 storage;
    - UCA rules for UString comparisons (<=>, casecmp);
    - encoding(codepage) conversion;
    - Unicode normalization;
    - transliteration, also rule-based;

    Bunch of locale-sensitive functions:
    - upcase/downcase;
    - string collation;
    - string search;
    - iterators over text line/word/char/sentence breaks;
    - message formatting (number/currency/string/time);
    - date and number parsing.

* URegexp - unicode regular expressions.

* UResourceBundle - access to resource bundles, including ICU locale data.

* UCalendar - date manipulation and timezone info.

* UConverter - codepage conversions API

* UCollator - locale-sensitive string comparison

== Install and usage

   > ruby extconf.rb
   > make && make check
   > make install

Now, in your scripts just require 'icu4r'.

To create RDoc, run 
   > sh tools/doc.sh

== Requirements

To build and use ICU4R you will need GCC and ICU v3.4 libraries[2].

== Differences from Ruby String and Regexp classes

=== UString vs String

1. UString substring/index  methods use UTF16 codeunit indexes, not code points.

2. UString supports most methods from String class. Missing methods are:
        capitalize, capitalize!, swapcase, swapcase!
        %, center, ljust, rjust
        chomp, chomp!, chop, chop!
        count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s!
        crypt, intern, sum, unpack
        dump, each_byte, each_line
        hex, oct, to_i, to_sym
        reverse, reverse!
        succ, succ!, next, next!, upto
        
3. Instead of String#% method, UString#format is provided. See FORMATTING for short reference.

4. UStrings can be created via String.to_u(encoding='utf8') or global u(str,[encoding='utf8'])
   calls. Note that +encoding+ parameter must be value of String class. 

5. There's difference between character grapheme, codepoint and codeunit. See UNICODE reports for
   gory details, but in short: locale dependent notion of character can be presented using 
   more than one codepoint - base letter and combining (accents) (also possible more than one!), and
   each codepoint can require more than one codeunit to store (for UTF8 codeunit size is 8bit, though
   some codepoints require up to 4bytes). So, UString has normalization and locale dependent break
   iterators.
	
6. Currently UString doesn't include Enumerable module.

7. UString index/[] methods which accept URegexp, throw exception if Regexp passed.

8. UString#<=>, UString#casecmp use UCA rules.

=== URegexp

UString uses ICU regexp library. Pattern syntax is described in [./docs/UNICODE_REGEXPS] and ICU docs.

There are some differences between processing in Ruby Regexp and URegexp:

1. When UString#sub, UString#gsub are called with block, special vars ($~, $&, $1, ...) aren't
   set, as their values are processed through deep ruby core code. Instead, block receives UMatch object,
   which is essentially immutable array of matching groups:
        "test".u.gsub(ure("(e)(.)")) do |match| 
           puts match[0]  # => 'es' <--> $&
           puts match[1]  # => 'e'  <--> $1
           puts match[2]  # => 's'  <--> $2
        end

2. In URegexp search pattern backreferences are in form \n (\1, \2, ...), 
   in replacement string - in form $1, $2, ...

   NOTE: URegexp considers char to be a digit NOT ONLY ASCII (0x0030-0x0039), but 
   any Unicode char, which has property Decimal digit number (Nd), e.g.:
        a = [?$, 0x1D7D9].pack("U*").u * 2
        puts a.inspect_names
        <U000024>DOLLAR SIGN
        <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
        <U000024>DOLLAR SIGN
        <U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
        puts "abracadabra".u.gsub(/(b)/.U, a)
        abbracadabbra
    

3. One can create URegexp using global Kernel#ure function, Regexp#U, Regexp#to_u, or
   from UString using URegexp.new, e.g:
      /pattern/.U =~ "string".u

4. There are differences about Regexp and URegexp multiline matching options:
      t = "text\ntest"
      # ^,$ handling : URegexp multiline <-> Ruby default
      t.u =~ ure('^\w+$', URegexp::MULTILINE)
      => #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]>
      t =~ /^\w+$/
      => 0
      # . matches \n : URegexp DOTALL <-> /m
      t.u =~ ure('.+test', URegexp::DOTALL)
      => #<UMatch:0xf6fa4d88 ...
      t.u =~ /.+test/m

5. UMatch.range(idx) returns range for capturing group idx. This range is in codeunits.

=== References

1. ICU Official Homepage http://ibm.com/software/globalization/icu/ 
2. ICU downloads  http://ibm.com/software/globalization/icu/downloads.jsp
3. ICU Home Page http://icu.sf.net 
4. Unicode Home Page http://www.unicode.org

==== BUGS, DOCS, TODO

The code is slow and inefficient yet, is still highly experimental, 
so can have many security and memory leaks, bugs, inconsistent 
documentation, incomplete test suite. Use it at your own risk.

Bug reports and feature requests are welcome :)

===  Copying

This extension module is copyrighted free software by Nikolai Lugovoi.

You can redistribute it and/or modify it under the terms of MIT License.

Nikolai Lugovoi <meadow.nnick@gmail.com>

About

ICU4R - ICU 4.8 bindings for Ruby with collation tailoring. Runs under Ruby 1.8.7 and 1.9.2

Readme

MIT license

Activity

2 stars

2 watching

0 forks

Report repository

Releases

No releases published

Packages

No packages published

Languages

C 81.4%
Ruby 16.2%
C++ 2.4%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs

docs

ext/icu4r/c

ext/icu4r/c

lib

lib

samples

samples

test

test

tools

tools

.gitignore

.gitignore

Gemfile

Gemfile

MIT-LICENSE

MIT-LICENSE

README

README

Rakefile

Rakefile

icu4r.gemspec

icu4r.gemspec

Repository files navigation

About

Releases

Packages

Languages

License

noahdiewald/icu4r

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Languages