THE GOOD, THE BAD,
     AND THE UGLY




What Happened to Unicode and PHP 6


  Andrei Zmievski ❖ PHP Community Conference
ABOUT 1 YEAR AGO…

“Hello PHP 5.4, open for all new stuff.” — Jani
TIME OF DEATH

March 11, 11:09:37 2010 GMT
5 YEARS EARLIER…

❖   PHP 5.0.0 released in July 2004
❖   Firefox 1.0 released in November 2004
❖   Chrome not even a twinkle in Google’s eye
❖   Unicode version 4.0.1
WHAT IS UNICODE?
  and why do I need it?
Unicode
    …is a computing industry
   standard for the consistent
  encoding, representation and
handling of text expressed in most
 of the world's writing systems.
Unicode

 provides a unique number
    for every character:

no matter what the platform,
no matter what the program,
no matter what the language.
UNICODE STANDARD

❖   Developed by the Unicode Consortium
❖   Covers all major living scripts
❖   Version 6.0 has 109,000+ characters
❖   Capacity for 1 million+ characters
❖   Widely supported by standards & industry
FEATURES


❖   Rich property set for every character

❖   Standard, unified encodings: UTF-8/16/32

❖   Extensive rules and documents for implementation

❖   Everything works, as long as everyone follows the rules
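
The "unique number" and the per-character property set can be inspected from any language that ships a copy of the Unicode database. A minimal sketch, assuming Python and its standard unicodedata module (neither appears in the talk; describe() is a hypothetical helper named here for illustration):

```python
import unicodedata

def describe(ch):
    """Return (code point, official name, general category) for one character."""
    return (ord(ch), unicodedata.name(ch), unicodedata.category(ch))

# The code point is the same on every platform; only its byte encoding varies.
for ch in ["A", "\u00df", "\u00de", "\u20ac"]:   # A, ß, Þ, €
    cp, name, category = describe(ch)
    print(f"U+{cp:04X} {name} [{category}]")
```

The category codes come from the standard itself: Lu is an uppercase letter, Ll a lowercase letter, Sc a currency symbol.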
UNICODE != I18N



❖   Unicode simplifies development

❖   Unicode does not fix all internationalization problems
TIME FORMATS
❖   USA:    4:00  P.M.

❖   France: 16.00

❖   Japan: 1600  

❖   Don’t forget to identify the time zone
CURRENCY

❖   Symbol placement
❖   Symbol length (1-15)
❖   Number width
❖   Number precision:
    ‣   Spain, Japan – 0
    ‣   Mexico, Brazil – 2
    ‣   Egypt, Iraq – 3
❖   Examples: US $12.34, 12.345,67 €, 12$34€, ¥123
SORTING
❖   Swedish:        z<ö
❖   German:         ö<z
❖   Dictionary:     öf < of
❖   Phonebook:      of < öf
❖   Upper-first:    A<a
❖   Lower-first:    a<A
❖   Contractions:   H < Z, but CH > CZ
❖   Expansions:     OE < Œ < OF
CLDR


❖   Hosted by Unicode Consortium

❖   Latest release: December 2010 (CLDR 1.9)

❖   516 locales, with 187 languages and 166 territories
WHY WEB NEEDS UNICODE
MOJIBAKE

noun: phenomenon of incorrect, unreadable
characters shown when computer software
fails to render a text correctly according to
its associated character encoding.
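
Mojibake is easy to reproduce on purpose. A minimal sketch, assuming Python (not the talk's language) and a hypothetical helper mojibake() that writes text in one encoding and reads it back in another:

```python
def mojibake(text, written_as="utf-8", read_as="latin-1"):
    """Encode text one way and decode it another, as a misconfigured system would."""
    return text.encode(written_as).decode(read_as)

name = "Jo\u00ebl"            # "Joël"
garbled = mojibake(name)      # UTF-8 bytes misread as Latin-1
print(garbled)                # JoÃ«l

# The damage is reversible only while no bytes have been dropped or replaced.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == name)       # True
```

The two bytes that UTF-8 uses for ë (C3 AB) are each a printable Latin-1 character, which is why the result looks plausible rather than obviously broken.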
I UNICODE,
YOU UNICODE
I | UNICODE,
YOU | UNICODE
Helgi Þormar Þorbjörnsson
ISLTHORP
         or

Mr. SECURITY OVERRIDE
Joel
Joël
WEAKEST LINK
PRINCIPLE APPLIES
WHY PHP NEEDS UNICODE
PHP


❖   Essential Web platform

❖   Since Web needs Unicode…

❖   …so does PHP

❖   Do not want to be the weakest link
THE PROJECT


❖   Launched in February 2005 by me at Yahoo

❖   Small group from Yahoo, Zend, and PHP development
    community

❖   Design before code
UNICODE SUPPORT


❖   Everywhere:

    ‣   in the engine

    ‣   in the extensions

    ‣   in the API
UNICODE SUPPORT

❖   Native and complete

    ‣   no hacks

    ‣   no mishmash of external libraries

    ‣   no missing locales

    ‣   no language bias
ICU LIBRARY
International Components for Unicode

✓   Unicode Character Properties
✓   Unicode String Class & text processing
✓   Text transformations (normalization, upper/lowercase, etc)
✓   Text Boundary Analysis (Character/Word/Sentence Break Iterators)
✓   Encoding Conversions for 500+ legacy encodings
✓   Language-sensitive collation (sorting) and searching
✓   Unicode regular expressions
✓   Thread-safe
✓   Formatting: Date/Time/Numbers/Currency
✓   Cultural Calendars & Time Zones
✓   (230+) Locale handling
✓   Resource Bundles
✓   Transliterations (50+ script pairs)
✓   Complex Text Layout for Arabic, Hebrew, Indic & Thai
✓   International Domain Names and Web addresses
✓   Java model for locale-hierarchical resource bundles. Multiple locales can be used at a time
THE PROJECT


❖   Development was in a separate repository

❖   Merged into PHP tree once the basics were working

❖   Initially slated for 5.x

❖   Extensive changes necessitated a major version bump
PHP 6 = PHP 5 + Unicode
PHP 5 = PHP 6 - Unicode
Unicode = PHP 6 - PHP 5
PHP 6
STRING TYPES

❖   Unicode

    ‣   text

    ‣   default for literals, etc

❖   Binary

    ‣   bytes

    ‣   everything ∉ Unicode type
CONVERSIONS DATAFLOW

❖   Inside PHP: Unicode strings (runtime encoding) and binary strings
❖   request: HTTP input encoding
❖   response: HTTP output encoding
❖   scripts: script encoding
❖   streams: stream-specific encodings
❖   filesystem: filesystem encoding
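
The same decode-at-the-edge, encode-on-the-way-out flow can be sketched in a few lines. This is an illustration in Python, not the PHP 6 implementation; handle_request() and its parameters are hypothetical names, and the "processing" step is a placeholder:

```python
def handle_request(raw_body: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Decode at the edge, work on text in the middle, encode on the way out."""
    text = raw_body.decode(input_encoding)    # HTTP input encoding -> Unicode string
    reply = text.upper()                      # all processing happens on text
    return reply.encode(output_encoding)      # Unicode string -> HTTP output encoding

request = "\u00deorbj\u00f6rnsson".encode("iso-8859-1")   # "Þorbjörnsson" as wire bytes
response = handle_request(request, "iso-8859-1", "utf-8")

print(len(request), len(response))    # 12 14
print(response.decode("utf-8"))       # ÞORBJÖRNSSON
```

Binary strings never pass through the decode step; only data declared to be text gets converted.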
STRINGS
❖   String literals are Unicode

❖   String offsets work on code points

        $str = "  ";        // 2 code points
        echo $str[1];       // result is
        $str[0] = ' ';      // full string is now
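
The difference between offsets on code points and offsets on bytes can be shown with a language that already has both string types. A sketch in Python (an assumption of this example; the sample string is chosen here and is not the one from the slide):

```python
text = "\u65e5\u672c"            # "日本", two code points
raw = text.encode("utf-8")       # the same text as a binary string

print(len(text), len(raw))       # 2 6
print(text[1])                   # 本, the second code point
print(raw[1:2])                  # b'\x97', a fragment of the first character

# Python strings are immutable, so "assigning to offset 0" builds a new string.
text = "\u4e2d" + text[1:]       # now "中本"
print(len(text))                 # 2
```

Indexing the binary string at offset 1 lands in the middle of a character, which is the bug that code point offsets were meant to rule out.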
IDENTIFIERS
❖   Unicode identifiers are allowed

      class                       {  
            function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ()
 
 {  ...  }
            function  !வா$  கேனச)

()    {  ...  }
            function  འ"ག་%ལ།()
 
 
         {  ...  }
      }  

      $              =  array();
      $            ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new       ;
FUNCTIONS
❖ Functions understand Unicode text and apply
  appropriate rules
❖ e.g. case manipulation

    $str = strtoupper("fußball");   // result is FUSSBALL

    $str = strtolower("ΣΕΛΛΑΣ");    // result is σελλας
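
Both case-mapping rules are part of the Unicode standard, so they can be checked in any implementation that follows it. A sketch in Python (used here only as a stand-in, since the PHP 6 versions of strtoupper() and strtolower() never shipped):

```python
word = "fu\u00dfball"                                     # "fußball"
upper = word.upper()                                      # ß expands to two letters
lower = "\u03a3\u0395\u039b\u039b\u0391\u03a3".lower()    # "ΣΕΛΛΑΣ"

print(upper)                   # FUSSBALL
print(len(word), len(upper))   # 7 8: case mapping can change string length
print(lower)                   # σελλας: σ inside the word, ς at the end
```

Neither result can be produced by a byte-for-byte lookup table, which is why every string function had to be revisited.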
TRANSLITERATION
$names  =  "  
    김,  국삼  
    김,  명희  
         ,         
           ,          
    Горбачев,  Михаил  
    Козырев,  Андрей  
    Καφετζόπουλος,  Θεόφιλος  
    Θεοδωράτου,  Ελένη  
";  
$r = strtotitle(str_transliterate($names, "Any", "Latin"));

Gim,  Gugsam
Gim,  Myeonghyi
Takeda,  Masayuki
Oohara,  Manabu
Gorbačev,  Mihail
Kozyrev,  Andrej
Kaphetzópoulos,  Theóphilos
Theodōrátou,  Elénē
PECL/INTL
FEATURES
❖   Locales
❖   Collation
❖   Number and Currency Formatters
❖   Date and Time Formatters
❖   Time Zones
❖   Calendars
❖   Message Formatter
❖   Choice Formatter
❖   Resource Handler
❖   Normalization
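
Normalization, the last item above, matters because the same visible text can be stored as different code point sequences. A minimal sketch using Python's standard unicodedata module (an assumption of this example; in PHP the intl extension exposes this through its Normalizer class):

```python
import unicodedata

composed = "Jo\u00ebl"        # ë as one code point, U+00EB
decomposed = "Joe\u0308l"     # e followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)            # False, although both render as Joël
print(len(composed), len(decomposed))    # 4 5

nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", composed)     # decompose fully
print(nfc == composed, nfd == decomposed)        # True True
```

Comparing, hashing, or de-duplicating user input is only reliable after both sides have been brought to the same form.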
COLLATION
sorting
$strings = array(
    "cote", "côte", "Côte", "coté",
    "Coté", "côté", "Côté", "coter");
$coll = new Collator("fr_FR");
$coll->sort($strings);

result
cote
côte
Côte
coté
Coté
côté
Côté
coter
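
For contrast, here is what a sort without a collator does to the same list. This sketch assumes Python, whose built-in sorted() compares strings code point by code point with no locale rules; it is shown only as a baseline, not as a replacement for Collator:

```python
import unicodedata

strings = ["cote", "côte", "Côte", "coté", "Coté", "côté", "Côté", "coter"]
# Normalize first so every accented letter is a single code point.
strings = [unicodedata.normalize("NFC", s) for s in strings]

naive = sorted(strings)   # plain code point order
print(naive)
# ['Coté', 'Côte', 'Côté', 'cote', 'coter', 'coté', 'côte', 'côté']
```

All capitalized words come first and "coter" lands in the middle, because uppercase letters and unaccented letters simply have smaller code points.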
NUMBER FORMATTING
                123456.789  in  en_US
❖   NumberFormatter::DECIMAL
    123,456.789

❖   NumberFormatter::CURRENCY
    $123,456.79

❖   NumberFormatter::ORDINAL
    123,457th

❖   NumberFormatter::SPELLOUT
    one hundred and twenty-three thousand, four hundred
    and fifty-six point seven eight nine
MESSAGE FORMATTING

with modifiers
$pattern = "On {0,date,full} you received
            {1,number,#,##0.00} emails.";
$args = array(time(), 1184);
$fmt = new MessageFormatter('en_US', $pattern);
echo $fmt->format($args);

result
On Thursday, November 22, 2007 you received
1,184.00 emails.
POSTMORTEM
WHAT WENT RIGHT
1. RAISED AWARENESS


❖   Spoke at multiple conferences about the project

    ‣   including Unicode Conference

❖   Shoved Unicode down people’s throats at every
    opportunity
2. CHOSE THE RIGHT TECH


❖   ICU library had everything we needed

❖   Low- and high-level functionality

❖   Good support from its developers
3. UNIT TESTS


❖   Every function handling strings had to be ported

❖   Unit tests showed us where things broke

❖   Also easy to track progress
4. PECL/INTL EXTENSION


❖   A lot of i18n/l10n functionality in a self-contained
    extension

❖   Ensuring that it worked with PHP 5
5. CODE SEGREGATION


❖   Proof-of-concept developed by only a few people

❖   Faster decisions, iteration, development

❖   Things slowed down after merging into the main tree

    ‣   but it was necessary to spread the workload
WHAT WENT RONG
1. CHOICE OF UTF-16
Thought to be the best compromise
UTF-8


❖   Backward-compatible with ASCII

❖   Avoids complications of endianness

❖   Dominant UTF encoding for the Web

❖   Supported in a lot of libraries, APIs, etc
UTF-8, BUT…

❖   Variable-length encoding (1-4 bytes)

❖   Uses 3 bytes for BMP code points > U+07FF

❖   Not all byte sequences are valid

❖   ICU did not have many UTF-8 APIs (at the time)

    ‣   on-the-fly conversion is necessary
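
The variable length and the validity constraint are both easy to observe. A sketch in Python (not the talk's language); utf8_len() and is_valid_utf8() are hypothetical helper names:

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 needs for one code point."""
    return len(chr(code_point).encode("utf-8"))

for cp in (0x41, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(f"U+{cp:04X} -> {utf8_len(cp)} byte(s)")   # 1, 2, 3, 3, 4

def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"Jo\xc3\xabl"))   # True: ë as the two bytes C3 AB
print(is_valid_utf8(b"Jo\xebl"))       # False: a lone Latin-1 byte EB
```

Any code that slices or truncates UTF-8 by byte count has to guard against cutting a sequence in half.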
UTF-32



❖   Uses exactly 4 bytes for each code point

    ‣   directly indexable!
UTF-32, BUT...

❖   Uses exactly 4 bytes for each code point

    ‣   4x the size of UTF-8 for majority of languages

❖   Only affordable by people from rich oil countries

❖   Still needs conversion to UTF-16 when using ICU

❖   Endianness
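
Direct indexing, the size cost, and the byte-order question can all be seen in one short sketch. Python is assumed here purely for illustration, and code_point_at() is a hypothetical helper:

```python
text = "Hello, PHP"                       # 10 ASCII characters

utf8 = text.encode("utf-8")
utf32_le = text.encode("utf-32-le")       # explicit byte order, no BOM
utf32_be = text.encode("utf-32-be")

print(len(utf8), len(utf32_le))           # 10 40
print(utf32_le[:4], utf32_be[:4])         # b'H\x00\x00\x00' b'\x00\x00\x00H'

def code_point_at(buf, i):
    """Fixed width: code point i always lives at bytes [4*i, 4*i + 4)."""
    return int.from_bytes(buf[4 * i:4 * i + 4], "little")

print(chr(code_point_at(utf32_le, 7)))    # P
```

The lookup is a single multiplication, but three of every four bytes in this buffer are zero.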
UTF-16


❖   “65,536 code points should be enough for everyone…”

❖   2 bytes to represent all of BMP (U+0000 to U+FFFF)

    ‣   directly indexable in that plane

❖   Internal encoding of ICU
UTF-16, BUT…

❖   Requires surrogate pairs for code points > U+FFFF

    ‣   still variable-length

❖   2x the size of UTF-8 for ASCII text; no smaller than UTF-8 for
    Greek, Cyrillic, Armenian, Hebrew, Arabic and other scripts
    below U+0800

❖   Can’t be manipulated by normal C string handling

❖   Endianness
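
The surrogate pair arithmetic is defined by the standard and fits in a few lines. A sketch in Python (an assumption of this example); surrogate_pair() is a hypothetical helper, cross-checked against the built-in encoder:

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = surrogate_pair(0x1D11E)       # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")            # D834 DD1E

encoded = chr(0x1D11E).encode("utf-16-be")
print(encoded.hex())                      # d834dd1e, same two code units

# ASCII text doubles in size compared with UTF-8.
print(len("PHP".encode("utf-8")), len("PHP".encode("utf-16-le")))   # 3 6
```

Indexing by 16-bit unit is therefore only safe for text known to stay inside the BMP.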
CHOICE OF UTF-16

❖   Thought that CJK languages would benefit from UTF-16

❖   Primary driver was the ICU APIs

❖   Problems: no direct indexing, many conversions

❖   Would probably choose UTF-8, if started over

    ‣   no need for decoding/encoding on the periphery

    ‣   can be used by C-based libraries
2. CRUCIAL CODE LAGGED


❖   PDO (and native DB extensions)

❖   filter

❖   proper substring search (collation-based)

❖   some ext/standard functionality
3. LACK OF MINDSHARE

❖   Probably <10 people who understood the intricacies of
    Unicode and ICU

❖   In the end, implementation deemed too technically
    difficult

❖   People were bored converting large chunks of already
    working code
4. DELAYED NEW FEATURES
5. MEA CULPA
END GAME
RE-ORG

❖   PHP 6 trunk was moved to a branch

❖   PHP 5.4 became the trunk

❖   Kick-started development of new features

❖   Some clean-ups and improvements from 6 back-ported
    to 5.4
PEOPLE MATTER


❖   The project ran out of steam

    ‣   PHP development culture means that people work on
        what they’re interested in

    ‣   Clearly, the Unicode/i18n implementation wasn’t
        interesting enough to be viable
INTERNALS

“Because it’s nearly impossible to
participate on Internals if your
poo-throwing arm isn’t strong.”
                       — @coates
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
tend to be the main drivers behind PHP
development.”
                                   — me
PERSISTENCE

“Those with talent, competence, energy,
and good ideas over a period of time
        and who outlast the rest
tend to be the main drivers behind PHP
development.”
                                  — me
PROGRESS



❖   No development on the Unicode branch

❖   No visible effort to develop alternatives
FUTURE?

❖   Lighter, gentler implementations?

    ‣   mbstring is clunky

    ‣   separate Unicode String class would also be clunky

❖   Open field for someone with a great idea, persistence,
    and people skills
STEPPING AWAY


❖   Invalidation of several man-years of hard work is
    discouraging

❖   Did not feel like pushing the project up the hill again

❖   Working on more fun stuff these days
LESSONS LEARNED

❖   Rewriting a large existing code base is hard

❖   Making people do tedious stuff is hard

    ‣   make it interesting for them (game-like)

❖   Waiting for results of long iterations is hard

    ‣   short, results-oriented projects (if possible)

❖   Stay committed
FINITA LA COMEDIA




http://joind.in/3349 ❖ http://zazzle.com/andreiz

Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 

The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6

  • 1. THE GOOD, THE BAD, AND THE UGLY What Happened to Unicode and PHP 6 Andrei Zmievski ❖ PHP Community Conference
  • 2. ABOUT 1 YEAR AGO… “Hello PHP 5.4, open for all new stuff.” — Jani
  • 4. 5 YEARS EARLIER… PHP 5.0.0 released in July 2004
  • 5. 5 YEARS EARLIER… Firefox 1.0 released in November 2004
  • 6. 5 YEARS EARLIER… Chrome not even a twinkle in Google’s eye
  • 7. 5 YEARS EARLIER… Unicode version 4.0.1
  • 8. WHAT IS UNICODE? and why do I need it?
  • 9. Unicode …is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
  • 10. Unicode provides a unique number for every character: no matter what the platform, no matter what the program, no matter what the language.
  • 11. UNICODE STANDARD ❖ Developed by the Unicode Consortium ❖ Covers all major living scripts ❖ Version 6.0 has 109,000+ characters ❖ Capacity for 1 million+ characters ❖ Widely supported by standards & industry
  • 12. FEATURES ❖ Rich property set for every character ❖ Standard, unified encodings: UTF-8/16/32 ❖ Extensive rules and documents for implementation ❖ Everything works, as long as everyone follows the rules
  • 13. UNICODE != I18N ❖ Unicode simplifies development ❖ Unicode does not fix all internationalization problems
  • 14. TIME FORMATS ❖ USA: 4:00 P.M. ❖ France: 16.00 ❖ Japan: 1600 ❖ Don’t forget to identify the time zone
  • 15. CURRENCY ❖ Symbol placement ❖ Symbol length (1-15) ❖ Number width ❖ Number precision: ‣ Spain, Japan – 0 ‣ Mexico, Brazil – 2 ‣ Egypt, Iraq – 3 ❖ Examples: US $12.34, 12.345,67 €, 12$34€, ¥123
  • 16. SORTING ❖ Swedish: z<ö ❖ German: ö<z ❖ Dictionary: öf < of ❖ Phonebook: of < öf ❖ Upper-first: A<a ❖ Lower-First: a<A ❖ Contractions: H < Z, but CH > CZ ❖ Expansions: OE < Œ < OF
  • 17. CLDR ❖ Hosted by Unicode Consortium ❖ Latest release: December 2010 (CLDR 1.9) ❖ 516 locales, with 187 languages and 166 territories
  • 18. WHY WEB NEEDS UNICODE
  • 20. MOJIBAKE noun: phenomenon of incorrect, unreadable characters shown when computer software fails to render a text correctly according to its associated character encoding.
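The definition above can be reproduced in a few lines. This is a minimal sketch, assuming PHP 7+ with the mbstring extension loaded; the helper name `misreadAsLatin1` is hypothetical and chosen only for illustration. It takes UTF-8 bytes and decodes them as if they were ISO-8859-1, which is one common way mojibake arises.

```php
<?php
// Hypothetical helper: treat UTF-8 bytes as if they were ISO-8859-1 text,
// then re-encode the (wrong) result as UTF-8 so it can be displayed.
function misreadAsLatin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

$name    = "Jo\u{00EB}l";           // "Joël" in UTF-8: 4 characters, 5 bytes
$garbled = misreadAsLatin1($name);  // every byte becomes its own character

echo $name, ' -> ', $garbled, PHP_EOL;  // prints "Joël -> JoÃ«l"
```

The two bytes that encode "ë" in UTF-8 (0xC3 0xAB) are each shown as a separate Latin-1 character, so one letter turns into two unrelated ones.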
  • 24. I | UNICODE, YOU | UNICODE
  • 25. Helgi
  • 26. Helgi
  • 28. ISLTHORP or Mr. SECURITY OVERRIDE
  • 30. Joel
  • 31. Joël
  • 32. Joël
  • 36. WHY PHP NEEDS UNICODE
  • 37. PHP ❖ Essential Web platform ❖ Since Web needs Unicode… ❖ …so does PHP ❖ Do not want to be the weakest link
  • 39. THE PROJECT ❖ Launched in February 2005 by me at Yahoo ❖ Small group from Yahoo, Zend, and PHP development community ❖ Design before code
  • 40. UNICODE SUPPORT ❖ Everywhere: ‣ in the engine ‣ in the extensions ‣ in the API
  • 41. UNICODE SUPPORT ❖ Native and complete ‣ no hacks ‣ no mishmash of external libraries ‣ no missing locales ‣ no language bias
  • 42. ICU LIBRARY International Components for Unicode ✓ Unicode Character Properties ✓ Unicode String Class & text processing ✓ Text transformations (normalization, upper/lowercase, etc) ✓ Text Boundary Analysis (Character/Word/Sentence Break Iterators) ✓ Encoding Conversions for 500+ legacy encodings ✓ Language-sensitive collation (sorting) and searching ✓ Unicode regular expressions ✓ Thread-safe ✓ Formatting: Date/Time/Numbers/Currency ✓ Cultural Calendars & Time Zones ✓ (230+) Locale handling ✓ Resource Bundles ✓ Transliterations (50+ script pairs) ✓ Complex Text Layout for Arabic, Hebrew, Indic & Thai ✓ International Domain Names and Web addresses ✓ Java model for locale-hierarchical resource bundles. Multiple locales can be used at a time
  • 43. THE PROJECT ❖ Development was in a separate repository ❖ Merged into PHP tree once the basics were working ❖ Initially slated for 5.x ❖ Extensive changes necessitated a major version bump
  • 44. PHP 6 = PHP 5 + Unicode
  • 45. PHP 5 = PHP 6 - Unicode
  • 46. Unicode = PHP 6 - PHP 5
  • 47. PHP 6
  • 48. PHP 6 6
  • 49. STRING TYPES ❖ Unicode ‣ text ‣ default for literals, etc ❖ Binary ‣ bytes ‣ everything ∉ Unicode type
  • 50. Conversions Dataflow (diagram) ❖ Inside PHP: Unicode strings, binary strings, runtime encoding ❖ At the edges: request (HTTP input encoding), response (HTTP output encoding), scripts (script encoding), streams (stream-specific encoding), filesystem (filesystem encoding)
  • 51. STRINGS ❖ String literals are Unicode ❖ String offsets work on code points $str  =  " "; //  2  code  points echo  $str[1]; //  result  is     $str[0]  =  ' '; //  full  string  is  now  
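For contrast with the code-point offsets described on this slide, here is a minimal sketch of how offsets behave in released PHP versions, where strings are byte sequences. It assumes PHP 7+ (for the `\u{...}` escape) and the mbstring extension; the sample word is an illustrative choice, not taken from the slides.

```php
<?php
$str = "h\u{00E9}llo";   // "héllo": 5 code points, 6 bytes in UTF-8

$byteLength      = strlen($str);              // counts bytes
$codePointLength = mb_strlen($str, 'UTF-8');  // counts code points

$byteAtOffset1      = $str[1];                         // only the first byte of "é"
$codePointAtOffset1 = mb_substr($str, 1, 1, 'UTF-8');  // the whole "é"

printf("bytes: %d, code points: %d\n", $byteLength, $codePointLength);
printf("byte at offset 1: 0x%s, code point at offset 1: %s\n",
    strtoupper(bin2hex($byteAtOffset1)), $codePointAtOffset1);
```

Bracket offsets address bytes, so `$str[1]` yields half of a character here, while `mb_substr()` returns the complete code point.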
  • 52. IDENTIFIERS ❖ Unicode identifiers are allowed class    {        function  ᓱᓴᓐ  ᐊᒡᓗᒃᑲᖅ() {  ...  }      function  !வா$  கேனச) ()    {  ...  }      function  འ"ག་%ལ།()        {  ...  } }   $  =  array(); $ ['‫  =  ]'ַרעְיולּוחַ  ׁשָנָה‬new   ;
  • 53. FUNCTIONS ❖ Functions understand Unicode text and apply appropriate rules ❖ i.e. case manipulation $str  =  strtoupper("fußball"); //  result  is  FUSSBALL $str  =  strtolower("ΣΕΛΛΑΣ"); //  result  is  σελλάς  
  • 54. TRANSLITERATION $names  =  "      김,  국삼      김,  명희       ,         ,        Горбачев,  Михаил      Козырев,  Андрей      Καφετζόπουλος,  Θεόφιλος      Θεοδωράτου,  Ελένη   ";   $r  =  strtotitle(str_transliterate($names,  "Any",  "Latin")); Gim,  Gugsam Gim,  Myeonghyi Takeda,  Masayuki Oohara,  Manabu Gorbačev,  Mihail Kozyrev,  Andrej Kaphetzópoulos,  Theóphilos Theodōrátou,  Elénē
  • 56. FEATURES ❖ Locales ❖ Collation ❖ Number and Currency Formatters ❖ Date and Time Formatters ❖ Time Zones ❖ Calendars ❖ Message Formatter ❖ Choice Formatter ❖ Resource Handler ❖ Normalization
  • 57. COLLATION sorting $strings = array("cote", "côte", "Côte", "coté", "Coté", "côté", "Côté", "coter"); $coll = new Collator("fr_FR"); $coll->sort($strings); result: cote côte Côte coté Coté côté Côté coter
  • 58. NUMBER FORMATTING 123456.789  in  en_US ❖ NumberFormatter::DECIMAL 123456.789 ❖ NumberFormatter::CURRENCY $123,456.79 ❖ NumberFormatter::ORDINAL 123,457th ❖ NumberFormatter::SPELLOUT one hundred and twenty-three thousand, four hundred and fifty-six point seven eight nine
  • 59. MESSAGE FORMATTING with modifiers $pattern = "On {0,date,full} you received {1,number,#,##0.00} emails."; $args = array(time(), 1184); $fmt = new MessageFormatter('en_US', $pattern); echo $fmt->format($args); result: On Tuesday, November 22, 2007 you received 1,184.00 emails.
  • 62. 1. RAISED AWARENESS ❖ Spoke at multiple conferences about the project ‣ including Unicode Conference ❖ Shoved Unicode down people’s throats at every opportunity
  • 63. 2. CHOSE THE RIGHT TECH ❖ ICU library had everything we needed ❖ Low- and high-level functionality ❖ Good support from its developers
  • 64. 3. UNIT TESTS ❖ Every function handling strings had to be ported ❖ Unit tests showed us where things broke ❖ Also easy to track progress
  • 65. 4. PECL/INTL EXTENSION ❖ A lot of i18n/l10n functionality in a self-contained extension ❖ Ensuring that it worked with PHP 5
  • 66. 5. CODE SEGREGATION ❖ Proof-of-concept developed by only a few people ❖ Faster decisions, iteration, development ❖ Things slowed down after merging into the main tree ‣ but was necessary to spread the workload
  • 68. 1. CHOICE OF UTF-16 Thought to be the best compromise
  • 69. UTF-8 ❖ Backward-compatible with ASCII ❖ Avoids complications of endianness ❖ Dominant UTF encoding for the Web ❖ Supported in a lot of libraries, APIs, etc
  • 70. UTF-8, BUT… ❖ Variable-length encoding (1-4 bytes) ❖ Uses 3 bytes for BMP code points > U+07FF ❖ Not all byte sequences are valid ❖ ICU did not have many UTF-8 APIs (at the time) ‣ on-the-fly conversion is necessary
  • 71. UTF-32 ❖ Uses exactly 4 bytes for each code point ‣ directly indexable!
  • 72. UTF-32, BUT... ❖ Uses exactly 4 bytes for each code point ‣ up to 4x the size of UTF-8 (for ASCII text) ❖ Only affordable by people from rich oil countries ❖ Still needs conversion to UTF-16 when using ICU ❖ Endianness
  • 73. UTF-16 ❖ “65,536 code points should be enough for everyone…” ❖ 2 bytes to represent all of BMP (U+0 to U+FFFF) ‣ directly indexable in that plane ❖ Internal encoding of ICU
  • 74. UTF-16, BUT… ❖ Requires surrogate pairs for code points > U+FFFF ‣ still variable-length ❖ 2x the size of UTF-8 for ASCII text; no smaller than UTF-8 for Greek, Cyrillic, Armenian, Hebrew, Arabic and other scripts that take 2 bytes per character in UTF-8 ❖ Can’t be manipulated by normal C string handling ❖ Endianness
  • 75. CHOICE OF UTF-16 ❖ Thought that CJK languages would benefit from UTF-16 ❖ Primary driver was the ICU APIs ❖ Problems: no direct indexing, many conversions ❖ Would probably choose UTF-8, if started over ‣ no need for decoding/encoding on the periphery ‣ can be used by C-based libraries
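The size trade-offs weighed in the preceding slides can be checked directly. This is a minimal sketch, assuming PHP 7+ with the mbstring extension; the sample strings and the helper name `encodedSizes` are illustrative assumptions, not part of the talk.

```php
<?php
// Illustrative samples, written with \u{...} escapes so the file itself stays ASCII.
$samples = [
    'Latin'         => 'Unicode',                                           // 7 ASCII letters
    'Cyrillic'      => "\u{042E}\u{043D}\u{0438}\u{043A}\u{043E}\u{0434}",  // "Юникод"
    'CJK'           => "\u{65E5}\u{672C}\u{8A9E}",                          // "日本語"
    'Supplementary' => "\u{1F600}",                                         // one code point above U+FFFF
];

// Hypothetical helper: byte length of the same text in each encoding form.
function encodedSizes(string $utf8): array
{
    return [
        'code_points' => mb_strlen($utf8, 'UTF-8'),
        'utf8'        => strlen($utf8),
        'utf16'       => strlen(mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8')),
        'utf32'       => strlen(mb_convert_encoding($utf8, 'UTF-32LE', 'UTF-8')),
    ];
}

foreach ($samples as $label => $text) {
    $s = encodedSizes($text);
    printf("%-13s %d code points: UTF-8 %2d, UTF-16 %2d, UTF-32 %2d bytes\n",
        $label, $s['code_points'], $s['utf8'], $s['utf16'], $s['utf32']);
}
```

With these samples the byte counts for UTF-8/UTF-16/UTF-32 come out as 7/14/28 for the Latin string, 12/12/24 for Cyrillic, 9/6/12 for CJK, and 4/4/4 for the single supplementary code point. UTF-16 is only smaller for the CJK sample, and the supplementary code point needs a surrogate pair, so UTF-16 offsets are not code-point offsets in general.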
  • 76. 2. CRUCIAL CODE LAGGED
  • 77. 2. CRUCIAL CODE LAGGED ❖ PDO (and native DB extensions) ❖ filter ❖ proper substring search (collation-based) ❖ some ext/standard functionality
  • 78. 3. LACK OF MINDSHARE
  • 79. 3. LACK OF MINDSHARE ❖ Probably <10 people who understood the intricacies of Unicode and ICU ❖ In the end, implementation deemed too technically difficult ❖ People were bored converting large chunks of already working code
  • 80. 4. DELAYED NEW FEATURES
  • 83. RE-ORG ❖ PHP 6 trunk was moved to a branch ❖ PHP 5.4 became the trunk ❖ Kick-started development of new features ❖ Some clean-ups and improvements from 6 back-ported to 5.4
  • 84. PEOPLE MATTER ❖ The project ran out of steam ‣ PHP development culture means that people work on what they’re interested in ‣ Clearly, the Unicode/i18n implementation wasn’t interesting enough to be viable
  • 85. INTERNALS “Because it’s nearly impossible to participate on Internals if your poo-throwing arm isn’t strong.” — @coates
  • 86. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time tend to be the main drivers behind PHP development.” — me
  • 87. PERSISTENCE “Those with talent, competence, energy, and good ideas over a period of time and who outlast the rest tend to be the main drivers behind PHP development.” — me
  • 88. PROGRESS ❖ No development on the Unicode branch ❖ No visible effort to develop alternatives
  • 89. FUTURE? ❖ Lighter, gentler implementations? ‣ mbstring is clunky ‣ separate Unicode String class would also be clunky ❖ Open field for someone with a great idea, persistence, and people skills
  • 90. STEPPING AWAY ❖ Invalidation of several man-years of hard work is discouraging ❖ Did not feel like pushing the project up the hill again ❖ Working on more fun stuff these days
  • 91. LESSONS LEARNED ❖ Rewriting large existing code base is hard ❖ Making people do tedious stuff is hard ‣ make it interesting for them (game-like) ❖ Waiting for results of long iterations is hard ‣ short, results-oriented projects (if possible) ❖ Stay committed
  • 92. FINITA LA COMEDIA http://joind.in/3349 ❖ http://zazzle.com/andreiz