Description
The sample data consists of various documents with localized file names and localized content.
Supported Types
txt, gif, html, jpg, odt, pdf, png, ps, tar.gz
Audio and video files (avi, mp3) (included only once because of their size)
Supported locales
bg_BG, cs_CZ, da_DK, de_DE, el_GR, es_ES, et_EE, fi_FI, fr_FR, hr_HR, hu_HU, it_IT, kk_KZ, lt_LT, lv_LV, nb_NO, nl_NL, pl_PL, pt_BR, ro_RO, ru_RU, sk_SK, sl_SL, sv_SE, tr_TR, uk_UA
Structure of documents
- text documents have the following structure:
1) locale
e.g.: cs_CZ.UTF-8
2) a typical or general set of national characters
e.g.: áÁ â êÊ ŷŶ ûÛ îÎ íÍ ôÔ óÓ ŝŜ ĝĜ ĥĤ ĵĴ ŵŴ ĉĈ ąĄ źŹ éÉ ŕŔ ţŢ ýÝ íÍ óÓ ģĢ ĺĹ ćĆ ńŃ æ žŽ ěĚ řŘ ťŤ úÚ ůŮ šŠ ťŤ ďĎ čČ ňŇ chCh
3) localized text taken mostly from Wikipedia
e.g.: Česko
Z Wikipedie, otevřené encyklopedie
Skočit na: Navigace, Hledání
Česká republika
vlajka
znak
Hymna: Kde domov můj
Motto: Pravda vítězí
etc...
- html, odt, pdf and ps files:
Created (exported) from the original txt file.
- tar.gz:
Created from all files.
- sort test files:
There is a special text document dedicated to testing sorting. It contains one character per line, national characters included. The lines are randomly shuffled, ready to be sorted according to national rules.
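The sort test file can be checked with a few lines of Python. The following is only a minimal sketch: the function names are hypothetical, it assumes the file is UTF-8 with one entry per line as described above, and it relies on the requested locale being installed on the system (otherwise `locale.setlocale` raises `locale.Error`).

```python
import locale


def sort_by_locale(lines, locale_name):
    """Return lines sorted by the collation rules of locale_name.

    locale_name (e.g. "cs_CZ.UTF-8") must be installed on the system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it


def sort_test_file(path, locale_name):
    """Read a one-entry-per-line UTF-8 file and return its sorted lines."""
    with open(path, encoding="utf-8") as handle:
        # Drop the trailing newline and skip empty lines.
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    return sort_by_locale(lines, locale_name)
```

A call such as `sort_test_file("sort.txt", "cs_CZ.UTF-8")` (the file name is an assumption) returns the lines in Czech order, which can then be compared with the output of the tool under test.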
Download links
A small package of test data is prepared for each language. Download the ones you need here (UTF-8 encoding):
- bg_BG
- cs_CZ
- da_DK
- de_DE
- el_GR
- es_ES
- et_EE
- fi_FI
- fr_FR
- hr_HR
- hu_HU
- it_IT
- kk_KZ
- lt_LT
- lv_LV
- nb_NO
- nl_NL
- pl_PL
- pt_BR
- ro_RO
- ru_RU
- sk_SK
- sl_SL
- sv_SE
- tr_TR
- uk_UA
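Once a package is downloaded, its text documents can be read and split into the three parts described under "Structure of documents". The sketch below is illustrative only: the function names are hypothetical, and it assumes that the locale occupies the first line of a text document, the national characters the second line, and the localized text the remainder. The actual files may be laid out differently.

```python
import tarfile


def read_text_documents(archive_path):
    """Return {member name: decoded text} for every .txt file in a .tar.gz."""
    documents = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".txt"):
                data = archive.extractfile(member).read()
                documents[member.name] = data.decode("utf-8")
    return documents


def split_text_document(text):
    """Split a text document into locale, national characters and body.

    Assumed layout (not confirmed by the packages themselves):
    line 1 = locale, line 2 = national characters, rest = localized text.
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least a locale line and a character line")
    return {
        "locale": lines[0].strip(),
        "characters": lines[1].split(),  # whitespace-separated groups
        "body": "\n".join(lines[2:]),
    }
```

For example, `read_text_documents("cs_CZ.tar.gz")` (the archive name is an assumption) yields the text documents of the Czech package, and `split_text_document` can then be applied to each of them. The sort test file does not follow this layout, so it should be handled separately.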
There are also some special test files suitable for all languages: