Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP

Have you ever needed to create a random string from Unicode characters, drawn only from the blocks you choose? I did a few months ago, but couldn't find a library that made it easy.

So I decided to write Unicode Ranges, a PHP library that provides you with Unicode ranges -- blocks, if you like -- in a friendly, object-oriented way.

By the way, if you are not very familiar with Unicode: its characters are grouped into named ranges such as Basic Latin, Cyrillic, Hangul Jamo, and many, many others.

Here is an example that creates a random char encoded in any of these three Unicode ranges: BasicLatin, Tibetan and Cherokee.

use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\BasicLatin;
use UnicodeRanges\Range\Tibetan;
use UnicodeRanges\Range\Cherokee;

$char = Randomizer::char([
    new BasicLatin,
    new Tibetan,
    new Cherokee,
]);

echo $char . PHP_EOL;

Output:


And this is how to create a random string with Arabic, HangulJamo and Phoenician characters:

use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\Arabic;
use UnicodeRanges\Range\HangulJamo;
use UnicodeRanges\Range\Phoenician;

$letters = Randomizer::letters([
    new Arabic,
    new HangulJamo,
    new Phoenician,
], 20);

echo $letters . PHP_EOL;

Output:

ڽە

This is very useful if you want to create random UTF-8 tokens, for example.
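To make the idea concrete without the library, here is a minimal sketch of picking random characters from code point ranges in plain PHP. It is not how Randomizer is implemented; `randomCharFromRanges()` and `randomToken()` are hypothetical helpers, and the range bounds are the Arabic and Hangul Jamo blocks as defined by Unicode. It assumes PHP 7.2+ with the mbstring extension.

```php
<?php
// Hypothetical helper, not part of Unicode Ranges: picks one random
// character from a list of [first, last] code point pairs.
function randomCharFromRanges(array $ranges): string
{
    // Choose a range uniformly, then a code point within that range.
    [$first, $last] = $ranges[random_int(0, count($ranges) - 1)];

    return mb_chr(random_int($first, $last), 'UTF-8');
}

// Hypothetical helper: builds a token of $length random characters.
function randomToken(array $ranges, int $length): string
{
    $token = '';
    for ($i = 0; $i < $length; $i++) {
        $token .= randomCharFromRanges($ranges);
    }

    return $token;
}

// Arabic (U+0600..U+06FF) and Hangul Jamo (U+1100..U+11FF).
$ranges = [[0x0600, 0x06FF], [0x1100, 0x11FF]];
$token = randomToken($ranges, 20);

// The token is random, so only its length in characters is predictable.
echo mb_strlen($token, 'UTF-8') . PHP_EOL; // prints 20
```

Note that a block may contain unassigned code points, so a real implementation would probably want to filter those out.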

I hope these examples will give you the context to follow my explanation -- for further information please read the Documentation.

New Features

Let's now cut to the chase.

Yesterday I added a new feature to Unicode Ranges so that Babylon can compute the ranges' frequencies -- or put another way, the number of times that a particular Unicode range appears in a text.

The ultimate goal is for the language detector to understand alphabets.

This is how the feature is implemented:

On the one hand, PowerRanges provides an array containing all 255 Unicode ranges.

Of course, I didn't manually instantiate the 255 classes, which would have been just tedious! Note that the PowerRanges array is dynamically built by reading the files stored in the unicode-ranges/src/Range/ folder.

This is possible with PHP's ReflectionClass.

<?php

namespace UnicodeRanges;

class PowerRanges
{
    const RANGES_FOLDER = __DIR__ . '/Range';

    protected $ranges = [];

    public function __construct()
    {
        // List the class files in the Range folder, skipping the dot entries.
        $files = array_diff(scandir(self::RANGES_FOLDER), ['.', '..']);
        foreach ($files as $file) {
            // The file name without its extension is the class name.
            $filename = pathinfo($file, PATHINFO_FILENAME);
            $classname = "\\UnicodeRanges\\Range\\$filename";
            // Instantiate the range class by name through reflection.
            $rangeClass = new \ReflectionClass($classname);
            $rangeObj = $rangeClass->newInstanceArgs();
            $this->ranges[] = $rangeObj;
        }
    }

    public function ranges()
    {
        return $this->ranges;
    }
}
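If you want to try the reflection trick in isolation, the following sketch reproduces it without touching the file system. `BasicLatinDemo`, `CyrillicDemo` and `instantiateFromFiles()` are hypothetical names made up for this example, and the file list is hard-coded instead of coming from scandir().

```php
<?php
// Hypothetical stand-ins for two of the library's range classes.
class BasicLatinDemo
{
    public function name(): string
    {
        return 'Basic Latin';
    }
}

class CyrillicDemo
{
    public function name(): string
    {
        return 'Cyrillic';
    }
}

// Hypothetical helper: turns a list of file names into objects,
// assuming each file is named after the class it contains.
function instantiateFromFiles(array $files): array
{
    $objects = [];
    foreach ($files as $file) {
        // 'BasicLatinDemo.php' becomes 'BasicLatinDemo'.
        $classname = pathinfo($file, PATHINFO_FILENAME);
        $objects[] = (new ReflectionClass($classname))->newInstance();
    }

    return $objects;
}

$ranges = instantiateFromFiles(['BasicLatinDemo.php', 'CyrillicDemo.php']);
foreach ($ranges as $range) {
    echo $range->name() . PHP_EOL;
}
```

This prints Basic Latin and Cyrillic, one per line. The approach relies on the convention of one class per file, with the file named after the class.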

On the other hand, Converter::unicode2range($char) converts any multibyte char into its object-oriented Unicode range counterpart.

Example:

use UnicodeRanges\Converter;

$char = 'a';
$range = Converter::unicode2range($char);

echo "Total: {$range->count()}".PHP_EOL;
echo "Name: {$range->name()}".PHP_EOL;
echo "Range: {$range->range()[0]}-{$range->range()[1]}".PHP_EOL;
echo 'Characters: ' . PHP_EOL;
print_r($range->chars());

Output:

Total: 96
Name: Basic Latin
Range: 0020-007F
Characters:
Array
(
    [0] =>  
    [1] => !
    [2] => "
    [3] => #
    [4] => $
    [5] => %
    [6] => &
    [7] => '
    ...
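Conceptually, this kind of conversion boils down to comparing a character's code point against the bounds of each range. The sketch below illustrates that idea with a plain lookup table; it is not the Converter's actual code. `DEMO_RANGES` and `rangeNameOf()` are hypothetical, only three ranges are listed, and the Basic Latin bounds are the ones shown in the output above. It assumes PHP 7.2+ with mbstring.

```php
<?php
// Hypothetical lookup table: range name => [first, last] code point.
const DEMO_RANGES = [
    'Basic Latin' => [0x0020, 0x007F],
    'Cyrillic' => [0x0400, 0x04FF],
    'Hiragana' => [0x3040, 0x309F],
];

// Hypothetical helper: returns the name of the range that contains
// the given character, or null if it is not in the table.
function rangeNameOf(string $char): ?string
{
    $codePoint = mb_ord($char, 'UTF-8');
    foreach (DEMO_RANGES as $name => [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }

    return null;
}

echo rangeNameOf('a') . PHP_EOL;  // Basic Latin
echo rangeNameOf('ж') . PHP_EOL;  // Cyrillic
echo rangeNameOf('の') . PHP_EOL; // Hiragana
```

With all 255 ranges in the table, a binary search over sorted bounds would avoid scanning the whole list for every character.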

This is how Babylon can now analyze the frequency of the Unicode ranges:

/**
 * @test
 */
public function freq()
{
    $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';
    $expected = [
        'Basic Latin' => 25,
        'Cyrillic' => 14,
        'CJK Unified Ideographs' => 12,
        'Arabic' => 9,
        'Hangul Syllables' => 5,
        'Hiragana' => 3,
    ];

    $this->assertEquals($expected, (new UnicodeRangeStats($text))->freq());
}

As you can see, the test instantiates a UnicodeRangeStats class, which runs Converter::unicode2range($char) on every character of the text, as shown below.

<?php

namespace Babylon;

use Babylon;
use UnicodeRanges\Converter;

/**
 * Unicode range stats.
 *
 * @author Jordi Bassagañas <info@programarivm.com>
 * @link https://programarivm.com
 * @license MIT
 */
class UnicodeRangeStats
{
    const N_FREQ_UNICODE_RANGES = 10;

    /**
     * Text to be analyzed.
     *
     * @var string
     */
    protected $text;

    /**
     * Unicode ranges frequency -- number of times that the unicode ranges appear in the text.
     *
     * Example:
     *
     *      Array
     *      (
     *         [Basic Latin] => 25
     *         [Cyrillic] => 14
     *         [CJK Unified Ideographs] => 12
     *         [Arabic] => 9
     *         [Hangul Syllables] => 5
     *         [Hiragana] => 3
     *          ...
     *      )
     *
     * @var array
     */
    protected $freq;

    /**
     * Constructor.
     *
     * @param string $text
     */
    public function __construct(string $text)
    {
        $this->text = $text;
    }

    /**
     * The most frequent unicode ranges in the text.
     *
     * @return array
     * @throws \InvalidArgumentException
     */
    public function freq(): array
    {
        // Start from a clean slate so that repeated calls don't accumulate counts.
        $this->freq = [];
        $chars = $this->mbStrSplit($this->text);
        foreach ($chars as $char) {
            $name = Converter::unicode2range($char)->name();
            $this->freq[$name] = ($this->freq[$name] ?? 0) + 1;
        }
        // Sort by frequency, highest first, keeping the range names as keys.
        arsort($this->freq);

        return array_slice($this->freq, 0, self::N_FREQ_UNICODE_RANGES);
    }

    /**
     * The most frequent unicode range in the text.
     *
     * @return string The name of the range, e.g. 'Basic Latin'
     * @throws \InvalidArgumentException
     */
    public function mostFreq(): string
    {
        return key(array_slice($this->freq(), 0, 1));
    }

    /**
     * Converts a multibyte string into an array of chars, ignoring whitespace.
     *
     * @param string $text
     * @return array
     */
    private function mbStrSplit(string $text): array
    {
        // Remove all whitespace, then split on UTF-8 character boundaries.
        $text = preg_replace('/\s+/', '', $text);

        return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
}
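For readers who want to experiment without installing Babylon, here is a rough, self-contained sketch of the same counting logic. It is only an approximation of what the class above does: `SAMPLE_RANGES` and `rangeFrequencies()` are hypothetical, the table covers just the three ranges used by the sample text, and characters outside the table are silently skipped. It assumes PHP 7.4+ with mbstring.

```php
<?php
// Hypothetical table, limited to the ranges needed by the sample text.
const SAMPLE_RANGES = [
    'Basic Latin' => [0x0020, 0x007F],
    'Cyrillic' => [0x0400, 0x04FF],
    'Hangul Syllables' => [0xAC00, 0xD7AF],
];

// Hypothetical helper: counts how many characters fall in each range,
// most frequent range first.
function rangeFrequencies(string $text): array
{
    $freq = [];
    // Drop whitespace, then walk the text one character at a time.
    $chars = mb_str_split(preg_replace('/\s+/', '', $text), 1, 'UTF-8');
    foreach ($chars as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        foreach (SAMPLE_RANGES as $name => [$first, $last]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                $freq[$name] = ($freq[$name] ?? 0) + 1;
                break;
            }
        }
    }
    arsort($freq);

    return $freq;
}

$freq = rangeFrequencies('мале hola que tal 토마토쥬스');
print_r($freq);                        // Basic Latin: 10, Hangul Syllables: 5, Cyrillic: 4
echo array_key_first($freq) . PHP_EOL; // Basic Latin
```

Taking the first key of the sorted array is the simplest form of alphabet detection: the dominant range is a strong hint about the script the text is written in.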

That's all for now!

Today I showed you a few applications of the Unicode Ranges library:

  • Random phrases (tokens) with UTF chars
  • Alphabet detection
  • Frequency analysis of Unicode ranges

Could you think of any more to add to this list?

Any ideas are welcome! Thank you for reading today's post and sharing your views with the community.

GitHub Account

https://github.com/programarivm