On the difficulty of reading numbers in different languages

This blog post illustrates how difficult it is for a simple seq2seq model to learn to translate numbers written in different languages (e.g. French, English, Chinese, Malay) into their digit (base-10) representation. It is based on the very good deep learning tutorials by Olivier Grisel and Charles Ollion. Note that this is a very simple seq2seq model; see fairseq or Sockeye for more sophisticated ones.

The experiment: We track how the model converges to perfect prediction on the test set as a function of the training set size. The faster the accuracy increases, the easier the learning task, i.e. the fewer training examples the model needs. The training set consists of randomly chosen numbers between 1 and 999,999. The model is fed the language representation as input and has to output the digit (base-10) representation.
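To make the setup concrete, here is a minimal sketch of how training pairs for the English variant could be generated. The word converter below is a simplified stand-in written for this post, not the one used in the original experiment, and the sampling range follows the 1 to 999,999 range stated above.

```python
import random

# Simplified number-to-words converter (hypothetical stand-in; the original
# experiment's exact spelling conventions, hyphens, "and", etc. may differ).
UNITS = ["", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    """Spell out 1 <= n <= 999,999 in (simplified) English."""
    if n < 20:
        return UNITS[n]
    if n < 100:
        return (TENS[n // 10] + " " + UNITS[n % 10]).strip()
    if n < 1000:
        return (UNITS[n // 100] + " hundred " + to_words(n % 100)).strip()
    return (to_words(n // 1000) + " thousand " + to_words(n % 1000)).strip()

# Source sequence = words, target sequence = digit string.
random.seed(0)
for _ in range(3):
    n = random.randint(1, 999_999)
    print(to_words(n), "->", str(n))
```

The seq2seq model sees only these (word sequence, digit sequence) pairs and must infer the mapping from data alone.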

TL;DR
  • Chinese is the easiest to learn, then French (despite its seemingly many special cases, such as ‘vingt’ vs. ‘vingts’ and ‘cent’ vs. ‘cents’), closely followed by Malay. English is not that easy (maybe because of the hyphens, ‘-’, that have to be forgotten).
  • By looking at French examples, we might think that the model acquires some basic arithmetic reasoning. Consider:
    • “quatre vingts”, literally “four twenty”, stands for “80” (and not 420), i.e. it has to be interpreted as four times twenty; or, even more complicated:
    • “quatre vingt onze mille”, literally “four twenty eleven thousand”, stands for “91000” (and not 420111000), i.e. it has to be interpreted as (4 * 20 + 11) * 1000.
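The arithmetic hidden in these French examples can be made explicit with a tiny symbolic parser. This is a toy written for illustration only (it ignores hyphens, “et”, and most of the vocabulary); the point is that each word triggers either an addition or a multiplication, which is exactly the structure the seq2seq model would have to discover on its own.

```python
# Toy parser for (a fragment of) French number words.
# Hypothetical helper, not part of the original model.
VALUES = {
    "un": 1, "deux": 2, "trois": 3, "quatre": 4, "cinq": 5,
    "six": 6, "sept": 7, "huit": 8, "neuf": 9, "dix": 10,
    "onze": 11, "douze": 12, "vingt": 20, "vingts": 20,
    "trente": 30, "quarante": 40, "cinquante": 50, "soixante": 60,
    "cent": 100, "cents": 100, "mille": 1000,
}

def french_to_int(tokens):
    total, current = 0, 0
    for tok in tokens:
        v = VALUES[tok]
        if v == 1000:
            # "mille" multiplies everything accumulated so far:
            # "quatre vingt onze mille" -> (4 * 20 + 11) * 1000
            total += max(current, 1) * v
            current = 0
        elif current and v > current:
            # a larger value after a smaller one is multiplicative:
            # "quatre vingts" -> 4 * 20
            current *= v
        else:
            # otherwise additive: "soixante dix" -> 60 + 10
            current += v
    return total + current

print(french_to_int("quatre vingts".split()))            # 80
print(french_to_int("quatre vingt onze mille".split()))  # 91000
```

The add/multiply decision rule is what makes French numbers compositional rather than a lookup table, which may explain why the model handles them relatively well once it has seen enough examples.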

To check whether the model is able to acquire some basic arithmetic skills, we added the task of translating from hexadecimal to base-10 digits. Given its poor results, it is unlikely that the model learns any arithmetic at all to perform its translation task. To be fair, this task is harder (implicit base 16, and exponentiation based on the digit position). More on that in later posts…
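For reference, the hexadecimal training pairs can be generated in a few lines (the exact formatting used in the experiment is an assumption; the sampling range matches the one stated above):

```python
import random

# Each pair is (source sequence, target sequence):
# the base-16 string of n and its base-10 string.
random.seed(0)
pairs = [(format(n, "x"), str(n))
         for n in (random.randint(1, 999_999) for _ in range(3))]
print(pairs)
```

Unlike number words, recovering the decimal value here requires summing digit * 16**position, a genuinely arithmetic operation with no lexical cues to lean on.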

Read Gautier Marti’s full paper: On the difficulty of reading numbers in different languages