The Rise of Synthetic Data
The hunt for data has become the new gold rush of 2019. Every company is trying to harvest its own data and acquire whatever else will make the firm more efficient and profitable. As a result, data is rising dramatically in value and cost, creating new challenges for its acquisition and use. That is a serious problem for a society heading toward a future driven by machine learning.
As machine learning becomes more prevalent in our society, whether training mobile personal assistants to recognize the human voice, powering video surveillance, or predicting the next word in a Google search, one mechanism has always sat at its core: training data.
Whether engineers use unsupervised or supervised machine learning models, computers rely on new data to handle new situations they might encounter. Recently, however, there has been a rise in the use of synthetic training data, data fabricated by existing AI models, and its value is tremendous.
On the surface, data is expensive. Training a computer with millions of data points, each manually labeled, is extremely labor- and time-intensive. The rise of synthetic data offers a solution. Small companies wishing to develop proprietary training models and related products now have a far cheaper, and thus significantly more accessible, way to build them.
Beyond the pricing and accessibility of training data, synthetic training data has the potential to be more expansive and to cover a broader range of training information. For example, researchers at MIT found that the core reason facial-recognition AI models had trouble identifying people of color and other minorities was that the training data sets fed to them were overwhelmingly white. When a majority group dominates real-world data, that imbalance carries over into any training set drawn from it and acts as a fundamental barrier to AI's ability to recognize a minority. Synthetic data does not face that obstacle: one can simply generate a diverse, balanced set of data and train machine learning models with that.
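To make the idea concrete, here is a minimal sketch of that balancing step. The `balance_with_synthetic` function and the toy Gaussian sampler are illustrative assumptions, a stand-in for the far more sophisticated generative models the article describes; real systems would use learned generators rather than a simple Gaussian fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_with_synthetic(X, y, minority_label):
    """Augment an under-represented class with synthetic samples drawn
    from a Gaussian fitted to that class's real examples (a toy
    stand-in for a learned generative model)."""
    X_min = X[y == minority_label]
    # How many synthetic points are needed to match the majority count.
    n_needed = int((y != minority_label).sum() - len(X_min))
    mean = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)
    X_syn = rng.multivariate_normal(mean, cov, size=n_needed)
    y_syn = np.full(n_needed, minority_label)
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])

# Imbalanced toy data set: 90 majority samples vs. 10 minority samples.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = balance_with_synthetic(X, y, minority_label=1)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # both classes now 90
```

The same pattern, generate synthetic examples for whichever group the real data under-represents, is what allows a fabricated training set to sidestep the majority bias baked into data collected from society.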
However, despite the many merits of synthetic training data, the technology is not yet mature enough for very sensitive applications; it is not as accurate as real data. Self-driving cars, for example, require real-life testing data to navigate dangerous situations, so relying on fabricated data could pose a significant threat to society. In addition, the models used to generate synthetic data run on fixed algorithms and are therefore pseudorandom, while the best, most accurate AI models incorporate truly random real-world data and are thus far more extensive in their behaviors, able to handle far more intricate situations.
Despite its drawbacks, synthetic data has the potential to pave the way for machine learning models trained at lower cost, on broader data, and, if it continues to develop at its current pace, as accurately as models trained on real data.
Written by Vishal Dhileepan & Edited by Alexander Fleiss