What is synthetic data?

Synthetic data is artificially generated data that is not collected from real-world events. It replicates a real dataset, preserving its statistical fidelity and analytical utility while offering varying degrees of privacy protection. It can be used for:

Guaranteeing privacy and compliance when sharing datasets (for example with quality assurance, product development and other analytics teams)
Removing bias by upsampling rare events
Balancing datasets
Augmenting existing datasets to improve the performance of machine learning models or for use in stress testing
Filling in missing values intelligently based on context
Simulating new scenarios and hypotheses
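
As a rough illustration of upsampling rare events to balance a dataset, the sketch below fits a single Gaussian to the rare class and samples new rows from it. This is a deliberately simple stand-in and not how YData's Synthesizers work internally; the dataset, the `upsample_rare_class` helper and its parameters are all hypothetical and exist only for this example.

```python
import random
from collections import Counter
from statistics import NormalDist

random.seed(0)

# Hypothetical imbalanced dataset of (amount, label) rows with few "fraud" rows.
real_rows = [(random.gauss(50, 10), "normal") for _ in range(95)]
real_rows += [(random.gauss(400, 50), "fraud") for _ in range(5)]


def upsample_rare_class(rows, rare_label, target_count, seed=0):
    """Add synthetic rows for `rare_label` until it has `target_count` rows.

    Toy generator (an assumption for illustration): one Gaussian fitted
    to the rare class values, then sampled.
    """
    rare_values = [value for value, label in rows if label == rare_label]
    model = NormalDist.from_samples(rare_values)  # needs at least 2 values
    missing = max(0, target_count - len(rare_values))
    synthetic = [(value, rare_label) for value in model.samples(missing, seed=seed)]
    return rows + synthetic


balanced_rows = upsample_rare_class(real_rows, "fraud", target_count=95)
print(Counter(label for _, label in balanced_rows))
```

Unlike duplicating existing rows, sampling from a fitted model produces new values for the rare class. A single Gaussian only captures one column's mean and spread, which is why real datasets call for the richer models described in the following sections.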

Synthetic data is not, however, entirely new. Techniques for generating it have existed for a long time, but they have struggled with high-dimensional datasets, high-cardinality categorical data, reliance on manually tuned parameters and bias handling. Recent advances in the state of the art have delivered improvements in all of these areas.

<aside> 👉 To learn more about Synthetic Data, check out “Everything You Always Wanted to Know About Synthetic Data”.

</aside>

And what are Synthesizers?

YData’s Synthesizers offer a simplified interface to train, assess the quality of and interact with state-of-the-art Machine Learning models capable of generating data that mimics a specific dataset in the Data Catalog. Synthesizers are fully data-driven, unsupervised and automated: they learn the underlying data distributions automatically while abstracting away the complexity of creating and training these models (including infrastructure) behind an easy-to-use interface.
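
To make the train, sample and assess workflow concrete, here is a minimal sketch. The `ToySynthesizer` class, its `fit` and `sample` methods and the `fidelity_gap` helper are hypothetical names invented for this example and are not YData's API; the "model" is just one independent Gaussian per numeric column, far simpler than the models a real Synthesizer trains.

```python
from statistics import NormalDist, fmean, stdev


class ToySynthesizer:
    """Hypothetical stand-in (not YData's API): one Gaussian per numeric column."""

    def fit(self, columns):
        # columns: dict mapping column name -> list of floats
        self.models = {
            name: NormalDist.from_samples(values) for name, values in columns.items()
        }
        return self

    def sample(self, n_rows, seed=0):
        # Columns are sampled independently, so correlations are NOT preserved.
        return {
            name: model.samples(n_rows, seed=seed + i)
            for i, (name, model) in enumerate(self.models.items())
        }


def fidelity_gap(real, synthetic):
    """Per-column gap between real and synthetic means, in real standard deviations."""
    return {
        name: abs(fmean(real[name]) - fmean(synthetic[name])) / stdev(real[name])
        for name in real
    }


# Made-up example data.
real = {
    "age": [23.0, 35.0, 41.0, 29.0, 52.0, 47.0, 38.0, 31.0],
    "income": [32_000.0, 54_000.0, 61_000.0, 45_000.0, 83_000.0, 72_000.0, 58_000.0, 49_000.0],
}
synthesizer = ToySynthesizer().fit(real)
synthetic = synthesizer.sample(n_rows=1_000)
print(fidelity_gap(real, synthetic))
```

The point of the sketch is the shape of the workflow: the caller supplies data, the model learns its distributions without manual parameters, and a simple metric compares the generated data against the original.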

The quality of synthetic data depends heavily on the use case and should be evaluated in that context. As such, YData automatically generates a quality report that describes each Synthesizer along three fundamental axes of quality:

statistical fidelity when compared with the original data
analytical utility