Many domains and industries would benefit from sharing data for predictive modeling, but for privacy, legal/regulatory, competitive, or other reasons, parties may be unwilling or unable to share information with each other. Differential privacy (DP), in which carefully calibrated random noise is added during model development, offers a way to share (approximate) information while mathematically guaranteeing a level of privacy for each party.
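To make the idea of adding calibrated noise concrete, the sketch below shows one common way DP is applied to model training: each example's gradient is clipped to a fixed norm, and Gaussian noise scaled to that norm is added before averaging. This is an illustration only and is not necessarily the method developed in Year 6; the function name `dp_average_gradient` and the parameters `clip_norm` and `noise_multiplier` are assumptions introduced here.

```python
import numpy as np

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Illustrative noisy gradient average (hypothetical helper).

    Clips each per-example gradient to L2 norm <= clip_norm, sums them,
    adds Gaussian noise with std = noise_multiplier * clip_norm, and
    divides by the batch size.
    """
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for gradients already within the bound.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Clipping bounds any single example's contribution, so the noise
    # can be calibrated to clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]

rng = np.random.default_rng(0)
batch_grads = [[3.0, 4.0], [0.0, 0.5]]
noisy_grad = dp_average_gradient(batch_grads, clip_norm=1.0,
                                 noise_multiplier=1.1, rng=rng)
```

The privacy level actually achieved depends on the noise multiplier, the sampling rate, and the number of training steps, which a separate privacy accountant would track; that bookkeeping is omitted here.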
Year 6 of the project developed a DP method for jointly training a deep learning model applied to credit card fraud detection. Year 6 results successfully demonstrated multi-party deep learning under certain conditions and simplifying assumptions. In Year 7, we intend to build on the results of Year 6, examine alternative learning solutions, and relax some key assumptions, moving towards more realistic settings.
Year 7 will introduce dataset heterogeneity: it will no longer be assumed that each party collects the same data on its customers and/or has the same customer (i.e., data) distribution. In this scenario, it may no longer make sense to jointly develop a single general model that fails to account for local peculiarities in each party’s data. To address this, we plan to incorporate generative adversarial networks (GANs), which are models that learn the underlying structure of a dataset and can generate synthetic data that mimics the distribution of the real data.
The methods developed in Year 6 can be used to train DP GANs, which can then be shared with other parties and used to augment the training of local models. GANs can also be used to address issues with imbalanced data; our dataset has a fraud rate of 0.14%. We also plan to use the discriminator model, which the literature often overlooks outside of training, to pre-select data from the other GANs with a view to optimizing local model development.
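As a rough illustration of these two uses, the sketch below first computes how many synthetic fraud records would be needed to raise the fraud rate to a chosen target, and then keeps only the synthetic records that a discriminator scores highest. It assumes that the discriminator returns one score per record and that a higher score means the record looks more like the local party's real data; the function names, the target rate of 5%, and the example counts are hypothetical and are not taken from the project.

```python
import math
import numpy as np

def synthetic_fraud_needed(n_total, n_fraud, target_rate):
    """Number of synthetic fraud records s so that
    (n_fraud + s) / (n_total + s) >= target_rate."""
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be in (0, 1)")
    needed = (target_rate * n_total - n_fraud) / (1.0 - target_rate)
    return max(0, math.ceil(needed))

def select_by_discriminator(samples, scores, keep_fraction):
    """Keep the top keep_fraction of samples ranked by discriminator score.

    `scores` stands in for the output of a (hypothetical) local
    discriminator applied to synthetic records from another party's GAN.
    """
    samples = np.asarray(samples)
    scores = np.asarray(scores, dtype=float)
    k = int(np.ceil(keep_fraction * len(samples)))
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return samples[top]

# Example: 1,000,000 records at a 0.14% fraud rate, i.e. 1,400 fraud cases.
n_extra = synthetic_fraud_needed(n_total=1_000_000, n_fraud=1_400,
                                 target_rate=0.05)

candidates = np.array([[0], [1], [2], [3]])
kept = select_by_discriminator(candidates, scores=[0.1, 0.9, 0.5, 0.7],
                               keep_fraction=0.5)
```

Ranking by score is only one possible selection rule; a fixed score threshold, or sampling in proportion to the score, would be equally plausible choices.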
The primary challenge foreseen is the development of GANs for discrete data, for which there is currently no established method, and the integration of this discrete model into a joint discrete/continuous GAN.