Improving Data Privacy in AI Systems Using Secure Multi-Party Computation

In the financial services sector and beyond, accessing comprehensive data for building models and reports is a critical yet challenging task. During my time working in financial services, we aimed to use data to understand customers fully, but siloed information across separate systems posed significant obstacles to achieving a complete view. This issue underscores the importance of data sharing—a theme so central it inspired the name of my podcast, The Data Exchange.

As reliance on analytics and AI grows, data privacy becomes increasingly critical. Teams often need access to sensitive datasets containing personally identifiable information (PII) or proprietary data to build accurate models. However, regulations and policies restrict data sharing and direct access, creating roadblocks for data science and machine learning teams. Common challenges include lengthy access wait times, resorting to synthetic data that offers subpar accuracy, and the inability to collaborate across teams or with partners.

Limitations of Current Approaches

Various methods have been employed to mitigate these risks. Techniques like hashing or adding noise to data, while common, do not provide robust protection during collaborative computations. Federated learning, despite its decentralized approach, lacks formal privacy guarantees and remains vulnerable to attacks such as gradient inversion, in which an adversary reconstructs training data from shared model updates.

Data clean rooms have emerged as a vital tool, providing a digital vault where sensitive data can be shared and analyzed under strict privacy protocols. However, relying on vendor-provided clean rooms restricts the kinds of analysis teams can run, ties them to the vendor’s tools and timelines, and reduces user control.

In contrast, building proprietary clean rooms in-house offers greater flexibility, oversight, and long-term cost savings, while also fostering internal data and AI expertise, though compliance then rests on internal oversight rather than vendor support. A hybrid approach, combining rented and proprietary clean rooms, balances vendor expertise with control over core data storage and intellectual property.

Even with data obfuscation, traditional clean rooms risk exposure whenever data leaves the protected environment. Recent advances in secure multi-party computation (SMPC), however, pave the way for an open, decentralized software model in which computations run on encrypted data on-premises, eliminating the need to transfer data externally.

An Innovative Approach with Secure Multi-Party Computation

SMPC overcomes the limitations of traditional data privacy methods. It is a cryptographic technique that allows multiple parties to jointly compute a function over their sensitive inputs while keeping those inputs hidden from one another, even during the collaboration. It is akin to a group of people solving a puzzle together without revealing their individual pieces.
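To make the idea concrete, here is a minimal sketch of one classic SMPC building block, additive secret sharing, in Python. This illustrates the general technique rather than Pyte’s protocol: each party splits its private value into random shares that only sum to the original value modulo a large prime, so no single share, and no partial sum, reveals anything on its own.

```python
import secrets

P = 2**61 - 1  # large prime modulus; all share arithmetic is done mod P

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n_parties random shares that sum to `value` mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three parties, each with a private input no one else may see
private_inputs = {"alice": 42, "bob": 17, "carol": 99}
n = len(private_inputs)

# Each party splits its input and distributes one share to every party
shares_held = [[] for _ in range(n)]
for value in private_inputs.values():
    for i, share in enumerate(make_shares(value, n)):
        shares_held[i].append(share)

# Each party locally sums the shares it holds; a partial sum alone is random noise
partial_sums = [sum(shares) % P for shares in shares_held]

# Only by combining all partial sums is the joint result revealed
total = sum(partial_sums) % P
assert total == sum(private_inputs.values())
print(f"Joint sum computed without exposing any individual input: {total}")
```

Production SMPC protocols extend this idea beyond addition to multiplication and comparisons, which is what makes general-purpose computation on private inputs possible.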

Pyte, a startup specializing in SMPC, offers scalable, performant solutions that let teams work on real datasets without exposing sensitive information. Its tools, designed with compliance and governance in mind, ensure data confidentiality and help prevent leaks and copyright issues. Pyte’s SecureMatch platform enables the discovery of audience overlaps and data activation with partners while keeping each party’s data secure and under its control. PrivateML allows data scientists to securely experiment with machine learning models on real datasets without direct access to the sensitive data itself.

SMPC also enables crucial new use cases for LLMs that involve sensitive or proprietary data from multiple organizations. For example, law firms, financial institutions, and healthcare companies could leverage LLMs to process confidential contracts, transactions, or medical records without disclosing the actual content to any third party. SMPC does introduce additional computational cost, but for many regulated industries that is an acceptable tradeoff for the privacy guarantees it provides. More broadly, SMPC unlocks opportunities for secure collaboration: multiple organizations could jointly fine-tune an LLM on their combined datasets to produce a superior model, without ever sharing raw data with one another. That overhead is small compared to the gains in privacy, compliance, and expanded access to sensitive data.
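As a rough illustration of how such collaboration can work, the sketch below implements secure aggregation, a building block often used for privacy-preserving collaborative training. Each organization masks its model update with pairwise random masks that cancel when the updates are summed, so the aggregator learns only the average. This is a toy, not Pyte’s implementation: the fixed seeds stand in for a real pairwise key agreement, and production protocols must also handle dropouts and malicious parties.

```python
import numpy as np

n_parties, dim = 3, 4
rng = np.random.default_rng(0)

# Each organization's private model update (e.g., gradients or weight deltas)
updates = [rng.normal(size=dim) for _ in range(n_parties)]

# Every pair (i, j) derives the same random mask from a shared seed
# (in practice established via key exchange; a fixed seed here for brevity)
def pair_mask(i: int, j: int) -> np.ndarray:
    seed = 1000 * min(i, j) + max(i, j)
    return np.random.default_rng(seed).normal(size=dim)

# Each party masks its update: add the mask toward higher-indexed peers,
# subtract it toward lower-indexed peers, so masks cancel pairwise in the sum
masked = []
for i in range(n_parties):
    m = updates[i].copy()
    for j in range(n_parties):
        if j == i:
            continue
        m += pair_mask(i, j) if i < j else -pair_mask(i, j)
    masked.append(m)  # safe to send: the mask hides the true update

# The aggregator sums the masked updates; the pairwise masks cancel out
aggregate = sum(masked) / n_parties
assert np.allclose(aggregate, sum(updates) / n_parties)
print("Average update recovered without seeing any individual update")
```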

The Future of Privacy-Preserving Analytics

Pyte’s solutions have already shown significant impact in financial services, enabling faster and more collaborative data science on sensitive data. Banks and brokerages have cut the time to access sensitive datasets from more than six months to between one and six months, and a leading consumer packaged goods company has been able to collaborate on a machine learning model for product discovery. Looking ahead, Pyte plans to integrate with various data platforms and to introduce AutoML capabilities for collaborative machine learning, further extending the potential of SMPC in AI and analytics.

With powerful privacy-preserving technologies like SMPC, the barriers to accessing and utilizing sensitive data are being dismantled, allowing organizations to take full advantage of analytics and AI. This marks a significant step forward for privacy-preserving analytics and sets a new standard for how data is used in the financial services industry and beyond.


Update (2024-03-21): Pyte’s CEO and co-founder, Sadegh Riazi, was a guest on The Data Exchange podcast.


If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.
