Entity Resolution: Insights and Implications for AI Applications

Software systems may start simple, but adding sophisticated features while ensuring maintainability introduces complexity, fueling the age-old ‘build versus buy’ dilemma in software acquisition. Every team must weigh the pros and cons of developing new technology in-house against procuring it from third parties. Factors such as cost, implementation timeline, and technical risk all come into play, and the right decision depends on each organization’s needs, resources, and risk tolerance.

Entity resolution (ER), software that sits firmly on my “Don’t Try This At Home” list, is the process of systematically connecting disparate data records that represent the same real-world entity, such as a customer, product, or company. It matters because poor data quality degrades downstream analytics and AI applications. Although ER may appear straightforward at first, it’s a complex problem with a wide range of applications, including customer data management, fraud detection, data quality enhancement, data integration, data governance, and business intelligence.
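To make the core idea concrete, here is a toy sketch of record matching: fuzzy-compare the fields two records share and declare a match above a threshold. The field names and the 0.85 threshold are purely illustrative assumptions, and real ER systems (Senzing included) use far more sophisticated techniques.

```python
# Toy illustration of entity resolution's core question: do two records
# describe the same real-world entity? (Not a production approach.)
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def same_entity(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    """Average field similarity over shared fields, compared to a threshold."""
    fields = set(rec1) & set(rec2)
    score = sum(similarity(str(rec1[f]), str(rec2[f])) for f in fields) / len(fields)
    return score >= threshold

a = {"name": "Jon Smith",  "email": "jon.smith@example.com"}
b = {"name": "John Smith", "email": "jon.smith@example.com"}
print(same_entity(a, b))  # True: near-identical name, identical email
```

Even this toy version hints at the hard parts: choosing fields, weighting them, and picking thresholds that balance false matches against missed ones.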

More importantly, ER is an outstanding example of an application that combines big data, real-time processing, and AI. The lessons it teaches about accuracy, scale, and complexity are transferable and highly beneficial to other AI applications. As we delve deeper into the age of LLMs, insights from building ER systems are invaluable to the many AI applications confronting similar challenges.

I believe that building an in-house entity resolution system presents so many challenges that it is an unwise investment for most teams. First, maintaining accuracy at scale is a colossal task. Comparing every record against every other is feasible for small datasets, but the number of pairwise comparisons grows quadratically, becoming computationally impractical as data volumes reach millions or billions of records. The difficulty is exacerbated by the dynamic nature of data: new records stream in continuously and must be linked to existing ones accurately and promptly.
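One standard mitigation for the quadratic blow-up is “blocking”: group records by a cheap key and only compare pairs within a block. The sketch below uses the first letter of the surname as an assumed, deliberately naive blocking key; real systems use much smarter keys, and blocking alone is only one piece of an accurate-at-scale design.

```python
# Blocking sketch: avoid comparing all O(n^2) record pairs by grouping
# records on a cheap key and comparing only within each group.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smyth"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Johnson"},
]

blocks = defaultdict(list)
for rec in records:
    blocks[rec["surname"][0].upper()].append(rec)  # naive key: first letter

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (3, 4)] — 2 comparisons instead of 6
```

The trade-off is the crux of the accuracy-at-scale problem: a key that is too coarse barely reduces work, while one that is too fine silently drops true matches that land in different blocks.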


Sequence neutrality is another important requirement: the ability of a system to reach consistent decisions regardless of the order in which data arrives. It matters because new data can retrospectively change our understanding of past data, much as new information can change our understanding of a past conversation. A sequence-neutral entity resolution system identifies the same entities no matter what order the related records were received in. Most ER systems lack this property, which leads to inconsistent and inaccurate outcomes; the common workaround is to periodically reload all data, a time-consuming process that leaves inaccuracies in place between reloads. Designing a sequence-neutral system is exceptionally complex, especially given the potential need to correct prior assertions in real time as new information arrives.
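One way to build intuition for sequence neutrality: if resolution is framed as computing connected components over shared identifiers, the final clusters are independent of arrival order. The union-find sketch below is an assumed, simplified framing, and it sidesteps the genuinely hard part, which is retracting or revising prior assertions when fuzzy evidence changes, not just merging on exact identifiers.

```python
# Sequence-neutrality intuition: union-find over records and the
# identifiers they share yields the same clusters for every arrival order.
from itertools import permutations

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

records = [("r1", "alice@x.com"), ("r2", "alice@x.com"), ("r3", "bob@y.com")]

def resolve(stream):
    uf = UnionFind()
    for rec_id, email in stream:
        uf.union(rec_id, email)  # records sharing an identifier merge
    roots = {uf.find(k) for k in uf.parent}
    return frozenset(
        frozenset(k for k in uf.parent if uf.find(k) == root and k.startswith("r"))
        for root in roots
    )

# Every one of the 6 arrival orders yields the same clusters: {r1, r2}, {r3}.
results = {resolve(p) for p in permutations(records)}
print(len(results))  # 1
```

Real sequence neutrality is far harder than this, because real evidence is probabilistic and a late-arriving record can mean previously merged entities must be split apart again.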

Latency is another crucial factor, especially given the need for real-time data processing. Vector databases may bring a level of semantic understanding, but they can fall short of the millisecond latency requirements that some real-time applications, including KYC and fraud detection, demand. When a customer signs up and returns minutes later on a different channel, your entity resolution system must recognize them immediately.

These days, I recommend Senzing to teams looking for an entity resolution solution. It is a robust, efficient, and flexible system that is cost-effective and can be deployed on premises or in the cloud. It takes data privacy seriously, offering field hashing and application-level encryption, and it stays accurate at scale, processing large volumes of data efficiently and in real time: thousands of transactions per second, entity resolution in 100 to 200 milliseconds, and queries within tens of milliseconds, all while handling billions of records. Finally, Senzing upholds sequence neutrality, ensuring consistent identification of entities regardless of the order in which data arrives.


Entity resolution is a powerful example of how big data, real-time processing, and AI can be combined to solve complex problems. The insights garnered from ER’s challenges in maintaining accuracy, managing scale, and dealing with complexity can enrich other AI applications, enhancing their precision, scalability, and sophistication. The principles of ER can be applied directly to any AI application that involves identifying and linking entities, including fraud detection, customer relationship management, and natural language processing.

Building and maintaining an in-house entity resolution system can be costly and time-consuming. An advanced and easy-to-deploy solution like Senzing provides best-of-breed functionality, while freeing up your resources to focus on other more important priorities.

If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
