2024/12/06 【Dec. 5 (US) / Dec. 6 (Taiwan)】Fast-ER: GPU-Accelerated Record Linkage in Python
Fast-ER: GPU-Accelerated Record Linkage in Python
by Dr. R. Michael Alvarez & Jacob Morrier (California Institute of Technology)
Date: Thursday, December 5, 9:00 PM to 11:00 PM (USA Central Time, GMT-6)
Registration: https://reurl.cc/36Dz00
Abstract: Record linkage, also called "entity resolution," consists of matching observations in two datasets that refer to the same unit, even when consistent common identifiers are absent. This process typically involves computing string similarity metrics, such as the Jaro-Winkler metric, for all pairs of values between the datasets. The Fast-ER package accelerates these computations with graphics processing units (GPUs). It estimates the parameters of the Fellegi-Sunter model, a widely used probabilistic record linkage model, and performs the necessary data processing on CUDA-enabled GPUs. Our experiments demonstrate that this approach speeds up processing by a factor of more than 60 compared to the previous leading software implementation, reducing processing time from hours to minutes. This significantly improves the scalability of probabilistic record linkage and deduplication for large datasets.
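
To make the all-pairs comparison step in the abstract concrete, here is a minimal pure-Python sketch of the Jaro-Winkler metric applied to every pair of values from two lists. It runs on the CPU and is not Fast-ER's API or its GPU implementation. The function names (`jaro`, `jaro_winkler`, `pairwise_similarity`) are hypothetical, and the prefix weight of 0.1 and maximum prefix length of 4 are the conventional defaults, assumed here for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


def pairwise_similarity(left: list[str], right: list[str]) -> list[list[float]]:
    """Similarity of every value in `left` against every value in `right`."""
    return [[jaro_winkler(a, b) for b in right] for a in left]


if __name__ == "__main__":
    # Made-up example values, not data from the talk.
    names_a = ["MARTHA", "DIXON"]
    names_b = ["MARHTA", "DICKSONX"]
    for row in pairwise_similarity(names_a, names_b):
        print([round(x, 4) for x in row])
```

The nested loop in `pairwise_similarity` performs one comparison per pair, so the work grows with the product of the two dataset sizes. Each comparison is independent of the others, which is why this step lends itself to parallel execution on a GPU.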