Collection agencies record thousands of customer interactions, if not more, every single day for quality assurance and audit purposes. And every single call requires verification.
Agents need to verify that the person they’re speaking with is the right party before they disclose any debt details — even the fact that a debt exists.
Once verification occurs, a debtor who agrees to pay will need to share sensitive bank account or card details. No big deal, except that this information is then stored in the call recordings.
Handling and storing personally identifiable information (PII) and payment card information (PCI) securely is a top priority for any agency. Any data leak or loss opens the agency and its clients up to legal risk. Perhaps most importantly, a leak or oversight violates consumer trust.
So, what’s the solution? A reliable and accurate redaction service to protect data security. Prodigal has built a state-of-the-art Redaction AI model to solve this problem.
Redaction is the process of editing a piece of data, such as text, audio, or video, to conceal or remove information deemed confidential. In the context of Prodigal’s task, it means removing PCI and/or PII from call recordings and transcriptions.
So, what do PCI and PII consist of? You can find the broad definitions here.
For our modeling purposes, we support the following entities:
Redaction is essentially a task that falls under the umbrella of the Named Entity Recognition (NER) problem in machine learning.
In NER, we model the probability that each word belongs to one of N labels. Each distinct label is called an entity, and we have as many entities as there are types of information we are trying to extract or redact.
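To make that concrete, here is a minimal sketch of the token-classification framing, using an illustrative BIO label scheme (the label names and tokens are hypothetical, not Prodigal's production schema):

```python
# Illustrative only: redaction framed as per-token classification.
# B-/I- prefixes mark the beginning/inside of an entity (BIO scheme);
# "O" means the token carries no sensitive information.
tokens = ["my", "social", "is", "four", "five", "six", "one", "two"]
labels = ["O", "O", "O", "B-SSN", "I-SSN", "I-SSN", "I-SSN", "I-SSN"]

# Redaction then replaces every token whose label is not "O".
redacted = [t if l == "O" else "[REDACTED]" for t, l in zip(tokens, labels)]
print(" ".join(redacted))  # my social is [REDACTED] [REDACTED] ...
```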
Even before NER became popular, people were solving the redaction problem with regular expressions. That approach comes with its own problems: it is fairly hard to constrain a regex boundary to only the entity you want to capture, so you often redact more than necessary. Regexes were also limited to entities with a clear keyword marking their start or end, and such a method could not properly leverage the context of the conversation to redact well.
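As a hypothetical illustration of that over-redaction failure mode (this is not Prodigal's approach), consider a keyword-anchored pattern for card details:

```python
import re

# Naive keyword-anchored rule: redact everything after the word "card"
# up to the next sentence boundary.
pattern = re.compile(r"card[^.]*", re.IGNORECASE)

text = "I'll pay with my card ending 4242 if that works for you."
print(pattern.sub("[REDACTED]", text))
# -> "I'll pay with my [REDACTED]."
# The digits are gone, but so is "if that works for you", and an
# utterance that never says "card" would not be redacted at all.
```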
After regexes, bidirectional long short-term memory models (Bi-LSTMs) became the popular choice for NER tasks. They are effective at reducing over-redaction and at using conversational context to generate high-quality labels. But computation in Bi-LSTMs is sequential in nature, which makes them hard to accelerate on GPUs; training becomes lengthy and inference slow. Additionally, Bi-LSTMs cannot use long-range context as effectively, because the signal degrades over long sequences.
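For reference, a minimal Bi-LSTM token tagger might look like the PyTorch sketch below (the vocabulary size, dimensions, and label count are placeholders). The step-by-step recurrence inside `nn.LSTM` is exactly the sequential computation that limits GPU acceleration:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal Bi-LSTM sequence tagger with illustrative dimensions."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden=256, num_labels=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs a forward and a backward pass, so the
        # classifier sees context from both directions of the utterance.
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)
        # The LSTM consumes tokens one step at a time; this sequential
        # dependency is what makes training and inference slow.
        x, _ = self.lstm(x)                  # (batch, seq_len, 2 * hidden)
        return self.classifier(x)            # per-token label logits

logits = BiLSTMTagger()(torch.randint(0, 10_000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 9])
```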
Transformers have sprung up recently, giving us state-of-the-art results in many NLP applications. They are highly parallelizable, and self-attention lets every token attend directly to every other token in the input, so long-range context does not degrade the way it does in recurrent models. This solves the major problems with Bi-LSTMs.
Prodigal uses BERT, a particular kind of Transformer, for its NER modeling task.
The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. It’s a bidirectional Transformer pre-trained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia. Pre-training a BERT model is a costly and lengthy affair — we thank Hugging Face for their easy-to-use API, which helped us access a pre-trained BERT model easily.
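With the Hugging Face transformers library, loading a pre-trained BERT and attaching a token-classification head takes just a few lines; the checkpoint name below is the standard public one, and the label count is illustrative:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# "bert-base-uncased" is the public pre-trained checkpoint; num_labels
# would match the entity schema (the value here is illustrative).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9
)
```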
Once we’d established which model we’d use, our in-house Prodigal annotation team completed the annotations.
Training involved a series of steps, depicted in the diagram below.
We fine-tuned the pre-trained BERT model on the token classification task with a log-loss minimization objective. Masked language modeling is a good fit for BERT’s pre-training here because, in the downstream NER task, tokens can go missing due to transcription-engine errors, and the model needs to make the best use of the available context to make the right prediction. Other modeling approaches we tried (spaCy, DistilBERT) were not as accurate as BERT.
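A condensed sketch of that fine-tuning step, assuming tokenized inputs with one aligned label per token (the batch below is a random placeholder, not real data):

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9  # illustrative label count
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Placeholder batch: input_ids/attention_mask come from the tokenizer,
# labels hold one entity id per token (use -100 for tokens to ignore).
batch = {
    "input_ids": torch.randint(0, 30_522, (4, 64)),
    "attention_mask": torch.ones(4, 64, dtype=torch.long),
    "labels": torch.randint(0, 9, (4, 64)),
}

model.train()
loss = model(**batch).loss  # cross-entropy (log loss) over token labels
loss.backward()
optimizer.step()
optimizer.zero_grad()
```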
The validation-set loss was used as the objective criterion for picking the best-performing checkpoint of our model, and we performed token-level evaluation for our modeling exercise.
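Token-level evaluation simply compares each token's predicted label against its gold label. Here is a hedged sketch of per-entity precision and recall (the label names are again hypothetical):

```python
def token_level_scores(gold, pred, entity):
    """Per-entity token precision/recall from aligned label sequences."""
    tp = sum(g == entity and p == entity for g, p in zip(gold, pred))
    fp = sum(g != entity and p == entity for g, p in zip(gold, pred))
    fn = sum(g == entity and p != entity for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["O", "B-SSN", "I-SSN", "O"]
pred = ["O", "B-SSN", "O", "O"]
print(token_level_scores(gold, pred, "I-SSN"))  # (0.0, 0.0): the I-SSN token was missed
```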
Given Prodigal’s focus on the consumer finance vertical and our Prodigal AI Intent Engine design, we handily beat out Amazon’s redaction models in accuracy.
Prediction Examples
Here's a (fictional) sample call so you can see how the redaction model works.
We've highlighted the pieces of information our AI redacts and tagged each with the category that requires redaction. The numbers you see after those tags are our model's confidence in its prediction: 1.0 represents total confidence.
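Those confidence numbers come straight from the model's output distribution over labels. A minimal sketch of extracting them at inference time, under the same illustrative setup as above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9  # illustrative, untrained head
)
model.eval()

inputs = tokenizer("my social is four five six one two", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, num_labels)

probs = logits.softmax(dim=-1)             # per-token distribution over labels
confidence, label_ids = probs.max(dim=-1)  # top label and its confidence
# A confidence of 1.0 means all probability mass sits on one label;
# lower values signal uncertainty, e.g. when context has been garbled.
```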
But what if there are errors in the transcription?
Changes made to introduce noise:
social → vocal
phone number missed
card → car
CVV code missed
As expected, the confidence scores for those NER labels drop as important context is removed or altered. This underscores how important context is for the model to make the right predictions.
SSN (0.99→0.39)
phone number (1→0.96)
card number (1→0.91)
CVV code (1→0.99)
While we are outperforming some of the market leaders in redaction capabilities today, there is room to grow.
Want to join us as we build the intelligence layer of consumer finance? We're hiring!
"We found Prodigal while looking for solutions to reinforce our call recording safeguards and further protect our customers’ personal information.
We looked at multiple vendors but ultimately trusted Prodigal and their industry-trained AI models to get the job done. They were great to work with and further customized their outputs to meet our specific needs. We’d recommend them to any team looking to effectively protect their consumer data and strengthen compliance." -VP of InfoSec, Policygenius