Posted on 21 November 2024
in events
Lecture: Classifying unstructured texts into 1800 categories!
Problem: In this presentation, I will examine the development of a text classifier built by the team at CompanyHouse AG. The classifier assigns unstructured texts that describe companies’ activities to the official German industry codes, WZ 2008. Over the years, we have experimented with various techniques to handle classification across a very large number of categories (1,800 in total). I will discuss the strategies we used to tackle this complexity and show how our model evolved from a random forest classifier to a solution based on large language models and retrieval-augmented generation (RAG).
Methodology: We treat the task as multiclass classification and draw on random forest classifiers, similarity algorithms, embedding techniques, vector databases, and retrieval-augmented generation (RAG).
Conclusions: Integrating additional knowledge into models using retrieval-augmented generation combined with similarity algorithms and techniques such as chain-of-thought reasoning can effectively address complex multiclass classification problems. This approach achieves high evaluation scores and outperforms pre-trained classifiers.
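To make the retrieval idea concrete, here is a minimal sketch of how similarity search can narrow a large label set down to a few candidates before a language model makes the final choice. It is not the CompanyHouse AG implementation: the category codes and descriptions are hypothetical placeholders (not real WZ 2008 entries), the bag-of-words `embed` function stands in for a real embedding model, the in-memory dictionary stands in for a vector database, and the function names `retrieve_candidates` and `build_prompt` are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical category descriptions. The codes and texts are placeholders,
# not real WZ 2008 entries.
CATEGORIES = {
    "A1": "development of software and programming services",
    "B2": "retail sale of food and beverages in stores",
    "C3": "construction of residential buildings",
    "D4": "management consulting and business advice",
}


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(text, k=2):
    """Return the k category codes whose descriptions are most similar to text."""
    query = embed(text)
    scored = [(cosine(query, embed(desc)), code) for code, desc in CATEGORIES.items()]
    # Highest similarity first; ties are broken by code for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:k]]


def build_prompt(text, candidates):
    """Assemble a prompt that restricts the model to the retrieved candidates."""
    options = "\n".join(f"- {code}: {CATEGORIES[code]}" for code in candidates)
    return (
        "Classify the company description into exactly one of these categories.\n"
        f"{options}\n\n"
        f"Description: {text}\n"
        "Think step by step, then answer with the category code."
    )


description = "We build custom software and offer programming services"
candidates = retrieve_candidates(description, k=2)
prompt = build_prompt(description, candidates)
print(prompt)
```

The point of the design is that the language model never sees all 1,800 categories at once. It only has to choose among the few that the similarity search retrieved, which keeps the prompt short and the decision tractable.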
embedding, few shot, langchain, LLM, NLP, Python, text similarity