Controlled Vocabulary Tool

A multi-threaded Python CLI tool to create a controlled vocabulary.


I was inspired to start this project while reading Data+Design by Trina Chiasson and Dyanna Gregory (and over 50 global contributors). Chapter 4 contains a section named Controlling for Inconsistencies. After reading it, I figured it was a good, simple problem that could benefit from a multi-threaded solution. I decided to work in Python for quick and easy development.

The main idea of the project was to create a CLI tool that, using multiple threads, converts a dataset into a new dataset with a controlled vocabulary (see image below). The threads exploit the fact that it takes the user a few seconds to supply a mapping for a value. During this time the tool can read input, convert values for which it already has a mapping, and output mapped values.

Illustration of controlled vocabulary taken from Data+Design.
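
To make the idea concrete, here is a minimal single-threaded sketch of the conversion, not the tool's actual code. The function name `convert`, the `ask` callback, and the sample city names are all hypothetical; in an interactive run `ask` would wrap `input`, while here canned answers stand in for the user.

```python
def convert(values, mapping, ask):
    """Rewrite raw values using a controlled vocabulary.

    `mapping` is updated in place, so each distinct raw value
    is asked about at most once.
    """
    result = []
    for raw in values:
        if raw not in mapping:
            # Unknown value: ask for its canonical form and remember it.
            mapping[raw] = ask(raw)
        result.append(mapping[raw])
    return result


# Canned answers replace the interactive prompt (hypothetical sample data).
answers = {
    "NY": "New York",
    "new york": "New York",
    "N.Y.": "New York",
    "LA": "Los Angeles",
}
raw_values = ["NY", "new york", "LA", "N.Y.", "NY"]
mapping = {}
converted = convert(raw_values, mapping, lambda raw: answers[raw])
print(converted)
```

Every spelling variant ends up as one canonical value, and the second "NY" is converted without asking again because its mapping is already known.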

With this goal in mind, I first drafted a single-threaded script that read input from a file, prompted the user for a conversion, and displayed the resulting dataset in the console. The next step was to split this work across three threads: one to read, one to convert, and one to output. To let the threads communicate with each other, I created a few global variables guarded by mutexes that coordinate access between the threads (see the diagram below).

Diagram of global variable usage by different threads.
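
The three-thread layout can be sketched roughly as follows. This is an illustration under assumptions rather than the tool's real code: the variable names, the fixed `MAPPING` (used instead of prompting), and the `threading.Event` flags that signal completion are all hypothetical. The shared lists are module-level globals, each protected by a `threading.Lock`.

```python
import threading
import time

# Shared globals, each guarded by its own mutex (names are hypothetical).
raw_values = []            # filled by the reader, drained by the converter
raw_lock = threading.Lock()
converted_values = []      # filled by the converter, drained by the writer
converted_lock = threading.Lock()
reading_done = threading.Event()
converting_done = threading.Event()

# A fixed mapping stands in for the user prompt in this sketch.
MAPPING = {"NY": "New York", "N.Y.": "New York", "LA": "Los Angeles"}
output = []                # only the writer thread appends here


def reader(lines):
    for line in lines:
        with raw_lock:
            raw_values.append(line.strip())
    reading_done.set()


def converter():
    while True:
        # Check the flag before draining so no late value is missed.
        finished = reading_done.is_set()
        with raw_lock:
            batch = raw_values[:]
            raw_values.clear()
        if batch:
            with converted_lock:
                converted_values.extend(MAPPING.get(v, v) for v in batch)
        elif finished:
            break
        else:
            time.sleep(0.001)  # nothing to do yet, yield briefly
    converting_done.set()


def writer():
    while True:
        finished = converting_done.is_set()
        with converted_lock:
            batch = converted_values[:]
            converted_values.clear()
        if batch:
            output.extend(batch)
        elif finished:
            break
        else:
            time.sleep(0.001)


def run(lines):
    threads = [
        threading.Thread(target=reader, args=(lines,)),
        threading.Thread(target=converter),
        threading.Thread(target=writer),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output


result = run(["NY\n", "LA\n", "N.Y.\n", "Boston\n"])
print(result)
```

Each thread holds a lock only long enough to copy and clear a shared list, so the other threads are rarely blocked. Polling with a short sleep keeps the sketch small; a condition variable or `queue.Queue` would avoid the busy waiting.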

Finally, I separated the conversion thread into a conversion thread and a prompting thread, so the tool can convert newly read values while the user is being prompted for a mapping. I also added options to output to a file, fuse multiple input files into one output file, output the conversion map that was created, and use an existing mapping for conversion.
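
The split between conversion and prompting might look something like the sketch below. It is a simplified illustration, not the tool's implementation: it swaps the global variables for a `queue.Queue`, and the names `convert_with_prompter`, `pending`, and `ask` are hypothetical. The initial `mapping` argument plays the role of an existing mapping loaded before conversion starts.

```python
import queue
import threading


def convert_with_prompter(values, mapping, ask):
    """Convert known values immediately; hand unknown ones to a prompter."""
    results = [None] * len(values)   # slots keep the original order
    pending = queue.Queue()          # unknown values waiting for the user
    mapping_lock = threading.Lock()  # guards the shared mapping
    stop = object()                  # sentinel that ends the prompter

    def converter():
        for index, raw in enumerate(values):
            with mapping_lock:
                known = mapping.get(raw)
            if known is not None:
                results[index] = known
            else:
                pending.put((index, raw))
        pending.put(stop)

    def prompter():
        while True:
            item = pending.get()
            if item is stop:
                break
            index, raw = item
            # The value may have been mapped while it sat in the queue.
            with mapping_lock:
                known = mapping.get(raw)
            if known is None:
                known = ask(raw)  # slow step: waits for the user
                with mapping_lock:
                    mapping[raw] = known
            results[index] = known

    threads = [threading.Thread(target=converter),
               threading.Thread(target=prompter)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Canned answers stand in for the user; `asked` records each prompt.
answers = {"NY": "New York", "N.Y.": "New York"}
asked = []


def fake_ask(raw):
    asked.append(raw)
    return answers[raw]


existing = {"LA": "Los Angeles"}  # stands in for a previously saved mapping
converted = convert_with_prompter(
    ["NY", "LA", "NY", "N.Y.", "LA"], existing, fake_ask
)
print(converted)
print(asked)
```

Because the prompter rechecks the mapping before asking, each distinct value is prompted for only once, even if it was queued several times while the user was still answering.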