Problem Statement

Data annotation is essential for building supervised machine learning algorithms and for assessing their quality and performance.

Semi-supervised approaches are becoming increasingly popular, but they require considerably more data than supervised learning techniques. Similarly, for low-resource problems, where large amounts of data are not available, data annotation can be useful. One such case is Urdu speech recognition (Urdu speech-to-text): Urdu speech data is not readily available in large quantities, and labeled or annotated data even less so. Hence, even if a system can be developed using unsupervised techniques, annotated data is still required to test it.

Proposed Solution

We have developed an in-house solution to annotate data for our algorithms. Whether the data is for our automatic speech recognition (ASR) system, our aspect-based sentiment analysis (ABSA) system, or our crop classification system, our team of expert data annotators is capable of annotating text and images. In particular, our dedicated annotators can type in Urdu, which makes data preparation easy and straightforward. With annotation, our goals are as follows:

Guidelines

We ensure that the above goals are met by following the guidelines below:
  • Determine the annotation task:
    The first step in data annotation is to determine what information needs to be added to the dataset.
  • Define the annotation guidelines:
    Once the annotation task has been defined, clear guidelines need to be established for annotators to follow. These guidelines must be specific and unambiguous, and must cover all scenarios that may arise during the annotation process.
  • Select and train annotators:
    The next step is to select and train a team of annotators. Depending on the complexity of the annotation task, annotators may need to have specialized knowledge or skills. They must be trained on the annotation guidelines and the tools they will use for the task, especially if the task involves understanding specific aspects like the sentiments of a review.

Figure: Data annotation checklist

Case Study: Annotation of Urdu Audio for an Urdu Speech-to-Text System

We identified that, to annotate the audio, annotators must be trained in typing Urdu. This reduces errors and speeds up annotation. We also identified that, to ensure consistency, each audio file needs to be annotated by at least two annotators. To conform to standards, we used accepted formats for storing the text associated with each audio file. The details of the procedure are given below.

  • To ensure quality, we do not simply hand annotators raw audio and ask them to transcribe it by listening. Instead, we pre-process the recordings: long audio conversations are split into individual sentences using machine-learning-based diarization, which produces Rich Transcription Time Marked (RTTM) files.
  • Using these diarized audio segments and an automatic speech recognition (ASR) system, a baseline transcription is generated. This text is used to generate an ELAN Annotation Format (EAF) file, which is shared with the annotators. (Since the ASR system is not yet well trained, the transcription will contain errors.)
  • Data Annotation (Manual): The annotators listen to the audio files and manually correct the transcription provided in the associated ELAN files.
  • Data Validation (Manual): To ensure quality, each file is annotated by two annotators, and files whose annotations differ are checked by the annotation manager.
  • Vocabulary Generation: The validated data is used to build the vocabulary of unique words used by the ASR system.
Figure: Data annotation process
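
Several of the steps above are mechanical enough to illustrate in code. The following is a minimal Python sketch under stated assumptions, not the actual pipeline: the function names, the use of segment ids as dictionary keys, and the whitespace-only normalization are all hypothetical choices made for the example. It reads SPEAKER lines in the standard RTTM field layout, flags segments where two annotators' transcriptions differ so they can be sent for manual review, and counts the unique words in validated transcripts.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    file_id: str
    start: float     # turn onset, in seconds
    duration: float  # turn duration, in seconds
    speaker: str


def parse_rttm(text: str) -> list[Segment]:
    """Parse the SPEAKER lines of an RTTM file into segments."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Standard RTTM layout: type, file id, channel, onset, duration,
        # <NA>, <NA>, speaker name, <NA>, <NA>
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        segments.append(
            Segment(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return segments


def normalize(transcript: str) -> str:
    """Collapse whitespace so spacing differences are not flagged (assumption)."""
    return " ".join(transcript.split())


def find_disagreements(first: dict[str, str], second: dict[str, str]) -> list[str]:
    """Return ids of segments whose two transcriptions differ.

    A segment transcribed by only one annotator is also flagged.
    """
    ids = sorted(set(first) | set(second))
    return [
        seg_id
        for seg_id in ids
        if normalize(first.get(seg_id, "")) != normalize(second.get(seg_id, ""))
    ]


def build_vocabulary(transcripts: list[str]) -> Counter:
    """Count whitespace-separated words across validated transcripts."""
    counts: Counter = Counter()
    for transcript in transcripts:
        counts.update(transcript.split())
    return counts


if __name__ == "__main__":
    rttm = "SPEAKER call01 1 0.50 2.25 <NA> <NA> spk1 <NA> <NA>\n"
    print(parse_rttm(rttm))

    annotator_a = {"s1": "یہ ایک مثال ہے", "s2": "شکریہ"}
    annotator_b = {"s1": "یہ  ایک مثال ہے", "s2": "شکریا"}
    # Only s2 differs once extra spaces are ignored.
    print(find_disagreements(annotator_a, annotator_b))

    print(build_vocabulary(["یہ ایک مثال ہے", "یہ ٹھیک ہے"]).most_common(2))
```

Exact string comparison after whitespace normalization is a deliberately strict choice for this sketch: any spelling difference sends the segment to review. A real workflow might instead use a word-level edit distance with a threshold, and would read the transcriptions from the EAF files rather than from in-memory dictionaries.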

Ready to leverage the power of precise data annotation in your AI-driven solutions?

Contact us today to explore how our specialized annotation services can empower your business growth.