How to build your own Dataset for a supervised learning project

Gean Matos
Share! by Ateliê de Software
7 min read · Apr 5, 2023


The idea that artificial intelligence is an almost indispensable tool for data analysis is already commonplace, as are guides on how to build an A.I., machine learning, or deep learning model. But what about how to construct your own dataset for that model?

When working on commercial and academic projects, we rarely have a ready-to-go dataset like the ones found in tutorials or study materials. In reality, the most common situation is to find ourselves with raw, scattered data that needs to be shaped into a single dataset.

Although this might seem like a herculean task, the process can be summarized in three steps: formatting, data cleaning, and feature extraction. But before we get to those steps, it is important to understand what a dataset is and why we need one.

Machine learning projects are deeply connected to some type of data, because it is impossible for a model to learn anything without access to data. As a consequence, no matter how well your model is built, if your data is not equally well prepared the final product will not meet expectations.

With that connection in mind, we can look at the structure of a dataset for a supervised model, which is split into two main parts.

  1. Training set: generally 60% or more of the dataset, this split is responsible for shaping the algorithm’s understanding of the environment you want answers about. Part of it is sometimes set aside for validation, which is used to adapt the model to the data by tuning hyperparameters.
  2. Test set: the remainder of the dataset, reserved for evaluating the model’s performance and quality. It is very important not to build it from part of the training set, because that overlap can hide overfitting.

Overfitting is a common problem in A.I. development: the model is trained on such a narrow environment that it can only recognize that specific environment, which produces a misleading result where accuracy numbers look great but performance on external data is poor.
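As a minimal, hedged sketch of how this split is usually made in practice, scikit-learn’s train_test_split can separate the data into training and test portions. The file and column names below are hypothetical, used only for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: feature columns plus a "label" column.
data = pd.read_csv("dataset.csv")
X = data.drop(columns=["label"])
y = data["label"]

# Keep 60% for training and reserve 40% for testing, as suggested above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)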

With that out of the way, we can move on to the steps of building our own dataset.

Formatting

In this step it is important to know where the data we need lives. If it is scattered, we need to gather it into a single file, then analyze the format it is in and decide which one we will use for the model. The most common format for text data is CSV, a simple text file with values separated by commas. After gathering the data, analyzing its format, and selecting the best one for our case, we apply the necessary changes, which vary from project to project.
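As a small sketch of what that gathering might look like, assuming the data is spread across several CSV files that share the same columns (the file names here are hypothetical):

import glob
import pandas as pd

# Hypothetical: every exported file follows the pattern "export_*.csv".
parts = [pd.read_csv(path) for path in sorted(glob.glob("export_*.csv"))]

# Concatenate everything into a single dataset and save it as one CSV file.
combined = pd.concat(parts, ignore_index=True)
combined.to_csv("raw_data.csv", index=False)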

Data Cleaning

Once the data is gathered and converted to the chosen format, we have to prepare it in a way the A.I. can easily understand: cleaning the data.

This consists of removing all the data that misrepresents the target environment, such as duplicated values, null values, and outliers (values that do not correctly represent the environment).

To execute this step effectively, a good understanding of the data is required in order to identify potential outliers and other problematic records that could be harmful.
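One possible, hedged sketch of this kind of cleaning on a numeric column is to drop duplicates and nulls and then filter outliers with the interquartile range; the file and column names are hypothetical.

import pandas as pd

data = pd.read_csv("raw_data.csv")

# Remove exact duplicates and rows with missing values.
data = data.drop_duplicates().dropna()

# Filter outliers in a hypothetical numeric column ("value") using the IQR rule.
q1, q3 = data["value"].quantile([0.25, 0.75])
iqr = q3 - q1
data = data[data["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]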

Feature extraction

At this point the dataset can already be considered usable, but some unnecessary features may still be present, so it is worth analyzing it and applying feature extraction to guarantee that the dataset covers only the necessary data, making it more precise.

An example of feature extraction would be a school scenario where we want to build an algorithm that analyzes students’ grades and classifies them as approved or not. The important features are those that identify a student and their grades; however, the dataset may also contain information such as the student’s class, the room where they attend lessons, and the current semester. That information may be useful later on, but not for the machine learning task, so it makes sense to remove it from the dataset, making it even more straightforward for training.

As with the previous step, this extraction depends on a good understanding of the data and of the desired behavior of the machine learning algorithm, in order to determine which features are unnecessary and should be removed.
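A minimal sketch of that school scenario, with purely hypothetical file and column names, could look like this:

import pandas as pd

# Hypothetical student file with identification, grades, and extra context columns.
students = pd.read_csv("students.csv")

# Keep only what identifies the student and their grades; class, room and
# semester are not needed for the approved/not-approved classification.
students = students.drop(columns=["class", "room", "semester"])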

Now that we know what the techniques are and their general purpose, how do we apply them?

As mentioned earlier, each step depends on the data and on the objective, so to make things clearer we will walk through a generic example that demonstrates, within a minimal scope, how to apply them.

Example

Consider a classification model where we want to know the subject covered in an Ateliê post based only on its title. To do this, we first need to extract the data.

Since each post is available at its own link, we have to somehow extract the desired information. This could be done manually, but it would take a large amount of time and be extremely tedious. As an alternative to manual extraction, a simple text-extracting bot built in Python, fetching the header information of every post, did the job.
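The original bot is not shown here; as a rough, hypothetical sketch of that kind of extraction (grabbing only the title, for brevity), assuming a list of post URLs and that the title lives in the page’s first h1 tag, it could look something like this:

import requests
from bs4 import BeautifulSoup

# Hypothetical list of post URLs collected beforehand.
urls = ["https://medium.com/some-publication/some-post"]

titles = []
for url in urls:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed selector: the real page structure may differ.
    titles.append(soup.find("h1").get_text(strip=True))

print(titles)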

After the extraction, we have a CSV file containing the author’s name, title, subject and publication date.

import pandas as pd

# The raw file has no header yet, so tell pandas not to treat the first row as one.
dataset = pd.read_csv("dataset.csv", header=None)
dataset.head()

With the raw data in hand, let’s apply the three steps discussed:

Formatting:

In this case, the only step necessary to complete the formatting is the insertion of a header, which allows a better analysis of the data. Since we already have a file in the desired format (CSV) with all the data unified, all that is left is to identify the information by including a header in the file.

dataset.to_csv("dataset2.csv", 
header=['autor', 'titulo', 'assunto', 'data'], index=False)

dataset2 = pd.read_csv("dataset2.csv")
dataset2.head()

Data Cleaning:

In this step we will remove duplicates, incomplete data and data elements that are not interesting for our learning model.

As we are working with a text base, we will use regex tools to remove special characters and text snippets with no value for the model.

The first step, and usually the simplest, is to remove nulls and duplicates. In the case of our example, pandas offers specific functions for this type of operation.

# Drop exact duplicates and rows with missing values.
dataset2 = dataset2.drop_duplicates().dropna()
dataset2.head()

Now we can proceed with processing the data itself. An important point when dealing with text is stop words: words that carry no relevance for the machine and should be removed to produce text the model can work with more easily. Some libraries offer built-in stop word lists.
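For instance, NLTK ships a Portuguese stop word list that could be used instead of a hand-written one, assuming NLTK is installed and its corpus has been downloaded:

import nltk
from nltk.corpus import stopwords

# Download the stop word corpus once, then load the Portuguese list.
nltk.download("stopwords")
stopWords = stopwords.words("portuguese")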

Here we will use a simple, small version, just as an example. After defining the stop words, we will use a regex function to remove numbers, accents, and punctuation from the text, and convert everything to lowercase.

import re
import string
from unidecode import unidecode

# Small illustrative stop word list (Portuguese).
stopWords = ["a", "o", "e", "na", "no", "em",
             "da", "de", "para", "com", "sao", "um", "uma"]

# Strip accents, remove digits, punctuation and stray symbols, and lowercase everything.
pattern = r'[0-9' + re.escape(string.punctuation + 'ˆº') + r']'
dataset2['titulo'] = dataset2['titulo'].apply(
    lambda x: " ".join(re.sub(pattern, '', unidecode(x)).lower().split()))

# Drop the stop words from each title.
dataset2['titulo'] = dataset2['titulo'].apply(
    lambda x: " ".join(word for word in x.split() if word not in stopWords))
dataset2.head()

Feature extraction:

Finally, let’s move on to feature extraction. In this case, since we want to classify the posts based on the title alone, there is no need for the author and date fields, and their presence would have a negative impact on the ML model, so it is important to remove these columns.

# Drop the columns that are irrelevant to the title-based classification.
dataset2 = dataset2.drop(columns=['data', 'autor'])
dataset2.head()
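At this point the dataset contains only the title and the subject label. As an optional final step (the file name below is just a suggestion), it can be saved for the training stage:

# Save the cleaned, trimmed dataset for later use in training.
dataset2.to_csv("dataset_final.csv", index=False)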

This was a simple example of how to shape a dataset for a supervised machine learning application. I hope it helps you shape your own.

If you have any questions, feel free to ask in the comments; we are always happy to help.

Follow our blog on Medium and stay on top of everything that happens at Ateliê.

Say hello to us! And join us on social networks:
E-mail: contato@atelie.software
