Current projects


We have created this separate webpage to index the latest machine learning tools and techniques.

Click here and have fun!


A framework to automatically filter spam comments posted on YouTube.

For more information, click here.


A tool for opinion detection in English messages, powered by ensemble methods and state-of-the-art natural language processing techniques.

For more information, click here.

Twitter Search

This tool was designed to assist you in creating datasets with text samples extracted from Twitter. Select one subject and collect your data!

For more information, click here.


This tool is intended to help you label your machine learning datasets. You can upload your dataset as a CSV file and add collaborators (a.k.a. grad students) to help label your samples.

For more information, click here.

Text Normalization and Expansion

Short text messages (e.g., posts in blogs, forums, and social networks) are a challenging problem for traditional learning methods: they are usually very short and rife with slang, idioms, symbols, and acronyms that make even tokenization difficult. To address this, we have designed the TextExpansion tool, which normalizes and expands the original short, messy text messages in order to obtain better attributes and improve classification and clustering performance.

The proposed approach is based on lexicographic and semantic dictionaries, along with state-of-the-art techniques for semantic analysis and context detection. It normalizes terms and creates new attributes, changing and expanding the original text samples to alleviate factors that can degrade an algorithm's performance, such as redundancies and inconsistencies.
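As a rough illustration of the normalize-then-expand idea, the sketch below maps noisy tokens to canonical forms and then appends related terms as extra attributes. The two dictionaries, the function names, and the sample message are hypothetical placeholders; this is not the TextExpansion implementation, which relies on much richer dictionaries and on context detection.

```python
import re

# Hypothetical toy dictionaries, made up for this sketch. The real tool uses
# lexicographic and semantic dictionaries that are not reproduced here.
NORMALIZATION = {"u": "you", "gr8": "great", "2nite": "tonight", "pls": "please"}
CONCEPTS = {"great": ["good", "excellent"], "tonight": ["evening"]}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def normalize(tokens):
    """Replace slang and abbreviations with their canonical form, if known."""
    return [NORMALIZATION.get(token, token) for token in tokens]


def expand(tokens):
    """Keep every token and append its related terms as extra attributes."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(CONCEPTS.get(token, []))
    return expanded


def normalize_and_expand(text):
    """Run the whole toy pipeline on one short message."""
    return expand(normalize(tokenize(text)))


if __name__ == "__main__":
    print(normalize_and_expand("C u 2nite, gr8!"))
```

With these toy dictionaries, the four-token message becomes a seven-token attribute list, which gives a classifier more to work with than the original abbreviations.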

For more information, click here.

PVis – Partitions’ Visualizer

Recent advances in cluster analysis highlight the importance of finding multiple meaningful partitions and point to the need for approaches to evaluate them. They also suggest that the evaluation should take into account the knowledge of a domain expert. In this scenario, we present a visualization method, called PVis (Partitions’ Visualizer), that provides an integrated visualization of a collection of partitions. PVis makes it possible to compare the contents of a set of partitions, and the comparison can be made with respect to prior knowledge provided by an expert. PVis can help domain experts performing cluster analysis discover relevant information.
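To make the idea of comparing partitions against expert knowledge concrete, the sketch below builds a contingency table between a partition and a reference labeling and computes the Rand index between them. This is a generic numerical comparison, not the PVis visualization itself; the function names and the tiny label lists are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def contingency(partition, reference):
    """Count how many objects fall in each (cluster, reference class) pair."""
    if len(partition) != len(reference):
        raise ValueError("both labelings must cover the same objects")
    return Counter(zip(partition, reference))


def rand_index(partition, reference):
    """Fraction of object pairs on which the two labelings agree.

    A pair agrees when both labelings put the two objects together,
    or both labelings keep them apart.
    """
    if len(partition) != len(reference):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(partition)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (partition[i] == partition[j]) == (reference[i] == reference[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    expert = ["a", "a", "b", "b"]   # hypothetical expert labeling
    candidate = [0, 0, 1, 1]        # hypothetical clustering result
    print(contingency(candidate, expert))
    print(rand_index(candidate, expert))
```

A score like this summarizes agreement in a single number, whereas a visual tool lets the expert see where the partitions differ.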

For more information, click here.

Past projects

SMS Spam Corpus

The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as legitimate (ham) or spam.
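A minimal sketch of reading such a labeled collection is shown below. It assumes one message per line, with the label (ham or spam) and the message text separated by a tab; that layout is an assumption here, so check the distributed file before relying on it. The sample lines are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def parse_labeled_messages(lines):
    """Parse lines of the assumed form '<label>\\t<message text>'."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        if label not in ("ham", "spam"):
            raise ValueError(f"unexpected label: {label!r}")
        messages.append((label, text))
    return messages


def label_counts(messages):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in messages)


if __name__ == "__main__":
    # Invented sample lines, not real corpus entries.
    sample = [
        "ham\tSee you at lunch?\n",
        "spam\tYou have won a prize, reply now\n",
        "ham\tRunning late, sorry\n",
    ]
    print(label_counts(parse_labeled_messages(sample)))
```

To read a real file, pass an open file object (for example `open(path, encoding="utf-8")`) in place of the sample list.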

For more information, click here.

E-mail spam filtering

We have designed a new spam filter, and the results achieved indicate that it is superior to the best techniques currently available. Moreover, we provide a comprehensive performance evaluation of term selection techniques with different spam filters.

For more information, click here.

E-Tongue Sugar Collections v.1

The E-Tongue Sugar Collections v.1 are public sets of labeled sugar samples collected with an electronic tongue for automatically assessing sugar quality. They comprise two datasets: one with 190 samples in their natural form and another with 185 sugar samples with controlled pH.

For more information, click here.