Brief introduction to Alibaba’s Alink: A Flink-based ML Platform and Text Sentiment Analysis using Logistic Regression

What is Alink?

Alink is a machine learning algorithm platform based on Flink, developed by the PAI team of Alibaba's computing platform.

Alink contains a variety of ready-to-use ML algorithms and can even compete with Python-based frameworks and libraries in the number of features it offers.

Alink Algorithms (source: https://github.com/alibaba/Alink/blob/master/README.en-US.md)

Some of the features are also implemented in Apache Flink ML 2.0 (watch my YouTube video on Flink ML 2.0) but, to be honest, Alink looks much more comprehensive and solid. As far as I know, Alink served as the foundation (at least in part) for Flink ML 2.0.

Alink looks extremely interesting for plenty of AI-oriented tasks. Unfortunately, almost all of Alink’s documentation and discussions are in Chinese, so it was a bit challenging to get through the available information.

In this article, we will walk through a short project that uses Alink to predict the sentiment of given movie reviews.

But before we begin, let me list a few of Alink's features that got me excited:

  1. Alink makes full use of Flink's capabilities: it is a scalable, high-performance real-time streaming/batch platform built on the time-proven Flink backend
  2. Alink uses handy tooling that allows you to build ML pipelines using a set of simple abstractions: imputers, vectorizers, partitioners, and others
  3. Alink has a plethora of pre-defined ML algorithms that work in a distributed Flink fashion
  4. Alink can serialize and deserialize trained models, making it possible to separate model training and inference pipelines!
  5. Alink has a Python API and is available via Jupyter Notebooks.

Did I get you interested? 🙂

Step-by-step plan

In the article we will:

  1. Find an appropriate dataset
  2. Split the dataset into training / validation / testing samples
  3. Load the training part in a batch manner and apply some data pre-processing
  4. Train the model using the training dataset
  5. Validate the model using the validation dataset
  6. Serialize the trained model to disk
  7. Deserialize the model and make the inference

Dataset

Kaggle helped me find an IMDB Sentiment Analysis Dataset in CSV format. It is already split into training, validation, and testing sets. Each CSV contains 'text' and 'label' columns: 'text' holds a movie review, while 'label' takes two values, '0' for a negative sentiment and '1' for a positive one.
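To make the format concrete, here is a purely illustrative sketch of the layout (these are not real rows from the dataset, just the shape of the file):

text,label
"I grew up watching this film and it still holds up wonderfully...",1
"Two hours of my life I will never get back.",0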

Dataset sample

Maven Project

Let's create a new Maven project with the following structure:
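Something along these lines (the project name and exact layout are mine; arrange it however you like):

alink-sentiment/
├── pom.xml
└── src/main
    ├── java
    │   ├── ModelTraining.java    // training, validation, serialization
    │   └── ModelLoading.java     // deserialization and inference
    └── resources
        ├── train.csv
        ├── valid.csv
        └── test.csv              // not used in this walkthrough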

And add some dependencies:

Dependencies
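For reference, the key dependency is the Alink core artifact built for your Flink and Scala versions, plus the matching Flink runtime modules. The coordinates below are an assumption on my side; check the Alink README for the exact artifact name and version that match your setup:

<!-- Alink core: the artifact name encodes the Flink and Scala versions; the version below is illustrative -->
<dependency>
    <groupId>com.alibaba.alink</groupId>
    <artifactId>alink_core_flink-1.13_2.11</artifactId>
    <version>1.6.1</version>
</dependency>

<!-- Plus the matching Flink modules for the same Flink version:
     flink-streaming-scala, flink-table-planner, and flink-clients -->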

I've put the datasets in the resources folder.

Now create two similar classes named ModelLoading and ModelTraining. Both should contain a main method (I wanted to make them separately runnable). The ModelTraining class will contain the training and serialization logic (along with the validation), while the ModelLoading class will contain deserialization and inference.

public class ModelLoading {
    public static void main(String[] args) throws Exception {
        // deserialization and inference will go here
    }
}
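Both classes rely on Alink's batch source and pipeline classes. For reference, the imports in ModelTraining end up looking roughly like this (the package paths are my assumption based on the Alink 1.x layout; let your IDE resolve the exact ones):

import com.alibaba.alink.operator.batch.source.CsvSourceBatchOp;
import com.alibaba.alink.pipeline.Pipeline;
import com.alibaba.alink.pipeline.PipelineModel;
import com.alibaba.alink.pipeline.classification.LogisticRegression;
import com.alibaba.alink.pipeline.dataproc.Imputer;
import com.alibaba.alink.pipeline.nlp.DocCountVectorizer;
import com.alibaba.alink.pipeline.nlp.Segment;
import com.alibaba.alink.pipeline.nlp.StopWordsRemover;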

Let's start with model training. Open ModelTraining and add the following code to its main method:

CsvSourceBatchOp trainingSet = new CsvSourceBatchOp()
                .setFilePath("src/main/resources/train.csv")
                .setSchemaStr("text string, label int")
                .setIgnoreFirstLine(true);

We have defined the training source, provided its schema, and told the source to skip the header line.

Now do the same for the validation dataset:

CsvSourceBatchOp validationSet = new CsvSourceBatchOp()
                .setFilePath("src/main/resources/valid.csv")
                .setSchemaStr("text string, label int")
                .setIgnoreFirstLine(true);

Add this line to print the top five rows from the training set:

trainingSet.firstN(5).print();
Loaded training set

Now create a model training pipeline:

Pipeline pipeline = new Pipeline(
                new Imputer()
                        .setSelectedCols("text")
                        .setOutputCols("featureText")
                        .setStrategy("value")
                        .setFillValue("null"),
                new Segment()
                        .setSelectedCol("featureText"),
                new StopWordsRemover()
                        .setSelectedCol("featureText"),
                new DocCountVectorizer()
                        .setFeatureType("TF")
                        .setSelectedCol("featureText")
                        .setOutputCol("featureVector"),
                new LogisticRegression()
                        .setVectorCol("featureVector")
                        .setLabelCol("label")
                        .setPredictionCol("pred")
        );

The Pipeline defines how the dataset is processed during model training: it chains several stages, each feeding its output into the next:

  • The Imputer – reads the text column of the training dataset and copies its values to the featureText column, replacing missing values with the fill value we specified ("null").
  • The Segment – decomposes reviews into words separated by spaces.
  • The StopWordsRemover – removes stop words (frequent words that carry little meaning) from the decomposed reviews.
  • The DocCountVectorizer – builds a term-frequency (TF) feature vector for each review from the remaining words and writes it to the featureVector column.
  • The LogisticRegression – trains a binary classification model on the feature vectors and writes predictions to the pred column.

Now we fit (train) the model and then run it on the first 100 rows of the validation dataset:

PipelineModel model = pipeline.fit(trainingSet);

model.transform(validationSet.firstN(100))
                .select(new String[]{"text", "label", "pred"})
                .firstN(5)
                .print();

It will take a while since our dataset is quite big:

Model training

Now let’s check if the model made accurate predictions:

Results

We have five results to analyze. The first one was originally labeled as negative (0), and our prediction is correct:

First prediction

The second was labeled as positive (1), and the prediction is correct once again:

Second prediction

The third one is also a win!

Third result

The fourth is a huge miss:

Fourth result

And the fifth is totally correct:

Fifth prediction

It seems that the model provides quite accurate results.
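Eyeballing five predictions is not a rigorous validation, of course. If you want an actual number, Alink ships evaluation operators for binary classification. The sketch below is my assumption of how that could look; it requires adding .setPredictionDetailCol("pred_detail") to the LogisticRegression stage so the evaluator gets class probabilities, and the operator names should be checked against your Alink version:

// Sketch: evaluate the model on the whole validation set (not part of the original walkthrough)
BinaryClassMetrics metrics = new EvalBinaryClassBatchOp()
        .setLabelCol("label")
        .setPredictionDetailCol("pred_detail")
        .linkFrom(model.transform(validationSet))
        .collectMetrics();

System.out.println("Accuracy: " + metrics.getAccuracy());
System.out.println("AUC: " + metrics.getAuc());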

Model De/Serialization

A fantastic feature of Alink is its ability to serialize and deserialize trained models:

model.save("src/main/resources/model");
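One caveat worth checking against your Alink version (this is an assumption on my side, not something from the article's original code): since saving is itself a batch operation, you may need to trigger the Flink job explicitly before the files appear on disk:

model.save("src/main/resources/model");
// Assumption: save() only registers a sink; execute() runs the batch job that writes the files
BatchOperator.execute();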

The serialized representation resembles a mix of a JSON schema and a CSV file that contains vectors for every word occurring in the reviews:

Serialized model

You can deserialize the model into an object and call transform on it:

PipelineModel modelDes = PipelineModel.load("src/main/resources/model");

modelDes.transform(validationSet.firstN(100))
        .select(new String[]{"text", "label", "pred"})
        .firstN(5)
        .print();

The inference results will be the same as with the freshly trained model. This exciting feature allows you to separate model training from model inference, which simplifies the MLOps lifecycle.
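For completeness, here is roughly what the whole ModelLoading class looks like once the pieces are in place (imports omitted; they mirror the ones used in ModelTraining):

public class ModelLoading {
    public static void main(String[] args) throws Exception {
        // The inference pipeline re-creates its own source and is fully
        // independent of the training code
        CsvSourceBatchOp validationSet = new CsvSourceBatchOp()
                .setFilePath("src/main/resources/valid.csv")
                .setSchemaStr("text string, label int")
                .setIgnoreFirstLine(true);

        // Load the serialized model and run inference on the first 100 rows
        PipelineModel model = PipelineModel.load("src/main/resources/model");
        model.transform(validationSet.firstN(100))
                .select(new String[]{"text", "label", "pred"})
                .firstN(5)
                .print();
    }
}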

For those who prefer Python

Good news! Alink also offers a Python API, and you can work with Alink from Jupyter notebooks:

via https://alinklab.cn/tutorial/book_python_00_code_help.html

Conclusions

Nowadays, ML has received a strong impetus for development. Almost every industry uses machine learning models for predictions, scoring, fraud detection, and more.

Machine learning is closely connected to big data and analytics: the more data you process, the more accurate a model you obtain. There aren't many time-tested solutions that can process data at scale while also providing strong algorithmic and machine learning support. Alink is one of these Swiss-army-knife tools: it allows you to process massive amounts of data on large distributed clusters and use that data to build ML pipelines.

Source code

Don't like reading articles? Watch the video!