Run KMeans with VidarDB

VidarDB natively supports AI, which is built on MADlib allowing you to perfrom various commonly-used AI algorithms in SQL easily. This page gives you a good example of using KMeans clustering algorithm to train data in VidarDB. It assumes VidarDB is already installed. If not, please head over to the Run VidarDB with Docker section.

Preparation

Make sure psql is installed on your computer. If not, for Debian users:

sudo apt-get install postgresql-client

Clone and go into the VidarDB's util repository:

git clone https://github.com/vidardb/util.git && cd util

Import Chicago Taxi Trips data:

cd ./util-notebook/kmeans/ && psql -h 127.0.0.1 -p 5432 -U postgres -f ./import-data_chicago_taxi_trips.sql

This will also create a new database called chicago_taxi_trips.

Then, create a new table chicago_taxi_trips_change for the data training:

-- connect to the local vidardb
psql -h 127.0.0.1 -p 5432 -U postgres

-- connect to the database
\c chicago_taxi_trips postgres;

-- create new table
DROP TABLE IF EXISTS chicago_taxi_trips_change;

-- double precision array for (pickup_latitude,pickup_longitude)
CREATE TABLE chicago_taxi_trips_change
(
    row_id SERIAL,
    taxi_id INT,
    pickup_latitude DECIMAL(10, 2),
    pickup_longitude DECIMAL(10, 2),
    row_vec DOUBLE PRECISION[]
);

-- insert data
INSERT INTO chicago_taxi_trips_change (taxi_id,pickup_latitude,pickup_longitude, row_vec)
SELECT taxi_id,
       pickup_latitude,
       pickup_longitude,
       array_cat(array[pickup_latitude], array[pickup_longitude])
FROM chicago_taxi_trips;

Now, the preparation is done. Let's move to the training part.

Data Training

This section is very simple and straightforward. Before training data, we need to enable AI support in chicago_taxi_trips database, assuming username is postgres without password. We can issue the following command in the shell:

docker exec -it vidardb sh -c "install-madlib.sh -U postgres -D chicago_taxi_trips"

Try the following SQL commands to train data with KMeans:

-- connect to the local vidardb
psql -h 127.0.0.1 -p 5432 -U postgres

-- connect to the database
\c chicago_taxi_trips postgres;

-- create new table
DROP TABLE IF EXISTS km_result;

-- run kmeans algorithm
CREATE TABLE km_result AS
SELECT * FROM madlib.kmeanspp(
    'chicago_taxi_trips_change',   -- Table of source data
    'row_vec',                     -- Column containing point co-ordinates 
    5,                             -- Number of centroids to calculate
    'madlib.squared_dist_norm2',   -- Distance function
    'madlib.avg',                  -- Aggregate function
    20,                            -- Number of iterations
    0.001                          -- Fraction of centroids reassigned to keep iterating 
    );

Perform Clustering Predictions

Predict the cluster_id of each trip with the model we have just trained:

SELECT trips_data.*, (madlib.closest_column(centroids, row_vec)).column_id AS cluster_id
    FROM chicago_taxi_trips_change AS trips_data, km_result
    ORDER BY trips_data.row_id DESC LIMIT 10;

We can get some results like this:

 row_id | taxi_id | pickup_latitude | pickup_longitude |           row_vec            | cluster_id 
--------+---------+-----------------+------------------+------------------------------+------------
    999 |    7040 |           41.88 |           -87.63 | {41.880994471,-87.632746489} |          1
    998 |    7145 |           41.88 |           -87.64 | {41.879255084,-87.642648998} |          1
    997 |    7864 |           41.79 |           -87.75 | {41.785998518,-87.750934289} |          3
    996 |    6620 |           41.90 |           -87.62 | {41.89503345,-87.619710672}  |          0
    995 |     393 |           41.89 |           -87.63 | {41.892072635,-87.628874157} |          0
    994 |    1082 |           41.89 |           -87.62 | {41.890922026,-87.618868355} |          0
    993 |      55 |           41.88 |           -87.63 | {41.880994471,-87.632746489} |          1
    992 |    7749 |           41.89 |           -87.63 | {41.892042136,-87.63186395}  |          0
    991 |    7564 |           41.88 |           -87.62 | {41.884987192,-87.620992913} |          0
    990 |    7065 |           41.89 |           -87.63 | {41.892507781,-87.626214906} |          0

Reference

The data this example used come from Taxi Trips | City of Chicago | Data Portal.

(More examples will come soon!)