DB-EnginesextremeDB - solve IoT connectivity disruptionsEnglish
Deutsch
Knowledge Base of Relational and NoSQL Database Management Systemsprovided by Redgate Software

Blog > Post

Cloud-Based Analytics With SingleStoreDB

by  Akmal Chaudhri, SingleStore (sponsor) , 9 June 2022
Tags: SingleStore

This is a sponsored post.

The advent of cloud computing has provided the means to develop a wide range of new products and services. We can build and run applications and perform analytics in the cloud without installing local software. This article will look at two examples of modern systems that can seamlessly integrate with SingleStoreDB for in-cloud analytics: Apache Spark™️ and MindsDB.

The notebook file used in this article is available on GitHub.

Create Cloud Accounts

Create a SingleStore Cloud Account

First, we’ll create a free Cloud account on the SingleStore website. At the time of writing, the Cloud account from SingleStore comes with a $500 credit. We’ll create our cluster and note the host address and password as we’ll need this information later.

We’ll use the SQL Editor to create two new databases, as follows:

CREATE DATABASE IF NOT EXISTS iris_db;
CREATE DATABASE IF NOT EXISTS btc_db;

Create a MindsDB Cloud Account

Next, we’ll create a free Cloud account on the MindsDB website. We’ll note our login email address and password as we’ll need this information later.

Create a Deepnote Account

Finally, we’ll create a free account on the Deepnote website. Once logged in, we’ll create a Deepnote MindsDB integration, as shown below:

Edit MindsDB integration

  • Integration name: MindsDB
  • Hostname: cloud.mindsdb.com
  • Port: 3306
  • Username: TO DO
  • Password: TO DO
  • Database: mindsdb

The TO DO for Username and Password should be replaced with the login values for MindsDB Cloud. We’ll use mindsdb for the Database name.

We’ll now create a new Deepnote project which will give us a new notebook. We’ll add the MindsDB integration to the notebook.

In our Deepnote project, we’ll create two sub-directories, /csv and /jar, to store CSV data files and JAR files, respectively.

 

1. Apache Spark

Apache Spark is an open-source engine for analytics and data processing. Spark can utilise the resources of a cluster of machines to execute operations in parallel with fault tolerance. A great way to test Spark without installing it locally is by using Deepnote. Deepnote provides instructions for installing Spark and offers a notebook environment that allows Exploratory Data Analysis (EDA), all entirely cloud-based. This would be perfect for many use-cases.

Let’s look at how we could load some data into Deepnote, perform some data analysis and then write the data into SingleStoreDB.

After creating a project in Deepnote, we can use the SingleStoreDB Spark Connector — rather than using SQLAlchemy or JDBC directly.The SingleStoreDB Spark Connector provides many benefits, including:

  • Implements as a native Spark SQL plugin
  • Accelerates ingest from Spark via compression
  • Supports data loading and extraction from database tables and Spark Dataframes
  • Integrates with the Catalyst query optimiser and supports robust SQL Pushdown
  • Accelerates machine-learning (ML) workloads

We need three jar files uploaded into our project directory in Deepnote:

We'll use the Iris flower data set for a quick and simple example of data that we could load into SingleStoreDB. We can obtain the CSV file data and upload it into our project directory in Deepnote.

Next, we’ll need to install PySpark from our notebook:

! sudo apt-get update
! sudo mkdir -p /usr/share/man/man1
! sudo apt-get install -y openjdk-11-jdk
! pip install pyspark==3.2.1

 

and then we can prepare our SparkSession:

from pyspark.sql import SparkSession
spark = (SparkSession
           .builder
           .config(
               "spark.jars",
               "jars/singlestore-jdbc-client-1.0.1.jar, \
                jars/singlestore-spark-connector_2.12-4.0.0-spark-3.2.0.jar, \
                jars/spray-json_3-1.3.6.jar"
           )
           .getOrCreate() )

 

We can check the version of Spark as follows:

spark.version

 

The output should be:

'3.2.1'

 

Next, let’s provide connection details for SingleStore DB:

server = "TO DO"
password = "TO DO"
port = "3306"
cluster = server + ":" + port

 

The TO DO for server and password should be replaced with the values obtained from SingleStoreDB Cloud when creating a cluster. We’ll now set some parameters for the Spark Connector:

spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster) spark.conf.set("spark.datasource.singlestore.user", "admin") spark.conf.set("spark.datasource.singlestore.password", password) spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")

 

We can now create a new Spark Dataframe from our Iris CSV data file:

iris_df = spark.read.csv( "data/iris.csv", header = True, inferSchema = True )

 

and we can view the data:

iris_df.show(5)

 

The output should be similar to the following:

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|    species|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows
 
At this point, we can perform a wide range of analyses on our data, such as:
(iris_df
   .groupBy("species")
   .count()
   .show()
)

The output should be similar to the following:

+---------------+-----+
|        species|count|
+---------------+-----+
| Iris-virginica|   50|
|    Iris-setosa|   50|
|Iris-versicolor|   50|
+---------------+-----+

We have three different species of flowers with 50 records each.

We can also find additional information as follows:

(iris_df
   .describe(
       "sepal_length",
       "sepal_width",
       "petal_length",
       "petal_width"
   )
   .show()
)

The output should be similar to the following:

+-------+-------------+------------+-------------+------------+
|summary| sepal_length| sepal_width| petal_length| petal_width|
+-------+-------------+------------+-------------+------------+
|  count|          150|         150|          150|         150|
|   mean| 5.8433333333| 3.054000000|3.75866666666|1.1986666666|
| stddev|0.82806612797|0.4335943113|1.76442041995|0.7631607417|
|    min|          4.3|         2.0|          1.0|         0.1|
|    max|          7.9|         4.4|          6.9|         2.5|
+-------+-------------+------------+-------------+------------+

We can also create a PCA visualization using an example from Plotly:

# https://plotly.com/python/pca-visualization/
import plotly.express as px
from sklearn.decomposition import PCA
import pandas as pd

pandas_iris_df = iris_df.toPandas()

X = pandas_iris_df"sepal_length", "sepal_width", "petal_length", "petal_width"

pca = PCA(n_components = 2)
components = pca.fit_transform(X)

pca_fig = px.scatter(
   components,
   x = 0,
   y = 1,
   color = pandas_iris_df["species"]
)

pca_fig.show()

 

The output should be similar to the following: Principal components analysis

 

Further analysis could include finding correlations, for example:

import matplotlib.pyplot as plt

import seaborn as sns
sns.heatmap(
   pandas_iris_df.corr(),
   cmap = "OrRd",
   annot = True
)
plt.title("Correlations")
plt.plot()

 

The output should be similar to to the following figure.

 

And even further research is possible. There are numerous possibilities.

We can slice and dice the data in this interactive cloud-based environment, perform various analyses, build charts, graphs and more. When we are ready, we can save our data into SingleStoreDB:

(iris_df.write
   .format("singlestore")
   .option("loadDataCompression", "LZ4")
   .mode("overwrite")
   .save("iris_db.iris")
)

 

Next, let’s look at how we could use Machine Learning (ML) in the cloud with SingleStoreDB.

 

2. MindsDB

MindsDB is an example of an AutoML tool with support for in-SQL machine learning . It can seamlessly integrate with SingleStoreDB. Let’s explore how.

We’ll adapt a Time-series forecasting example for Bitcoin. We can obtain the CSV file data and upload it into our project directory in Deepnote.

First, we’ll define our Spark Dataframe schema:

from pyspark.sql.types import *
btc_schema = StructType([
   StructField("ds", DateType(), True),
   StructField("y", FloatType(), True)
])

 

Next, we’ll read the CSV file data into a Spark Dataframe using the schema we just defined:

btc_df = spark.read.csv(
      "data/btc_data.csv",
      header = True,
      dateFormat = "YYYY-MM-DD",
      schema = btc_schema
)

 

and we can view the data:

btc_df.show(5)

 

The output should be similar to the following:

+----------+---------+
|        ds|        y|
+----------+---------+
|2011-12-31| 4.471603|
|2012-01-01|     4.71|
|2012-01-02|      5.0|
|2012-01-03|   5.2525|
|2012-01-04|5.2081594|
+----------+---------+
only showing top 5 rows
 

Our time-series data consists of the Bitcoin price (y) on a particular date (ds).

We’ll now save our data into SingleStoreDB:

(btc_df.write
   .format("singlestore")
   .option("loadDataCompression", "LZ4")
   .mode("overwrite")
   .save("btc_db.btc")
)

 

The following cell should be run if this particular Minds DBDATASOURCE was previously created:

DROP DATASOURCE btc_data;

 

and then we’ll create the DATASOURCE:

CREATE DATASOURCE btc_data
WITH ENGINE = "singlestore",
   PARAMETERS = {
      "user" : "admin",
      "password" : "TO DO",
      "host" : "TO DO",
      "port" : 3306,
      "database" : "btc_db"
   }

 

The TO DO for password and host should be replaced with the values obtained from SingleStoreDB Cloud when creating a cluster.

We can now view the table contents:

SELECT *
FROM btc_data.btc
ORDER BY ds
LIMIT 5;

 

The output should be similar to the following:

+------------+---------+
| ds         | y       |
+------------+---------+
| 2011-12-31 | 4.4716  |
| 2012-01-01 | 4.71    |
| 2012-01-02 | 5.0     |
| 2012-01-03 | 5.2525  |
| 2012-01-04 | 5.20816 |
+------------+---------+
5 rows in set

 

We want to build a model that predicts the price (y). Let’s create our predictor.

The following cell should be run if the PREDICTOR was previously created:

DROP PREDICTOR mindsdb.btc_pred;

 

and we’ll create a new PREDICTOR:

CREATE PREDICTOR btc_pred
FROM btc_data
   (SELECT * FROM btc)
PREDICT y
ORDER BY ds
WINDOW 1;

 

In this first model-building iteration, we’ll use the WINDOW value set to 1. This will use the last row for every prediction. We can see the status by running and re-running the following command:

SHOW PREDICTORS;

 

The status field will change from generating ➔ training ➔ complete. Once complete, we can get the actual and predicted Bitcoin prices as follows:

SELECT orig_table.ds AS date, pred_table.y AS predicted_y,
orig_table.y AS actual_y
FROM btc_data.btc AS orig_table
JOIN mindsdb.btc_pred AS pred_table
WHERE orig_table.ds > '2011-12-30'
LIMIT 10;

 

The output should be similar to the following:

+------------+--------------------+----------+
| date       | predicted_y        | actual_y |
+------------+--------------------+----------+
| 2020-09-14 | 10388.046619416727 |  10314.3 |
| 2020-09-13 | 10337.436201070366 |  10409.8 |
| 2020-09-12 | 10298.611981465672 |  10363.9 |
| 2020-09-11 | 10275.237535814897 |  10299.2 |
| 2020-09-10 | 10157.706201274084 |  10329.1 |
| 2020-09-09 |  10163.48052954538 |  10161.5 |
| 2020-09-08 | 10156.737703604405 |  10175.6 |
| 2020-09-07 | 10209.844514505841 |  10159.2 |
| 2020-09-06 |  10335.93543291758 |  10178.4 |
| 2020-09-05 | 10588.904678141687 |  10324.6 |
+------------+--------------------+----------+
10 rows in set

and a quick way to view this tabular result would be to create a plot:

line_fig = px.line(
                pred_10_df,
                x = "date",
                y = ["actual_y", "predicted_y"]
           )
line_fig.show()

 

An example of the output this may produce is shown below.

To make a new prediction, we can use the following:

SELECT orig_table.ds AS date, pred_table.y AS predicted_y,
orig_table.y AS actual_y
FROM btc_data.btc AS orig_table
JOIN mindsdb.btc_pred AS pred_table
WHERE orig_table.ds > LATEST;

 

The output should be similar to the following:

+------------+-------------------+----------+
| date       | predicted_y       | actual_y |
+------------+-------------------+----------+
| 2020-09-14 |              NULL |  10314.3 |
| NULL       | 7537.247701719254 |     NULL |
+------------+-------------------+----------+
2 rows in set

 

We could also now save all the predictions into SingleStoreDB for future use. First, we’ll need to get all the predictions, rather than just 10:

SELECT orig_table.ds AS date, pred_table.y AS predicted_y,
orig_table.y AS actual_y
FROM btc_data.btc AS orig_table
JOIN mindsdb.btc_pred AS pred_table
WHERE orig_table.ds > '2011-12-30';

 

Then, we’ll create a new Spark Dataframe:

btc_pred_df = spark.createDataFrame(pred_all_df)

 

and write the Dataframe to SingleStoreDB:

(btc_pred_df.write
   .format("singlestore")
   .option("loadDataCompression", "LZ4")
   .mode("overwrite")
   .save("btc_db.pred")
)
 

Summary

This article has shown two quick examples of how easily SingleStoreDB can integrate with other modern technologies to provide a smooth development workflow. In both cases, we developed our solutions entirely in the cloud.

The SingleStoreDB Spark Connector provides a seamless experience when working with Spark Dataframes. We can read and write Spark Dataframes directly without converting them to other data representations, and do not need to use cursors or loops.

The MindsDB integration provides developers with the ability to build ML models quickly using the familiar SQL language syntax.

 
About the Author:
Akmal Chaudhri portrait photo

Akmal Chaudhri is a Senior Technical Evangelist at SingleStore.

Based in London, Akmal’s career spans work with database technology, consultancy, product strategy, technical writing, technical training and more. Through his work, Akmal helps build global developer communities and raise awareness of technology.

You can reach him through Twitter and LinkedIn.





Share this page

Featured Products

Datastax Astra logo

Bring all your data to Generative AI applications with vector search enabled by the most scalable
vector database available.
Try for Free

Neo4j logo

See for yourself how a graph database can make your life easier.
Use Neo4j online for free.

SingleStore logo

The database to transact, analyze and contextualize your data in real time.
Try it today.

RaimaDB logo

RaimaDB, embedded database for mission-critical applications. When performance, footprint and reliability matters.
Try RaimaDB for free.

Milvus logo

Vector database designed for GenAI, fully equipped for enterprise implementation.
Try Managed Milvus for Free

Present your product here