Spash is a command line tool for Big Data platforms that simulates a real Unix environment, providing most of the commands of a typical Bash shell on top of YARN, HDFS and Apache Spark.
Spash uses the HDFS APIs to execute simple file operations and Apache Spark to perform parallel computations on big datasets. With Spash, managing a Big Data cluster becomes as natural as writing bash commands.
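To make that division of labour concrete, here is a minimal Java sketch (illustrative only, not code from the Spash repository) of how a shell command could be dispatched: a plain ls needs nothing more than an HDFS FileSystem metadata call, while a pipeline such as cat myFile3 | grep -v 2 can be translated into a Spark filter job. The class and method names are hypothetical; the Hadoop and Spark APIs used are the standard ones.

// Illustrative sketch only (not the actual Spash code): an "ls" maps to an HDFS
// metadata call, while a "grep"-style pipeline maps to a Spark job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SpashDispatchSketch {

    // "ls /": a simple file operation served directly by the HDFS API.
    static void ls(String dir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path(dir))) {
            System.out.println(status.getPath().getName());
        }
    }

    // "cat file | grep -v pattern": a parallel computation executed by Spark.
    static void grepInvert(String file, String pattern) {
        SparkConf conf = new SparkConf().setAppName("spash-grep");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile(file)
              .filter(line -> !line.contains(pattern))
              .collect()
              .forEach(System.out::println);
        }
    }
}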
Spash is an open source project based on an idea by Nicola Barbieri. The code is available at https://github.com/nerdammer/spash.
Architecture
The Spash daemon runs on an edge node of a Big Data cluster and listens for incoming SSH connections on port 2222.
Clients can connect using the OS's native terminal application, or PuTTY on Windows.
The Spash daemon emulates a Unix OS and leverages the power of Spark to perform efficient computations on distributed data.
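For illustration, the sketch below shows how an embedded SSH daemon listening on port 2222 with password authentication can be started in Java using Apache MINA SSHD. Whether Spash relies on this exact library and configuration is an assumption, and the shell factory that would parse and execute commands is omitted.

// Illustrative sketch only: an embedded SSH daemon on port 2222 using Apache MINA SSHD.
// Whether Spash uses this exact library and setup is an assumption.
import org.apache.sshd.server.SshServer;
import org.apache.sshd.server.keyprovider.SimpleGeneratorHostKeyProvider;

public class SpashDaemonSketch {
    public static void main(String[] args) throws Exception {
        SshServer sshd = SshServer.setUpDefaultServer();
        sshd.setPort(2222);  // the port Spash listens on
        sshd.setKeyPairProvider(new SimpleGeneratorHostKeyProvider());
        // Toy credential check matching the demo ("user"/"user"): password equals user name.
        sshd.setPasswordAuthenticator((user, password, session) -> user.equals(password));
        // A real daemon would also install a shell/command factory here that parses the
        // incoming command line and dispatches it to the HDFS API or to Spark.
        sshd.start();
        Thread.currentThread().join();  // keep the daemon running
    }
}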
The World Before Spash
For those who don’t remember the classic way of doing simple operations on HDFS, here’s a reminder:
bash$ hdfs dfs -ls /
#
##
### [wait for the JVM to load and execute]
...
##########################################
bash$ hdfs dfs -copyFromLocal myFile /
#
##
### [wait for the JVM to load and execute]
...
##########################################
bash$
Spash Makes It Easier
Just run the Spash daemon on a node of your Big Data cluster. You can then connect to it using ssh user@hdfshost -p 2222 (the password is user) and run all your favourite bash commands to manipulate data.
user@spash:/$ echo "Maybe I can provide a real life example..."
Maybe I can provide a real life example...
user@spash:/$ echo "My content" > myFile
user@spash:/$ ls
myFile
user@spash:/$ cat myFile
My content
user@spash:/$ echo text2 > myFile2
user@spash:/$ ls -l
drwxr-xr-x 4 user supergroup 0 Mar 30 18:34 myFile
drwxr-xr-x 4 user supergroup 0 Mar 30 18:35 myFile2
user@spash:/$ cat myFile myFile2 > myFile3
user@spash:/$ cat myFile3
My content
text2
user@spash:/$ cat myFile3 | grep -v 2
My content
user@spash:/$ exit
And this is just the tip of the iceberg.
From your host, you can use scp (scp -P 2222 myfile user@hdfshost:/) to copy a file into HDFS. You can even use FileZilla or WinSCP to browse HDFS.
Spash works on Cloudera CDH 5 and many other platforms.
Spash is a proof of concept and needs contributions. See the project home page to learn how to contribute and get involved: https://github.com/nerdammer/spash.