Connecting pyspark to CrateDB inside Jupyter Notebooks

I am trying to retrieve data from my CrateDB database using PySpark in a Jupyter notebook.

Here is my code:

from pyspark.sql import SparkSession
import crate
import os

spark = SparkSession.builder.appName("ConnectToCrateDB").getOrCreate()
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.crate:crate-jdbc-standalone:2.6.0 pyspark-shell'

df = spark.read \
    .format("jdbc") \
    .option("url", "crate://address:4200/") \
    .option("dbtable", "tablename") \
    .option("user", "crate") \
    .load()

However, it keeps giving me the following error:

Py4JJavaError: An error occurred while calling o107.load.
: java.sql.SQLException: No suitable driver

Can someone please help me with the setup?

Hi @suchrandomstuff,

can you please try adding an option call that sets the driver class (.option("driver", "io.crate.client.jdbc.CrateDriver"))?

You can also use a standard PostgreSQL JDBC driver instead, we have a Spark-based example of how to connect in this post:

Hi, thanks a lot for your response.

I get the following error after adding that option:

Py4JJavaError: An error occurred while calling o141.load.
: java.lang.ClassNotFoundException: io.crate.client.jdbc.CrateDriver

I am not sure how or where to add the driver within a Jupyter notebook.

Haven’t tested it, but maybe the solution suggested here in the reply works?

This doesn’t work either. Starting to seem impossible, lol. Been on it for 2 days now.

Dear @suchrandomstuff,

thank you for writing in, and for evaluating CrateDB in the context of Jupyter Notebooks and Spark, which also sparks [sic!] my interest. I can look into further details of this topic next week.

In general, to second @hammerhead, it is recommended to use the vanilla PostgreSQL JDBC driver [1] with CrateDB. Also in general, when aiming to connect to the PostgreSQL-compatible interface of CrateDB, addressing it on port 4200 is probably wrong, because this is the standard port of its HTTP interface.

Fixing the port will probably not resolve your error, because it looks like the application is not even connecting to CrateDB, but already fails when loading the driver. Still, I wanted to make you aware of the details I’ve spotted in your original post.
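
To illustrate the driver-loading point: as far as I know, PySpark only reads PYSPARK_SUBMIT_ARGS when it launches the JVM, so the variable has to be set before the first SparkSession is created, and the flag is spelled --packages with two plain hyphens. A minimal, untested sketch follows; the helper names and the driver version 42.7.3 are my own assumptions, not something confirmed in this thread.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls the given Maven packages."""
    return "--packages " + ",".join(packages) + " pyspark-shell"


def start_session(app_name, packages):
    # Set the variable BEFORE the JVM starts; changing it afterwards has no
    # effect on a session that is already running.
    os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)

    # Imported lazily so build_submit_args() works without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


# Assumed Maven coordinates of the PostgreSQL JDBC driver; adjust the version.
PG_DRIVER = "org.postgresql:postgresql:42.7.3"

print(build_submit_args([PG_DRIVER]))
```

In a notebook where a session already exists, the kernel most likely needs a restart before a call like start_session("ConnectToCrateDB", [PG_DRIVER]) picks up the package.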

Please let us know about the outcome when using the vanilla driver, where the correct driver class name is org.postgresql.Driver.
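
For reference, a rough sketch of what the read could look like with the vanilla driver. It is untested; the function names and the driver version 42.7.3 are assumptions of mine, "address" and "tablename" are the placeholders from your post, and 5432 is the default port of CrateDB's PostgreSQL interface.

```python
def cratedb_jdbc_options(host, table, user="crate", port=5432):
    """Collect Spark JDBC reader options for CrateDB's PostgreSQL interface."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
    }


def read_table(host, table):
    # Imported lazily so the options helper works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("ConnectToCrateDB")
        # Assumed Maven coordinates; Spark fetches the driver JAR at startup.
        .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
        .getOrCreate()
    )
    options = cratedb_jdbc_options(host, table)
    return spark.read.format("jdbc").options(**options).load()


# Placeholders taken from the original post.
print(cratedb_jdbc_options("address", "tablename")["url"])
```

The spark.jars.packages setting only takes effect if no Spark session is running yet, so it is worth trying from a freshly restarted kernel.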

With kind regards,
Andreas.


  1. https://jdbc.postgresql.org/