“Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.”
https://spark.apache.org/
I have been using Apache Spark for developing and testing data pipelines on a single-node machine, mainly using Google Colaboratory. It is vital to understand how resources are being used and how you can monitor the performance of your Spark application.
Web Interfaces
Every SparkContext launches a Web UI, by default on port 4040, that displays useful information about the application. This includes:
- A list of scheduler jobs and stages, plus the Storage, Environment, Executors, and SQL / DataFrame tabs
- A summary of RDD sizes and memory usage
- Environmental and storage information
- Information about the running executors
You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
Note that this information is only available for the duration of the application by default. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log the events that encode the information displayed in the UI to persistent storage.
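As a minimal sketch, event logging can be switched on when the session is built; the log directory below is only an illustration and must already exist:
from pyspark.sql import SparkSession

# hypothetical local directory for the event logs -- adjust to your environment
event_log_dir = "file:///tmp/spark-events"

spark = (SparkSession.builder
         .appName("EventLogExample")
         .config("spark.eventLog.enabled", "true")   # keep events after the app finishes
         .config("spark.eventLog.dir", event_log_dir)
         .getOrCreate())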
Spark UI API Endpoints
You can also use the Spark UI REST API endpoints, mounted under /api/v1 on the UI address, to access different aspects of your Spark application:
- All Applications: /applications
- All Executors: /applications/{app_id}/executors
- All Jobs: /applications/{app_id}/jobs
- Job Details: /applications/{app_id}/jobs/{job_id}
- All Stages: /applications/{app_id}/stages
- Stage Attempts: /applications/{app_id}/stages/{stage_id}?withSummaries={with_summaries}
- Stage Details: /applications/{app_id}/stages/{stage_id}/{attempt_id}
- Task Summary: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskSummary
- Task List: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskList?length=50000
- SQL Details: /applications/{app_id}/sql?offset={offset}&length={length}
- SQL Single Execution Details: /applications/{app_id}/sql/{execution_id}
Use these endpoints to gather detailed information about your Spark application’s performance and status.
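As a quick illustration, the sketch below queries two of these endpoints with the requests library; it assumes the UI is reachable on the default driver port 4040 and looks up the application id rather than hard-coding it:
import requests

# base URL of the REST API exposed by the driver's web UI (default port assumed)
base_url = "http://localhost:4040/api/v1"

# look up the id of the running application
app_id = requests.get(f"{base_url}/applications").json()[0]["id"]

# list the application's jobs with their statuses
for job in requests.get(f"{base_url}/applications/{app_id}/jobs").json():
    print(job["jobId"], job["name"], job["status"])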
Access Spark UI on Google Colab using Ngrok
Make sure you have installed the pyngrok package and already have an ngrok authentication token. A snippet of code is given below.
# create a tunnel for the Spark UI (default port 4040)
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
      "from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()

ui_port = 4040  # default Spark UI port
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")
Alternatively, you can use the Google Colab Python package and make sure the path is set to /jobs/index.html:
from google.colab import output
output.serve_kernel_port_as_window(4040, path='/jobs/index.html')
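If a second SparkContext has pushed the UI to another port (4041, 4042, ...), you can check which port was actually bound before serving it. This assumes a SparkSession named spark is already running, as in the next section:
# the SparkContext reports the address of the web UI it started
print(spark.sparkContext.uiWebUrl)   # e.g. http://<driver-node>:4040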
Explore Spark UI and inspect metrics
We will read a partitioned Parquet file and apply a filter (narrow transformations), then carry out a few group-by operations (a wide transformation, which requires reading data from different partitions, i.e. shuffling).
%%time
from pyspark.sql import SparkSession

scala_version = '2.12'
spark_version = '3.5.0'

# extra packages (Kafka connector) pulled in via spark.jars.packages
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.6.1'
]

spark = (SparkSession.builder
         .appName("SparkConfigExample")
         .config("spark.jars.packages", ",".join(packages))
         .getOrCreate())
# create the dataframe and cache it
df = spark.read.parquet("/content/content/broadoak-logs.parquet")
df.cache()

# group by (wide transformation): count requests to /xmlrpc.php per IP address
(df.filter(df.endpoint == '/xmlrpc.php')
   .groupby('ipaddress')
   .count()
   .orderBy('count', ascending=False)
   .show(2000, truncate=False))
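After the query has run, the same REST API can be used to pull the stage-level metrics behind the numbers shown in the UI. This sketch again assumes the default port 4040 and uses .get() for the shuffle fields in case they are absent in your Spark version:
import requests

base_url = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base_url}/applications").json()[0]["id"]

# shuffle read/write bytes per stage reveal the cost of the wide transformation
for stage in requests.get(f"{base_url}/applications/{app_id}/stages").json():
    print(stage["stageId"], stage["name"],
          stage.get("shuffleReadBytes"), stage.get("shuffleWriteBytes"))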