Apache Spark UI and key metrics

“Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.”

https://spark.apache.org/

Web Interfaces

  • A list of scheduler jobs and stages, along with the Storage, Environment, Executors, and SQL / DataFrame tabs
  • A summary of RDD sizes and memory usage
  • Environment and storage information
  • Information about the running executors (a quick way to locate the UI from code is shown below)
Spark UI (screenshot)
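
If you are running Spark yourself, the quickest way to find where this UI is served is to ask the driver directly. A minimal sketch, assuming an existing SparkSession (the app name below is just a placeholder):

from pyspark.sql import SparkSession

# reuse the active session or create one; the UI address is exposed on the SparkContext
spark = SparkSession.builder.appName("UiUrlCheck").getOrCreate()

# typically http://<driver-host>:4040 for a local application
print(spark.sparkContext.uiWebUrl)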

Spark UI API Endpoints

  • All Applications: /applications
  • All Executors: /applications/{app_id}/executors
  • All Jobs: /applications/{app_id}/jobs
  • Job Details: /applications/{app_id}/jobs/{job_id}
  • All Stages: /applications/{app_id}/stages
  • Stage Attempts: /applications/{app_id}/stages/{stage_id}?withSummaries={with_summaries}
  • Stage Details: /applications/{app_id}/stages/{stage_id}/{attempt_id}
  • Task Summary: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskSummary
  • Task List: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskList?length=50000
  • SQL Details: /applications/{app_id}/sql?offset={offset}&length={length}
  • SQL Single Execution Details: /applications/{app_id}/sql/{execution_id}

Use these endpoints to gather detailed information about your Spark application’s performance and status.
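
For example, the same information the UI renders can be pulled as JSON. The sketch below is a rough illustration that assumes the driver is local with the UI on port 4040, where these endpoints are served under the /api/v1 root, and uses the requests library:

import requests

BASE = "http://localhost:4040/api/v1"  # assumed driver host and UI port

# list the applications known to this UI and take the first one
apps = requests.get(f"{BASE}/applications").json()
app_id = apps[0]["id"]

# pull executor and job details for that application
executors = requests.get(f"{BASE}/applications/{app_id}/executors").json()
jobs = requests.get(f"{BASE}/applications/{app_id}/jobs").json()

print(f"executors: {len(executors)}, jobs: {len(jobs)}")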

Access Spark UI on Google Colab using Ngrok

# create tunnel for spark ui
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
      "from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()

# the Spark UI is served on port 4040 by default
ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")
Ngrok access requires the pyngrok package, which you can install with pip install pyngrok.

Alternatively, you can use the Google Colab Python package; make sure the path is set to /jobs/index.html:

from google.colab import output
output.serve_kernel_port_as_window(4040, path='/jobs/index.html')

Explore Spark UI and inspect metrics

%%time
from pyspark.sql import SparkSession
scala_version = '2.12'  
spark_version = '3.5.0'
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.6.1'
]
spark = (SparkSession.builder
         .appName("SparkConfigExample")
         .config("spark.jars.packages", ",".join(packages))
         .getOrCreate()
)
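
The values passed to the builder (such as spark.jars.packages) appear in the Spark UI's Environment tab. As a quick cross-check, the resolved configuration can also be read back from the SparkConf; a minimal sketch:

# print the configuration the session actually resolved;
# the same values are listed under Environment in the Spark UI
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")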

# create dataframe
df = spark.read.parquet("/content/content/broadoak-logs.parquet")
df.cache()

# group by
(df.filter(df.endpoint == '/xmlrpc.php')
   .groupBy('ipaddress')
   .count()
   .orderBy('count', ascending=False)
   .show(2000, truncate=False))
Screenshots (not reproduced here): the group by is a wide transformation, and its effects can be traced through the Spark UI's Jobs, Storage, Stages, completed tasks, Executors, and SQL / DataFrame views, down to the details page for the individual query.
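
Beyond clicking through the UI, some of the same job and stage metrics can be read programmatically on the driver. A minimal sketch using PySpark's status tracker, run after the group by above (getJobIdsForGroup(None) returns jobs that were not submitted under an explicit job group):

tracker = spark.sparkContext.statusTracker()

for job_id in tracker.getJobIdsForGroup(None):
    job = tracker.getJobInfo(job_id)
    if job is None:
        continue
    print(f"job {job_id}: status={job.status}")
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)
        if stage is not None:
            print(f"  stage {stage_id}: {stage.name}, "
                  f"{stage.numCompletedTasks}/{stage.numTasks} tasks completed")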