PySpark – Page 4 – BroadOakData.UK

“Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.”
https://spark.apache.org/

I have been using Apache Spark for developing and testing data pipelines on single-node machine mainly using Google Colaboratory . It is vital to understand how the resources are being used and how can you monitor the performance of your Spark application.

Web Interfaces

Every SparkContext launches a Web UI, by default on port 4040, that displays useful information about the application. This includes:

A list of scheduler jobs, stages, storage, environment, executors and SQL / Dataframe
A summary of RDD sizes and memory usage
Environmental and Storage information.
Information about the running executors

You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).

Note that this information is only available for the duration of the application by default. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage.

Spark UI API Endpoints

You can also use Spark UI API endpoints to access different aspects of your Spark application:

All Applications: /applications
All Executors: /applications/{app_id}/executors
All Jobs: /applications/{app_id}/jobs
Job Details: /applications/{app_id}/jobs/{job_id}
All Stages: /applications/{app_id}/stages
Stage Attempts: /applications/{app_id}/stages/{stage_id}?withSummaries={with_summaries}
Stage Details: /applications/{app_id}/stages/{stage_id}/{attempt_id}
Task Summary: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskSummary
Task List: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskList?length=50000
SQL Details: /applications/{app_id}/sql?offset={offset}&length={length}
SQL Single Execution Details: /applications/{app_id}/sql/{execution_id}

Use these endpoints to gather detailed information about your Spark application’s performance and status.

Access Spark UI on Google Colab using Ngrok

Make sure you install pyngrok package and already have Ngrok Authentication token. A snippet of code is given below.

# create tunnel for spark ui
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
"from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = NGROK_AUTH_KEY

ui_port = 8088
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")

You can use Google Colab paython package and make sure path is set to /jobs/index.html:

from google.colab import output
output.serve_kernel_port_as_window(4040, path='/jobs/index.html')

Explore Spark UI and inspect metrics

We will try to read a partitioned parquet file (narrow transformation) and carry out few group by operations (wide transformation – requires reading data from different partitions – shuffling).

%%time
from pyspark.sql import SparkSession
scala_version = '2.12'  
spark_version = '3.5.0'
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.6.1'
]
spark = (SparkSession.builder
         .appName("SparkConfigExample")
         .config("spark.jars.packages", ",".join(packages))
         .getOrCreate()
)

# create dataframe
df = spark.read.parquet("/content/content/broadoak-logs.parquet")
df.cache()
# group by
df.filter(df.endpoint=='/xmlrpc.php').groupby('ipaddress').count().orderBy('count',ascending=False).show(2000,truncate=False)

Tag: PySpark

Apache Spark UI and key metrics

Web Interfaces

Spark UI API Endpoints

Access Spark UI on Google Colab using Ngrok

Explore Spark UI and inspect metrics