Tag: PySpark

  • Apache Spark UI and key metrics

    "Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters."

    https://spark.apache.org/

    Web Interfaces

    • A list of scheduler jobs, stages, storage, environment, executors, and SQL / DataFrame
    • A summary of RDD sizes and memory usage
    • Environment and storage information
    • Information about the running executors

    Spark UI (screenshot)

    Spark UI API Endpoints

    • All Applications: /applications
    • All Executors: /applications/{app_id}/executors
    • All Jobs: /applications/{app_id}/jobs
    • Job Details: /applications/{app_id}/jobs/{job_id}
    • All Stages: /applications/{app_id}/stages
    • Stage Attempts: /applications/{app_id}/stages/{stage_id}?withSummaries={with_summaries}
    • Stage Details: /applications/{app_id}/stages/{stage_id}/{attempt_id}
    • Task Summary: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskSummary
    • Task List: /applications/{app_id}/stages/{stage_id}/{attempt_id}/taskList?length=50000
    • SQL Details: /applications/{app_id}/sql?offset={offset}&length={length}
    • SQL Single Execution Details: /applications/{app_id}/sql/{execution_id}

    These endpoints are served under the REST API root (/api/v1) of a running application's UI (port 4040 by default) or the history server; use them to gather detailed information about your Spark application's performance and status.
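
    As an example, the endpoints can be queried with any HTTP client. A minimal sketch using the requests package, assuming a locally running driver with the default UI port (adjust base_url for your environment):

    import requests

    # REST API root of a locally running driver (assumption; adjust host/port as needed)
    base_url = "http://localhost:4040/api/v1"

    # list all applications known to this UI and pick the first one
    apps = requests.get(f"{base_url}/applications").json()
    app_id = apps[0]["id"]

    # pull executors and jobs for that application
    executors = requests.get(f"{base_url}/applications/{app_id}/executors").json()
    jobs = requests.get(f"{base_url}/applications/{app_id}/jobs").json()

    for job in jobs:
        print(job["jobId"], job["status"], job["numCompletedTasks"], "tasks completed")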

    Access Spark UI on Google Colab using Ngrok

    # create an ngrok tunnel for the Spark UI
    from pyngrok import ngrok, conf
    import getpass

    print("Enter your authtoken, which can be copied "
          "from https://dashboard.ngrok.com/auth")
    conf.get_default().auth_token = getpass.getpass()
    
    ui_port = 4040  # the Spark UI listens on port 4040 by default
    public_url = ngrok.connect(ui_port).public_url
    print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")
    pyngrok can be installed with pip install pyngrok.
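    Once the tunnel is up, open the printed public_url in a browser (for example public_url/jobs/) to reach the Spark UI.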

    Alternatively, you can use the google.colab Python package; make sure the path is set to /jobs/index.html:

    from google.colab import output
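    # serve the kernel's port 4040 in a new browser window, opening on the Jobs page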
    output.serve_kernel_port_as_window(4040, path='/jobs/index.html')

    Explore Spark UI and inspect metrics

    %%time
    from pyspark.sql import SparkSession
    # Kafka connector packages to pull in via spark.jars.packages
    scala_version = '2.12'
    spark_version = '3.5.0'
    packages = [
        f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
        'org.apache.kafka:kafka-clients:3.6.1'
    ]
    spark = (SparkSession.builder
             .appName("SparkConfigExample")
             .config("spark.jars.packages", ",".join(packages))
             .getOrCreate()
    )
    
    # read the access-log parquet file into a DataFrame and cache it
    df = spark.read.parquet("/content/content/broadoak-logs.parquet")
    df.cache()

    # group by IP address - a wide transformation that triggers a shuffle
    (df.filter(df.endpoint == '/xmlrpc.php')
       .groupby('ipaddress')
       .count()
       .orderBy('count', ascending=False)
       .show(2000, truncate=False))
    
    Screenshots (not reproduced here) walk through the group by (a wide transformation) as it appears across the Spark UI tabs: Jobs, Storage, Stages, completed Tasks, Executors, SQL / DataFrame, and the details pages for the query.
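
    The same job and stage counters can also be read programmatically from inside the notebook through the driver's status tracker; a minimal sketch, assuming the spark session created above is still active:

    # inspect job and stage metrics via the driver's status tracker
    tracker = spark.sparkContext.statusTracker()

    for job_id in tracker.getJobIdsForGroup():
        job = tracker.getJobInfo(job_id)
        if job is None:
            continue
        print(f"job {job_id}: status={job.status}")
        for stage_id in job.stageIds:
            stage = tracker.getStageInfo(stage_id)
            if stage is not None:
                print(f"  stage {stage_id}: {stage.name}, "
                      f"{stage.numCompletedTasks}/{stage.numTasks} tasks completed")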