Databricks Connector 1.0.0.1

The Databricks component allows you to list, import, automate, and repair Databricks jobs.

What's New in 1.0.0.1

This version of the Databricks Connector introduces the following new Process Definitions.

  • Redwood_Databricks_RepairJob: Lets you repair a failed Databricks job.

  • Redwood_Databricks_StartCluster, Redwood_Databricks_StopCluster: Let you start and stop a Databricks cluster.

Other improvements are as follows.

  • You can now run a Databricks job by its name, rather than only by its ID.

  • RunMyJobs now writes the status of all tasks within a Databricks Job to the process log.

  • At the end of a job run, RunMyJobs generates an RTX file with a summary of all tasks.

  • The Redwood_Databricks_RunJob and Redwood_Databricks_RunJob_Template Process Definitions have a new Enable Restart Options parameter. When this is set to Y, you can initiate the repair of a failed Databricks job in one click from a RunMyJobs Operator Message.

Prerequisites

Contents of the Component

Object Type | Name | Description
--- | --- | ---
Application | GLOBAL.Redwood.REDWOOD.Databricks | Integration connector with the Databricks system
Constraint Definition | REDWOOD.Redwood_DatabricksConnectionConstraint | Constraint for Databricks Connection fields
Constraint Definition | REDWOOD.Redwood_DatabricksNotRunningClusterConstraint | Constraint for Databricks Clusters fields
Constraint Definition | REDWOOD.Redwood_DatabricksNotTerminatedClusterConstraint | Constraint for Databricks Clusters fields
Extension Point | REDWOOD.Redwood_DatabricksConnection | Databricks Connector
Process Definition | REDWOOD.Redwood_Databricks_ImportJob | Import a job from Databricks
Process Definition | REDWOOD.Redwood_Databricks_RepairJob | Repair a failed Databricks Job Run
Process Definition | REDWOOD.Redwood_Databricks_RunJob | Run a job in Databricks
Process Definition | REDWOOD.Redwood_Databricks_RunJob_Template | Template definition to run a job in Databricks
Process Definition | REDWOOD.Redwood_Databricks_ShowJobs | List all existing jobs in Databricks
Process Definition | REDWOOD.Redwood_Databricks_StartCluster | Start a cluster in Databricks
Process Definition | REDWOOD.Redwood_Databricks_StopCluster | Stop a cluster in Databricks
Process Definition Type | REDWOOD.Redwood_Databricks | Databricks Connector
Library | REDWOOD.Redwood_Databricks | Library for Databricks connector

Redwood_Databricks_ImportJob

Imports one or more Databricks jobs as RunMyJobs Process Definitions. Specify a Job Name Filter to control which jobs are imported, and Generation Settings to control the attributes of the imported definitions.

Tab | Name | Description | Documentation | Data Type | Direction | Default Expression | Values
--- | --- | --- | --- | --- | --- | --- | ---
Parameters | connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In | |
Parameters | filter | Job Name Filter | Limits the jobs returned to those whose names match the filter. Wildcards * and ? are allowed. | String | In | |
Parameters | overwrite | Overwrite Existing Definition | When set to Yes, a definition that already exists with the same name as the name generated for the imported job is overwritten by the new import. When set to No, the import of that job is skipped if a definition with the same name already exists. | String | In | N | Y, N
Generation Settings | identifier | Job Identifier | The field to use as the Job Identifier on the imported definitions. | String | In | JobName | JobName, JobID
Generation Settings | targetPartition | Partition | The Partition in which to create the new definitions. | String | In | |
Generation Settings | targetApplication | Application | The Application in which to create the new definitions. | String | In | |
Generation Settings | targetQueue | Default Queue | The default Queue to assign to the generated definitions. | String | In | |
Generation Settings | targetPrefix | Definition Name Prefix | The prefix to prepend to the name of the imported Databricks job to form the definition name. | String | In | CUS_DBCKS_ |
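
The Job Name Filter uses familiar glob semantics: * matches any run of characters and ? matches a single character. The following is a minimal Python illustration of that matching behavior using fnmatch; it is only a sketch of the semantics, not the connector's internal implementation.

```python
from fnmatch import fnmatchcase

def matches_filter(job_name: str, name_filter: str) -> bool:
    # * matches any run of characters; ? matches exactly one character.
    return fnmatchcase(job_name, name_filter)

jobs = ["daily_etl_load", "adhoc_report", "nightly_etl_merge"]
print([j for j in jobs if matches_filter(j, "*_etl_*")])
# -> ['daily_etl_load', 'nightly_etl_merge']
```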

Redwood_Databricks_RepairJob

Repairs a failed Databricks Job Run.

Name | Description | Documentation | Data Type | Direction | Default Expression | Values
--- | --- | --- | --- | --- | --- | ---
connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In | |
jobRunId | Job Run Id | The unique ID of the Job Run to repair. | String | In | |
lastRepairId | Last Repair Id | The repair ID of the last repair run, if this Job Run has previously been repaired. | String | In | |
enableRestartOptions | Enable Restart Options | Set this to Y to enable restart options for the Databricks Job. If the Databricks Job fails, the RunMyJobs process goes to status Console and awaits the reply to a generated Operator Message before proceeding. | String | In | N | Y, N
sparkJarParameters | Spark Jar Parameters | An array of Spark Jar parameters to be used on the Databricks Job. | String | In | |
sparkSubmitParameters | Spark Submit Parameters | An array of Spark Submit parameters to be used on the Databricks Job. | String | In | |
notebookParameters | Notebook Parameters | An array of key=value pairs of Notebook parameters to be used on the Databricks Job. | String | In | |
pythonParameters | Python Parameters | An array of Python parameters to be used on the Databricks Job. | String | In | |
pythonNamedParameters | Python Named Parameters | An array of key=value pairs of Python Named parameters to be used on the Databricks Job. | String | In | |
sqlParameters | SQL Parameters | An array of key=value pairs of SQL parameters to be used on the Databricks Job. | String | In | |
dbtParameters | DBT Parameters | An array of DBT parameters to be used on the Databricks Job. | String | In | |
pipelineFullRefresh | Pipeline Full Refresh | Whether to perform a full refresh on the Databricks Pipeline Job. | String | In | | Y, N
runId | Databricks Run Id | The Job Run ID of the Databricks Job. | String | Out | |
repairId | Databricks Repair Id | The Repair ID for this repair run. | String | Out | |
taskSummary | Task Summary | Summary of all tasks that were part of this run. | Table | Out | |
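
For reference, repairing a run maps onto the Databricks Jobs API 2.1 runs/repair endpoint. The sketch below is a minimal, hypothetical illustration of that call; the host, token, and parameter values are placeholders, and the connector's actual requests may differ.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder

def repair_run(run_id: int, last_repair_id: int | None = None,
               notebook_params: dict | None = None) -> int:
    """Repair a failed job run and return the new repair_id.

    Passing latest_repair_id chains this repair onto a previous repair,
    mirroring the Last Repair Id input parameter above.
    """
    payload = {"run_id": run_id, "rerun_all_failed_tasks": True}
    if last_repair_id is not None:
        payload["latest_repair_id"] = last_repair_id
    if notebook_params:
        payload["notebook_params"] = notebook_params
    resp = requests.post(f"{HOST}/api/2.1/jobs/runs/repair",
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         json=payload)
    resp.raise_for_status()
    return resp.json()["repair_id"]
```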

Redwood_Databricks_RunJob

Runs a Databricks job and monitors it until completion. The RunMyJobs process remains in a Running state until the Databricks job completes. If the Databricks Job succeeds, the RunMyJobs process completes successfully; if it fails, the RunMyJobs process completes in Error, and any available error information is written to the stdout.log file. Parameters on the definition pass input parameters to the different types of Databricks tasks. For example, a value added to the Python Parameters parameter is made available to all Python tasks in the Databricks Job. If the job does not require parameters for a certain task type, leave that parameter empty. See the parameters table below for more information; a sketch of the underlying API interaction follows the table.

Name | Description | Documentation | Data Type | Direction | Default Expression | Values
--- | --- | --- | --- | --- | --- | ---
connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In | |
jobId | Job ID to run | The ID of the Databricks job to execute. | String | In | |
jobName | Job Name | The name of the job to run. This can be provided instead of the Job ID. | String | In | |
enableRestartOptions | Enable Restart Options | Set this to Y to enable restart options for the Databricks Job. If the Databricks Job fails, the RunMyJobs process goes to status Console and awaits the reply to a generated Operator Message before proceeding. | String | In | N |
sparkJarParameters | Spark Jar Parameters | An array of Spark Jar parameters to be used on the Databricks Job. | String | In | |
sparkSubmitParameters | Spark Submit Parameters | An array of Spark Submit parameters to be used on the Databricks Job. | String | In | |
notebookParameters | Notebook Parameters | An array of key=value pairs of Notebook parameters to be used on the Databricks Job. | String | In | |
pythonParameters | Python Parameters | An array of Python parameters to be used on the Databricks Job. | String | In | |
pythonNamedParameters | Python Named Parameters | An array of key=value pairs of Python Named parameters to be used on the Databricks Job. | String | In | |
sqlParameters | SQL Parameters | An array of key=value pairs of SQL parameters to be used on the Databricks Job. | String | In | |
dbtParameters | DBT Parameters | An array of DBT parameters to be used on the Databricks Job. | String | In | |
pipelineFullRefresh | Pipeline Full Refresh | Whether to perform a full refresh on the Databricks Pipeline Job. | String | In | | Y=Yes, N=No
runId | Databricks Run ID | The Run ID of the executed Job on the Databricks side. | String | Out | |
taskSummary | Task Summary | Summary of all tasks that were part of this run. | Table | Out | |
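
The run-and-monitor behavior described above corresponds to the Databricks Jobs API 2.1 run-now and runs/get endpoints: trigger the run, then poll until it reaches a terminal state. Below is a minimal sketch with placeholder host and token; the connector's polling interval, parameter handling, and error reporting may differ.

```python
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

def run_job_and_wait(job_id: int, notebook_params: dict | None = None) -> str:
    # Trigger the run. Task-type parameters (notebook_params, python_params,
    # jar_params, ...) are only passed to tasks of the matching type.
    body = {"job_id": job_id}
    if notebook_params:
        body["notebook_params"] = notebook_params
    run_id = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                           headers=HEADERS, json=body).json()["run_id"]

    # Poll until the run leaves the pending/running life-cycle states.
    while True:
        state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                             headers=HEADERS,
                             params={"run_id": run_id}).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED",
                                         "INTERNAL_ERROR"):
            # A result_state of SUCCESS maps to a successful RunMyJobs
            # process; anything else maps to Error.
            return state.get("result_state", state["life_cycle_state"])
        time.sleep(30)

print(run_job_and_wait(123, {"env": "prod"}))
```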

Redwood_Databricks_RunJob_Template

This template definition is provided to facilitate creating definitions that run specific Databricks jobs. Its functionality and parameters are the same as those of the Redwood_Databricks_RunJob definition. To create a definition, choose New (from Template) from the context menu of Redwood_Databricks_RunJob_Template.

Note: To provide a default value for the template's Connection parameter, you must use the full Business Key of the Connection: EXTConnection:<Partition>.<ConnectionName>. Example: EXTConnection:GLOBAL.MyDatabricksConnection

Redwood_Databricks_ShowJobs

Lists all existing jobs in Databricks. Job properties for the returned jobs are written to the stdout.log file, to a file named listing.rtx, and to the Job Listing Out parameter.

Name | Description | Documentation | Data Type | Direction
--- | --- | --- | --- | ---
connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In
filter | Job Name Filter | Limits the jobs returned to those whose names match the filter. Wildcards * and ? are allowed. | String | In
listing | Job Listing | The listing of all available jobs that match the input filter (or all jobs if no filter is provided). | Table | Out
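
For reference, listing jobs corresponds to the Databricks Jobs API 2.1 jobs/list endpoint, which returns results in pages. The sketch below uses placeholder host and token values; the connector's own request handling may differ.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

def list_jobs() -> list[dict]:
    """Page through all jobs in the workspace via the Jobs API 2.1."""
    jobs, params = [], {}
    while True:
        page = requests.get(f"{HOST}/api/2.1/jobs/list",
                            headers=HEADERS, params=params).json()
        jobs.extend(page.get("jobs", []))
        if not page.get("has_more"):
            return jobs
        params["page_token"] = page["next_page_token"]

for job in list_jobs():
    print(job["job_id"], job["settings"]["name"])
```

A Job Name Filter can then be applied client-side with glob matching, as in the sketch under Redwood_Databricks_ImportJob.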

Redwood_Databricks_StartCluster

Starts a cluster in Databricks.

Name | Description | Documentation | Data Type | Direction
--- | --- | --- | --- | ---
connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In
clusterId | Cluster to start | The ID of the Databricks cluster to start. | String | In

Redwood_Databricks_StopCluster

Stops a cluster in Databricks.

Name | Description | Documentation | Data Type | Direction
--- | --- | --- | --- | ---
connection | Connection | The Connection object that defines the connection to the Databricks application. | String | In
clusterId | Cluster to stop | The ID of the Databricks cluster to stop. | String | In
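
Both cluster operations map onto the Databricks Clusters API 2.0. Note that "stop" corresponds to the clusters/delete endpoint, which terminates the cluster while keeping its configuration so it can be started again. A minimal sketch with placeholder host and token follows; the connector's own calls may differ.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

def start_cluster(cluster_id: str) -> None:
    # Start a terminated cluster.
    requests.post(f"{HOST}/api/2.0/clusters/start", headers=HEADERS,
                  json={"cluster_id": cluster_id}).raise_for_status()

def stop_cluster(cluster_id: str) -> None:
    # "Stop" terminates the cluster; its configuration is retained and it
    # can be started again later with clusters/start.
    requests.post(f"{HOST}/api/2.0/clusters/delete", headers=HEADERS,
                  json={"cluster_id": cluster_id}).raise_for_status()
```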

Setup

  1. Locate the Databricks component in the Catalog and install it.
  2. Navigate to Custom > Connections.
  3. Create a new connection.
  4. Click the Databricks connection type.

  5. Click Next or Basic Properties, then create a Queue and Process Server for the connector. All required settings will be set up automatically.

  6. Click Next or Security, then specify which roles can access the connection information. It is recommended to grant each such role at least the following additional privileges: View on the Databricks Connector Process Server, View Processes on the Databricks Connector Jobs Queue, View on the REDWOOD.Redwood_Databricks library, and Submit on any Process Definitions that users with the role will submit.
  7. Click Next or Databricks Connection Properties. You have two options for authenticating with Databricks; a quick connectivity check for either option is sketched after these steps.
    • Databricks Basic Authentication. Enter the URL for your Databricks instance, your Username, and your Password.

    • Databricks Personal Access Token. Enter the URL of your Databricks instance and your Access Token.

  8. Click Save & Close.
  9. Navigate to Environment > Process Server, locate your Databricks Connector Process Server, start it, and make sure it reaches status Running.
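
Both authentication options map onto standard HTTP authentication against the Databricks REST API: basic authentication sends the username and password, while a personal access token is sent as a Bearer token. The snippet below is a minimal, hypothetical way to sanity-check a URL and credentials before saving the connection; the host and credential values are placeholders, and this check is not part of the connector itself.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder URL

# Option 1: Basic authentication (username and password).
basic = requests.get(f"{HOST}/api/2.1/jobs/list",
                     auth=("<username>", "<password>"))

# Option 2: Personal access token, sent as a Bearer token.
token = requests.get(f"{HOST}/api/2.1/jobs/list",
                     headers={"Authorization": "Bearer <access-token>"})

# A 200 status confirms the URL and credentials are valid.
print(basic.status_code, token.status_code)
```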

Listing Databricks Jobs

To retrieve a list of Databricks jobs:

  1. Navigate to Definitions > Processes and submit Redwood_Databricks_ShowJobs.
  2. Choose the Connection.

  3. Choose a Namespace.

  4. To specify a search string for the job name, enter a value in the Job Name Filter field. Wildcards * and ? are supported.

  5. Submit the Process Definition.

Importing a Databricks Job

To import a Databricks job:

  1. Navigate to Applications > Redwood > Databricks, and submit Redwood_Databricks_ImportJob.

  2. On the Parameters tab, do this:

    1. Choose the Connection.

    2. To specify a search string for the job name, enter a value in the Job Name Filter field. Wildcards * and ? are supported.

    3. Choose an option from the Overwrite Existing Definition drop-down list.

  3. On the Generation Settings tab, do this:

    1. Choose an option from the Job Identifier drop-down list.

    2. Optionally specify a Partition, Application, and/or Default Queue.

    3. In the Definition Name Prefix field, enter a prefix to add onto the name of the imported Databricks Job when creating the name of the Process Definition.

  4. Click Submit.

Running a Databricks Job

To run a Databricks job:

  1. In the Databricks Application, submit Redwood_Databricks_RunJob.

  2. In the Parameters tab, specify the parameters you want to use for the job. For more information, see Redwood_Databricks_RunJob.

  3. Submit the Process Definition.

Running a Databricks Job with a Template

To create a customized Process Definition, optionally with default values, for a Databricks job:

  1. Right-click the Redwood_Databricks_RunJob_Template Process Definition and choose New (from Template) from the context menu. The New Process Definition pop-up window displays.

  2. Choose a Partition.

  3. Enter a Name.

  4. Delete the default Application value (if any) and substitute your own Application name if desired.
  5. In the Parameters tab, enter any Default Expressions you want to use.

    • When specifying the Connection value, use the format EXTConnection:<Partition>.<ConnectionName>.

  6. Save and then submit the new Process Definition.

Repairing a Databricks Job

If a step in a Databricks job fails (for example, because of a bad parameter or a temporary network connectivity issue), you can click Repair run for that job in the Databricks user interface, and the job resumes from the step that failed rather than starting over from scratch. Being able to do this from RunMyJobs makes it easier to address issues that cause, for example, a Chain to fail in the middle of execution, without having to switch to the Databricks user interface.

There are two ways to repair a failed Databricks job in RunMyJobs.

  • When you submit the Redwood_Databricks_RunJob Process Definition, set the Enable Restart Options parameter to Y. If the Databricks job fails, RunMyJobs will generate an Operator Message. Once the issue has been resolved, the Operator can choose Repair Databricks Job from the Reply drop-down list in the Operator Reply dialog box to repair the job immediately.

  • Run the Redwood_Databricks_RepairJob Process Definition. This approach allows you to change the job's parameters if necessary.

Note: It is possible that a call to the Redwood_Databricks_RepairJob Process Definition may itself fail. If you manually rerun the Redwood_Databricks_RepairJob Process Definition to repair the job again, make sure you enter the Repair ID from the failed repair run (you can find this in the repairId output parameter) as the Last Repair Id input parameter. That way, Databricks knows where to pick up repairing the job again. (If you use the Repair Databricks Job option in the Operator Reply dialog box, rather than manually resubmitting the Process Definition, the Repair ID is sent to Databricks automatically.)
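
Assuming a repair_run helper like the sketch under Redwood_Databricks_RepairJob, the bookkeeping this note describes amounts to carrying the last Repair ID forward on each attempt. The wait_for_repair_success helper below is purely hypothetical (it would poll /api/2.1/jobs/runs/get until the repair finishes).

```python
last_repair_id = None
while True:
    # Submit a repair, chained onto the previous attempt if there was one.
    repair_id = repair_run(run_id=456, last_repair_id=last_repair_id)
    if wait_for_repair_success(run_id=456):  # hypothetical polling helper
        break
    # The repair itself failed: record its ID so the next attempt can pass
    # it, just as the Last Repair Id input parameter does.
    last_repair_id = repair_id
```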

Starting a Databricks Cluster

To start a Databricks cluster:

  1. Right-click the Redwood_Databricks_StartCluster Process Definition and choose Submit from the context menu.

  2. Choose the Connection.

  3. Select the name of the cluster to start from the Cluster to start drop-down list.

  4. Submit the Process Definition.

Stopping a Databricks Cluster

To stop a Databricks cluster:

  1. Right-click the Redwood_Databricks_StopCluster Process Definition and choose Submit from the context menu.

  2. Choose the Connection.

  3. Select the name of the cluster to stop from the Cluster to stop drop-down list.

  4. Submit the Process Definition.