From fabric-skills
Migrates Azure HDInsight Spark/Hive workloads to Microsoft Fabric: replaces HiveContext with SparkSession, converts WASB/ABFS paths to OneLake, transforms Hive DDL to Delta Lake, and maps Oozie workflows to Fabric Pipelines.
Slash command
/fabric-skills:hdinsight-migration
> **Update Check: ONCE PER SESSION (mandatory)**
> The first time this skill is used in a session, run the `check-updates` skill before proceeding.
> - GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
> - Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
> - Skip if the check was already performed earlier in this session.
CRITICAL NOTES
- To find workspace details (including its ID) from a workspace name: list all workspaces, then use JMESPath filtering
- To find item details (including its ID) from workspace ID, item type, and item name: list all items of that type in that workspace, then use JMESPath filtering
- HDInsight has no `mssparkutils` or `dbutils` equivalent; `notebookutils` is a net-new capability being introduced
- `HiveContext` and `SQLContext` are legacy Spark 1.x/2.x APIs; Fabric uses Spark 3.x `SparkSession` exclusively
- `wasb://` paths are deprecated and require a Storage Account key or SAS; replace with OneLake shortcuts
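
The two lookup notes above (list, then filter with JMESPath) can be sketched as a small command builder. This is an illustrative sketch, not the skill's own tooling: the function names are hypothetical, and it assumes the Fabric REST list endpoints return a `value` array whose entries carry `displayName`. It only builds the `az rest` argument list and does not run it.

```python
import shlex

FABRIC_API = "https://api.fabric.microsoft.com"


def _name_filter(display_name: str) -> str:
    # JMESPath: keep entries whose displayName matches, then take the first one.
    # Assumption: the name contains no single quote (it would need escaping).
    return f"value[?displayName=='{display_name}'] | [0]"


def workspace_lookup_command(workspace_name: str) -> list[str]:
    """Build an `az rest` call that lists workspaces and filters by name."""
    return [
        "az", "rest",
        "--method", "get",
        "--resource", FABRIC_API,
        "--url", f"{FABRIC_API}/v1/workspaces",
        "--query", _name_filter(workspace_name),
    ]


def item_lookup_command(workspace_id: str, item_type: str, item_name: str) -> list[str]:
    """Build an `az rest` call that lists items of one type and filters by name."""
    return [
        "az", "rest",
        "--method", "get",
        "--resource", FABRIC_API,
        "--url", f"{FABRIC_API}/v1/workspaces/{workspace_id}/items?type={item_type}",
        "--query", _name_filter(item_name),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for a hypothetical workspace name.
    print(shlex.join(workspace_lookup_command("Sales Analytics")))
```

The returned list can be passed to `subprocess.run` once `az login` has succeeded; the `id` field of the filtered result is the workspace or item ID.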
Read these companion documents before executing migration tasks:
- `az rest`, `az login`, token acquisition, Fabric REST via CLI

For notebook and Lakehouse creation, see spark-authoring-cli. For Fabric Warehouse DDL/DML authoring, see sqldw-authoring-cli.
| Topic | Reference |
|---|---|
| Migration Workload Map | § Migration Workload Map |
| SparkSession & Context API Changes | § SparkSession API Changes |
| WASB / ABFS → OneLake Path Migration | path-migration.md |
| Hive DDL → Delta Lake / Lakehouse Schemas | hive-to-delta.md |
| Oozie → Fabric Pipelines | § Oozie → Fabric Pipelines |
| Introducing `notebookutils` | § Introducing notebookutils |
| Before/After Code Patterns | code-patterns.md |
| Spark Configuration Differences | § Spark Configuration Differences |
| Must / Prefer / Avoid | § Must / Prefer / Avoid |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication |
| Lakehouse Management | SPARK-AUTHORING-CORE.md § Lakehouse Management |
Migration Workload Map

| HDInsight Component | Fabric Target | Notes |
|---|---|---|
| Spark cluster (notebooks, scripts) | Fabric Spark (Lakehouse / Notebooks / SJD) | No persistent cluster — Starter Pool or Custom Pool provides on-demand Spark |
| Hive / HiveServer2 | Lakehouse SQL Endpoint + Lakehouse schemas | Delta Lake replaces Hive metastore; schemas provide namespace equivalent |
| HBase | Fabric Warehouse or Azure Cosmos DB (separate from Fabric) | HBase has no direct Fabric equivalent — assess workload access patterns |
| Oozie workflows | Fabric Data Pipelines | Map Oozie actions to Fabric activities; see § Oozie → Fabric Pipelines |
| YARN Resource Manager | Fabric Spark monitoring (Spark UI, Monitoring Hub) | No YARN — Fabric manages compute automatically |
| Ambari | Fabric Monitoring Hub + Admin Portal | Cluster health, capacity, and job monitoring |
| WASB / ABFS storage | OneLake Shortcuts → abfss://[email protected]/ | See path-migration.md |
| Ranger policies | Fabric workspace roles + OneLake data access roles | Map Ranger row/column filters to Lakehouse row-level security |
| Livy REST server | Fabric Livy API | Compatible endpoint — see SPARK-AUTHORING-CORE.md |
SparkSession API Changes

HDInsight Spark clusters often use legacy Spark 1.x / 2.x API styles. Replace all of these with the unified `SparkSession`:

| Legacy HDInsight Pattern | Fabric Spark 3.x Replacement |
|---|---|
| `from pyspark import SparkContext; sc = SparkContext()` | Not needed: `sc = spark.sparkContext` (pre-instantiated) |
| `from pyspark.sql import HiveContext; hc = HiveContext(sc)` | Not needed: the `spark` session has Hive-compatible SQL support via Delta schemas |
| `from pyspark.sql import SQLContext; sqlc = SQLContext(sc)` | Not needed: use `spark.sql(...)` directly |
| `SparkSession.builder.enableHiveSupport().getOrCreate()` | Not needed in Fabric: `spark` is pre-built and available |
| `sc.textFile("wasb://[email protected]/path")` | `spark.read.text("abfss://[email protected]/lh.Lakehouse/Files/path")` |
| `sqlContext.sql("CREATE TABLE ... STORED AS ORC")` | See hive-to-delta.md for Delta DDL equivalent |

> In Fabric notebooks, `spark` (SparkSession) and `sc` (SparkContext) are pre-instantiated. Do not call `SparkContext()` or `SparkSession.builder...getOrCreate()` at the top of migrated notebooks.
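
Before editing a migrated notebook, it can help to locate every legacy constructor call. The following is a minimal sketch of such a scan, assuming the notebook source is available as plain text. The function name and pattern table are hypothetical, and the comment stripping is naive: a `#` inside a string literal cuts the line short.

```python
import re

# Constructor calls that the table above marks as "Not needed" in Fabric.
LEGACY_PATTERNS = {
    "SparkContext()": re.compile(r"\bSparkContext\s*\("),
    "HiveContext()": re.compile(r"\bHiveContext\s*\("),
    "SQLContext()": re.compile(r"\bSQLContext\s*\("),
    "SparkSession.builder": re.compile(r"\bSparkSession\s*\.\s*builder\b"),
}


def find_legacy_contexts(source: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) pairs for legacy context creation."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop trailing comments (naive)
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(code):
                findings.append((lineno, label))
    return findings


SAMPLE_NOTEBOOK = """from pyspark import SparkContext
sc = SparkContext()
hc = HiveContext(sc)
df = spark.sql("SELECT 1")
"""

if __name__ == "__main__":
    for lineno, label in find_legacy_contexts(SAMPLE_NOTEBOOK):
        print(f"line {lineno}: remove {label}")
```

Import lines are deliberately not flagged, since only the constructor calls collide with the pre-instantiated session; unused imports can be cleaned up separately.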
Oozie → Fabric Pipelines

Map Oozie workflow actions to Fabric Data Pipeline activities:

| Oozie Action Type | Fabric Pipeline Activity | Notes |
|---|---|---|
| `<spark>` action | Notebook activity or Spark Job Definition activity | Pass parameters via notebook cell parameters or SJD arguments |
| `<hive>` action | Script activity (SQL) against Lakehouse SQL Endpoint | Convert HiveQL to Spark SQL or Delta SQL |
| `<shell>` action | Azure Function activity or Web activity | Shell scripts must be refactored; no direct shell execution in Fabric Pipelines |
| `<java>` action | Azure Batch activity (external) or refactor to PySpark | Java MapReduce jobs must be rewritten |
| `<sqoop>` action | Copy Data activity (Fabric Data Factory connector) | Sqoop import/export maps to Fabric Copy Data with JDBC source/sink |
| `<coordinator>` (time-based schedule) | Pipeline schedule trigger | Set recurrence in pipeline trigger; supports cron-like expressions |
| `<coordinator>` (data-triggered) | Storage Event trigger | Trigger on OneLake file arrival |

> Delegate to `spark-authoring-cli` for notebook and SJD creation after mapping pipeline activities.
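
As a starting point for that mapping, an existing `workflow.xml` can be inventoried automatically. The sketch below is illustrative only: it restates the action rows of the table as a dictionary, the function name and sample workflow are hypothetical, and coordinators (which live in a separate `coordinator.xml`) are not covered.

```python
import xml.etree.ElementTree as ET

# Action-type rows of the mapping table, keyed by Oozie element name.
OOZIE_TO_FABRIC = {
    "spark": "Notebook activity or Spark Job Definition activity",
    "hive": "Script activity (SQL) against Lakehouse SQL Endpoint",
    "shell": "Azure Function activity or Web activity",
    "java": "Azure Batch activity (external) or refactor to PySpark",
    "sqoop": "Copy Data activity",
}

# Transition elements inside <action> that are not action types.
CONTROL_TAGS = {"ok", "error"}


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def plan_activities(workflow_xml: str) -> list[tuple[str, str, str]]:
    """Return (action name, Oozie type, suggested Fabric activity) per action."""
    root = ET.fromstring(workflow_xml)
    plan = []
    for element in root.iter():
        if _local(element.tag) != "action":
            continue
        for child in element:
            kind = _local(child.tag)
            if kind in CONTROL_TAGS:
                continue
            target = OOZIE_TO_FABRIC.get(kind, "UNMAPPED: review manually")
            plan.append((element.get("name"), kind, target))
    return plan


# Hypothetical two-step workflow used only for illustration.
SAMPLE_WORKFLOW = """<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo">
  <start to="etl"/>
  <action name="etl">
    <spark xmlns="uri:oozie:spark-action:0.1"/>
    <ok to="load"/><error to="fail"/>
  </action>
  <action name="load">
    <hive xmlns="uri:oozie:hive-action:0.2"/>
    <ok to="end"/><error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>"""

if __name__ == "__main__":
    for name, kind, target in plan_activities(SAMPLE_WORKFLOW):
        print(f"{name}: <{kind}> -> {target}")
```

Action types missing from the dictionary are reported as unmapped rather than dropped, so nothing silently disappears from the migration plan.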
Introducing notebookutils

HDInsight Spark had no built-in utility framework equivalent to `mssparkutils` or `dbutils`. When migrating to Fabric, introduce `notebookutils` for common operations:

| Operation | Old HDInsight Approach | notebookutils Equivalent |
|---|---|---|
| List files | dbutils (N/A) / HDFS CLI | notebookutils.fs.ls("abfss://...") |
| Copy file | HDFS API / shutil | notebookutils.fs.cp(src, dest) |
| Read secret | Azure Key Vault REST call | notebookutils.credentials.getSecret(keyVaultUrl, secretName) |
| Get notebook context | Not available | notebookutils.runtime.context — returns workspace ID, notebook ID, etc. |
| Run child notebook | Not available | notebookutils.notebook.run("notebook_name", timeout, {"param": "value"}) |
| Exit notebook with value | sys.exit() | notebookutils.notebook.exit("value") |
| Mount storage | WASB config in spark-defaults.conf | OneLake Shortcut (no runtime mount needed) |
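
To illustrate the "List files" row, here is a small sketch of a helper that wraps `fs.ls`. The helper name is hypothetical, and it assumes each entry returned by `notebookutils.fs.ls` exposes `name`, `path`, and `isFile` attributes. The file-system object is passed in as an argument so the helper can be exercised outside Fabric with a stand-in, as done here.

```python
from types import SimpleNamespace


def list_files(fs, path: str, suffix: str = "") -> list[str]:
    """Return sorted full paths of files (not directories) under `path`.

    `fs` is expected to behave like `notebookutils.fs`: `fs.ls(path)` yields
    entries with `name`, `path` and `isFile` attributes (assumption).
    """
    return sorted(
        entry.path
        for entry in fs.ls(path)
        if entry.isFile and entry.name.endswith(suffix)
    )


def _fake_ls(path: str):
    """Stand-in for notebookutils.fs.ls, used only for local illustration."""
    return [
        SimpleNamespace(name="part-0001.parquet", path=f"{path}/part-0001.parquet", isFile=True),
        SimpleNamespace(name="_delta_log", path=f"{path}/_delta_log", isFile=False),
        SimpleNamespace(name="notes.txt", path=f"{path}/notes.txt", isFile=True),
    ]


demo_fs = SimpleNamespace(ls=_fake_ls)

if __name__ == "__main__":
    # In a Fabric notebook the call would be:
    #   list_files(notebookutils.fs, "Files/raw/orders", ".parquet")
    print(list_files(demo_fs, "Files/raw/orders", ".parquet"))
```

Injecting the file-system object keeps migrated helper code testable in CI, where `notebookutils` is not importable.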
Spark Configuration Differences

| HDInsight Concept | Fabric Spark Equivalent | Migration Action |
|---|---|---|
| `spark-defaults.conf` (cluster-wide) | Fabric Spark Workspace Settings + Environment item | Move config properties to Environment or use `%%configure` in notebooks |
| `%%configure` magic | `%%configure` magic (identical) | No change needed |
| YARN queue / resource allocation | Fabric Spark pool node size and autoscale settings | Map queue SLAs to Custom Pool configuration |
| Ambari service configs (HDFS, YARN tuning) | Not applicable: Fabric manages infrastructure | Remove; focus on application-level Spark configs |
| HDI Spark version (e.g., Spark 2.4) | Fabric Runtime 1.3 = Spark 3.5 (latest) | Test for deprecated API removals (e.g., HiveContext, RDD-style ML) |
| Conda environment / `bootstrap.sh` | Fabric Environment item with custom libraries | Recreate conda/pip dependencies in a Fabric Environment |
| `hive-site.xml` (metastore connection) | Not needed: Delta Lake is the metastore in Fabric | Remove metastore config; use Lakehouse schemas for namespace organization |
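
The first row of the table (moving `spark-defaults.conf` properties into `%%configure`) can be sketched as a small converter. This is a rough illustration under stated assumptions: the function names are hypothetical, the list of infrastructure prefixes to drop is a guess to adjust per workload, only whitespace-separated `key value` lines are parsed, and the output uses the `conf` key of the `%%configure` JSON body.

```python
import json

# Assumption: properties under these prefixes describe cluster infrastructure
# (YARN, WASB storage keys) and should not be carried into Fabric.
INFRA_PREFIXES = ("spark.yarn.", "spark.hadoop.fs.azure.", "spark.master")


def parse_spark_defaults(text: str) -> dict[str, str]:
    """Parse whitespace-separated `key value` lines, skipping comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def to_configure_cell(conf: dict[str, str]) -> str:
    """Render application-level properties as a `%%configure` notebook cell."""
    kept = {k: v for k, v in conf.items() if not k.startswith(INFRA_PREFIXES)}
    return "%%configure\n" + json.dumps({"conf": kept}, indent=2, sort_keys=True)


# Hypothetical cluster file used only for illustration.
SAMPLE_DEFAULTS = """# cluster-wide defaults
spark.yarn.queue            etl
spark.sql.shuffle.partitions 200
spark.executor.memory       8g
"""

if __name__ == "__main__":
    print(to_configure_cell(parse_spark_defaults(SAMPLE_DEFAULTS)))
```

For settings shared across many notebooks, the same filtered dictionary could instead be entered in a Fabric Environment item, as the table suggests.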
Must / Prefer / Avoid

Must
- Replace `wasb://` / `wasbs://` paths with OneLake `abfss://` paths or OneLake Shortcuts; `wasb://` requires storage account keys, which are not the Fabric-preferred auth model
- Remove `HiveContext`, `SQLContext`, and standalone `SparkContext()`; use the pre-instantiated `spark` session in Fabric notebooks
- Convert Hive DDL (`STORED AS ORC`, `LOCATION`, `TBLPROPERTIES`) to Delta Lake DDL; see hive-to-delta.md

Prefer
- `notebookutils` for file system operations, secret retrieval, and child notebook orchestration where HDInsight used custom scripts or direct API calls
- Fabric Environment items over `bootstrap.sh`, conda envs, and runtime `%pip install` patterns for production workloads

Avoid
- Calling `SparkContext()` or `HiveContext()` constructors in Fabric notebooks; they conflict with the pre-instantiated `spark` session and will raise errors
- Carrying over `hive-site.xml` or external Hive metastore configuration; Fabric's Delta Lake-backed Lakehouse is the metastore
- `%sh` magic for file system operations in production notebooks; use `notebookutils.fs.*` for portability and OneLake token-based auth

See code-patterns.md for full before/after examples. Key quick references:
Legacy context → Fabric pre-instantiated session

    # HDInsight (remove entirely)
    from pyspark.sql import HiveContext
    hc = HiveContext(sc)

    # Fabric: use the pre-instantiated spark session directly
    df = spark.sql("SELECT * FROM sales.fact_orders")

WASB path → OneLake path (after shortcut creation)

    # HDInsight
    df = spark.read.parquet("wasb://[email protected]/orders/")

    # Fabric (relative path into the attached Lakehouse)
    df = spark.read.parquet("Files/raw/orders/")

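
When many paths need rewriting, the conversion can be scripted. The sketch below is one possible approach, not the method from path-migration.md. It assumes a OneLake shortcut named after the source container has been created under `Files/`. The function name, workspace, lakehouse, and account names are all hypothetical placeholders.

```python
from urllib.parse import urlparse

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def wasb_to_onelake(uri: str, workspace: str, lakehouse: str) -> str:
    """Rewrite wasb[s]://<container>@<account>/<path> as a OneLake abfss:// path.

    Assumption: a shortcut called <container> exists under Files/ in the
    target Lakehouse, so the blob path is appended beneath it.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    container = parsed.username  # the part of the authority before '@'
    if not container:
        raise ValueError(f"WASB URI has no container: {uri}")
    blob_path = parsed.path.lstrip("/")
    return (
        f"abfss://{workspace}@{ONELAKE_HOST}/"
        f"{lakehouse}.Lakehouse/Files/{container}/{blob_path}"
    )


if __name__ == "__main__":
    # Hypothetical container, account, workspace and lakehouse names.
    print(wasb_to_onelake(
        "wasb://raw@exampleacct.blob.core.windows.net/orders/2024/",
        workspace="analytics_ws",
        lakehouse="lh",
    ))
```

Inside a notebook with the Lakehouse attached as default, the shorter relative form (`Files/<container>/<path>`) is equivalent; the fully qualified form is useful for Spark Job Definitions or cross-workspace reads.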
Hive DDL → Delta DDL

    -- HDInsight
    CREATE TABLE sales_db.fact_orders (...) STORED AS ORC LOCATION 'wasb://...';

    -- Fabric
    CREATE SCHEMA IF NOT EXISTS sales_db;
    CREATE TABLE sales_db.fact_orders (...) USING DELTA;

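
For simple tables, the DDL rewrite shown above can be automated with a few regular expressions. This is a deliberately narrow sketch with a hypothetical function name: it handles only `STORED AS <format>`, `LOCATION '<path>'`, a flat `TBLPROPERTIES (...)` clause, and the `EXTERNAL` keyword. Hive-style `PARTITIONED BY (col type)`, `ROW FORMAT`, and `STORED AS INPUTFORMAT ...` clauses are not handled and need manual conversion (see hive-to-delta.md).

```python
import re

# Hive-only clauses to strip; each pattern also consumes its leading whitespace.
_HIVE_CLAUSES = [
    re.compile(r"\s+STORED\s+AS\s+\w+", re.IGNORECASE),
    re.compile(r"\s+LOCATION\s+'[^']*'", re.IGNORECASE),
    re.compile(r"\s+TBLPROPERTIES\s*\([^)]*\)", re.IGNORECASE),
]
_EXTERNAL = re.compile(r"^CREATE\s+EXTERNAL\s+TABLE", re.IGNORECASE)


def hive_ddl_to_delta(ddl: str) -> str:
    """Rewrite a simple Hive CREATE TABLE statement as managed Delta DDL."""
    statement = ddl.strip().rstrip(";").rstrip()
    for clause in _HIVE_CLAUSES:
        statement = clause.sub("", statement)
    statement = _EXTERNAL.sub("CREATE TABLE", statement)
    return f"{statement} USING DELTA;"


if __name__ == "__main__":
    # Hypothetical table definition used only for illustration.
    print(hive_ddl_to_delta(
        "CREATE EXTERNAL TABLE sales_db.fact_orders (id INT, amount DOUBLE) "
        "STORED AS ORC LOCATION 'wasb://...' "
        "TBLPROPERTIES ('orc.compress'='ZLIB');"
    ))
```

Dropping `LOCATION` turns the table into a managed Lakehouse table, so the existing ORC data still has to be loaded separately, for example by reading it through a shortcut and writing it out in Delta format.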
npx claudepluginhub microsoft/skills-for-fabric --plugin fabric-skills