Weaving data analytics with Microsoft Fabric

In spring 2023, Microsoft launched Fabric, a new all-in-one analytics solution for the AI era. Available to businesses, it covers everything from data movement to data science, Real-Time Analytics, and business intelligence, offering a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place.

Fabio Hepworth, Head of Data and Integration in our Business Solutions Team, tells us what he’s learned so far.


My recent travels with Microsoft Fabric (referred to as Fabric from here on), Microsoft’s new all-in-one analytics solution for the AI era, have been an interesting journey. Like most nerdy techies I love a shiny new tool in the toolbox, but I have been around long enough to have an (un)healthy dose of cynicism.

Setting the scene

My current project is migrating a large data warehouse from on-premises SQL Server into a Fabric Warehouse. The workload is large enough (over 2.5TB) that running it in SQL Server is a challenging proposition. Moving it onto an analytical toolset/platform should reduce cost and increase performance.

The Microsoft documentation points to this being a smooth process, as Fabric warehouses use the TSQL language, enabling a ‘lift and shift’ approach. Broadly speaking this has been the case, resulting in a straightforward and relatively quick migration:

  • SSIS packages translated into Data Factory Pipelines and Dataflow Gen2.
  • SQL tables and stored procedures kept as is (minor changes to data types and temp tables).
  • Migration of data via CSV/Parquet exports.
  • Approximately 2.5 terabytes of data, mostly in one fact table.

The project’s objectives are:

  • Migrate the warehouse to Fabric with little or no impact on downstream reports/processes.
  • Reduce operational costs by taking advantage of cloud storage/compute.
  • Improve performance and reliability of ETL and warehouse.

Why Warehouse over Lakehouse?

The first question we asked was: which approach is best for this project, Warehouse or Lakehouse? We chose Warehouse because it uses TSQL as its primary language, making it easier for the in-house team to adopt and maintain the final solution.

Ultimately, the decision comes down to which language the team developing and maintaining the solution is most proficient or comfortable with:

  • If you have data engineers with TSQL experience, lean towards Warehouse.
  • If you have data analysts/scientists with Spark experience, lean towards Lakehouse.

The choice shouldn’t get in the way though, as all Fabric tools can read from both Warehouse and Lakehouse thanks to the shared underlying storage format (Delta Lake Parquet). So, your data scientists can use Spark on a Warehouse. The bonus here is that it takes no effort to get these tools to work together; they ‘just do’.
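As a taste of how little glue is needed, here is a minimal sketch (all names hypothetical) of a TSQL query run against the Warehouse that joins one of its own tables to a table maintained in a Lakehouse in the same workspace, assuming cross-database querying with three-part naming:

    -- SalesLakehouse is a Lakehouse in the same workspace as the Warehouse.
    -- Because both sit on the same Delta Lake Parquet storage, the Warehouse
    -- engine can read the Lakehouse table directly -- no copy required.
    SELECT TOP 100
        f.OrderKey,
        f.OrderAmount,
        d.CustomerSegment                                 -- maintained by the data science team
    FROM dbo.FactOrders AS f                              -- Warehouse fact table
    JOIN [SalesLakehouse].[dbo].[DimCustomerScored] AS d  -- Lakehouse Delta table
        ON d.CustomerKey = f.CustomerKey;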

Is this expensive?

Cost has been the biggest draw for me to Fabric. We have been using Azure Synapse Workspace and, while it’s a fantastic tool, it has always seemed too expensive for smaller clients.

Fabric is costed via capacity, like Power BI; however, it comes in some very small SKUs, which will let us bring the same enterprise-ready tools to our smaller clients. The smallest SKU costs £200 per month, and that is if it is running 24/7.

Things to get used to

At the core, Fabric Warehouse has a different engine to SQL Server, which gives us ‘normal’ SQL developers a learning curve (at least at the time of writing):

  • No temp tables, indexes, or computed columns.
  • Primary/foreign keys not enforced.
  • Some data type differences, no Unicode support.

Knowing that the storage is Delta Lake Parquet and that processing happens in parallel, these limitations do make sense and haven’t hindered us.
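To make that concrete, here is a minimal sketch of the kind of change we made when porting staging logic inside a stored procedure. The table and column names are hypothetical; the type swaps reflect the Warehouse restrictions above at the time of writing.

    -- Original SQL Server pattern: a #temp staging table with SQL Server types.
    -- CREATE TABLE #OrderStaging (
    --     OrderKey    int IDENTITY(1,1),
    --     OrderDate   datetime,
    --     CustomerRef nvarchar(50),
    --     Amount      money
    -- );

    -- Fabric Warehouse version (at the time of writing): a regular table stands
    -- in for the temp table, and the types move to supported equivalents.
    CREATE TABLE dbo.OrderStaging
    (
        OrderKey    int            NOT NULL,  -- IDENTITY not available; keys generated in the load
        OrderDate   datetime2(6)   NULL,      -- datetime -> datetime2 (precision capped at 6)
        CustomerRef varchar(50)    NULL,      -- nvarchar -> varchar (no Unicode types)
        Amount      decimal(19,4)  NULL       -- money    -> decimal
    );

    -- Dropped explicitly at the end of the procedure instead of falling out of
    -- scope like a temp table would.
    DROP TABLE dbo.OrderStaging;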

The other challenge was getting files into OneLake to then be processed by the pipelines. OneLake is based on Azure Data Lake Storage Gen 2, which means anything that can talk to ADLS Gen 2 should be able to talk to OneLake. In practice, there are some subtle URL differences that mean most tools can’t yet handle OneLake. Lower-level tools (AzCopy.exe, the Azure Storage SDK) have no issue though, so we are still able to upload files. In time, Power Automate, Logic Apps, SSIS, etc. will catch up.

Takeaways so far

Our experience has shown that SSIS and SQL workloads can migrate smoothly to Fabric, and that the additional features have enabled some efficiencies.

Creating tables, views, and stored procedures in the Warehouse was straightforward because the definitions are so similar to SQL Server.

Knowledge sharing with the in-house BI team has been smooth thanks to the familiar TSQL language. This isn’t finished yet, but so far so good.

Fabric has a very cohesive/integrated working environment. The interface can be switched into different personas depending on what the end user is doing: data engineering, data science, reporting, etc.

The Fabric Workspace can be seen through different ‘personas’ to tailor the experience.

Fabric has performed better than the source SQL Server, showing a performance increase of 20% to 45%.

Fabric has only recently come out of preview. Our research and work so far have shown it is ready for prime time. It is built upon pre-existing foundations (Azure Synapse Workspace and Power BI) with enough new features, and an exciting future roadmap, that we are happy to recommend it to current and future customers. By comparison, Azure Synapse Workspace now seems the poor cousin.

Migration of data has also been smooth, with Fabric providing multiple options to meet your requirements. We chose export/import through files to ensure the timing and load on the production warehouse were kept to a minimum; the load step is sketched after the list below.

  • Pull in with Data Factory or Data Flows.
  • Push in with SSIS.
  • Push in with Synapse Link for SQL.
  • Export/import processes via CSV/Excel/Parquet files.
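For the file route, the Warehouse COPY INTO statement did the heavy lifting on the load side. A minimal sketch, assuming the exports land in an ADLS Gen 2 container and are authorised with a SAS token; the storage account, path, table name, and secret are all placeholders.

    -- Load the exported Parquet files for the big fact table into the Warehouse.
    -- <storageaccount>, <container>, the path, and <sas-token> are placeholders.
    COPY INTO dbo.FactSales
    FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/export/FactSales/*.parquet'
    WITH (
        FILE_TYPE = 'PARQUET',
        CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
    );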

There is very good support for programmatic access to export/import/upload of data into Fabric. We didn’t use this extensively but did perform research to understand the ‘art of the possible’.

Fabric has recently released ‘No Copy Table Clone’, which is going to be amazing for testing. It allows a table to be cloned near instantaneously and updated without affecting the original. Testing the impact of changes on large, multi-million-row tables just became a lot easier and safer.
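As a flavour of how lightweight this is, cloning is a single TSQL statement (table and column names hypothetical):

    -- Near-instant clone of the fact table for testing; the clone shares the
    -- underlying Parquet files, so no data is copied at creation time.
    CREATE TABLE dbo.FactSales_Test AS CLONE OF dbo.FactSales;

    -- Changes made to the clone do not touch dbo.FactSales.
    UPDATE dbo.FactSales_Test
    SET Amount = Amount * 1.05
    WHERE SaleYear = 2023;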