Machine learning doesn't just deliver insights in Power BI reports: it now also helps get at the large amounts of data behind those insights more quickly.
The goal of Power BI (and any business intelligence tool) is to replace the hunches and opinions businesses use to make decisions with facts based on data. That means the insights in that data have to be available quickly, so you can pull up a report while people are still discussing what it covers, not five minutes later when everyone has already made up their mind. To make that happen even with large data sets, wherever they’re stored, Microsoft now uses machine learning to tune how the data gets accessed.
When you have enough data to make decisions with, you need to consolidate and summarize it while still keeping the original dimensions, so you can look at total sales combined across all departments for an overview, then slice by region or month to compare trends. Most Power BI users need these aggregated queries, Amir Netz, CTO of Microsoft Analytics, told TechRepublic.
“They don’t care about the individual tickets on the plane or the orders in the supermarket; they want to slice and dice data at an aggregated level.”
Those aggregated queries need to scan a lot of data, but what they produce is very condensed, he explained. “I can scan 250 billion rows of data if I ask for sales by month by geography; the results, even though there are 250 billion rows underneath, sales by month by geography will have maybe 1,000 rows in it. So it’s a huge reduction in volume.”
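The reduction Netz describes is what any group-by aggregation does: many detail rows collapse into one row per distinct combination of the grouping columns. A minimal sketch, with invented toy data standing in for the billions of fact rows:

```python
# Sketch of the volume reduction an aggregate query performs: each distinct
# (month, geography) pair produces exactly one output row, however many
# detail rows feed into it. Data is illustrative only.
from collections import defaultdict

detail_rows = [
    # (month, geography, sale_amount) -- stand-in for the fact table
    ("2021-01", "US", 120.0),
    ("2021-01", "US", 80.0),
    ("2021-01", "EU", 50.0),
    ("2021-02", "US", 200.0),
]

totals = defaultdict(float)
for month, geo, amount in detail_rows:
    totals[(month, geo)] += amount  # one output row per distinct group

print(len(detail_rows), "->", len(totals))  # 4 detail rows -> 3 aggregate rows
```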
Speeding up the speed-up
If the data getting aggregated is billions of rows, you probably want to leave it in your data warehouse rather than copying it into Power BI, but that can make query performance much slower as you wait for the data to be queried, loaded and aggregated. Querying and aggregating 3 billion rows in 30 seconds might not seem long, but you have that delay every time you change how you want to slice the data. “That’s going to get on the user’s nerves; waiting 30 seconds for every click is very disruptive.”
The solution is to create the data aggregations in advance so Power BI can keep them in memory. “If I have that aggregate ready, then getting the results from that aggregate is way faster than trying to go all the way down to the bottom, where all the masses of data are and aggregate the whole 250 billion rows. Being able to create those aggregates is key to basically speeding up queries.”
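The speed-up comes from turning a scan into a lookup: if the aggregate already sits in memory, answering the query means reading a few pre-summed rows instead of re-aggregating the whole fact table. A hedged sketch of that fast path / slow path split (the names and data are invented, not Power BI's internals):

```python
# Sketch of why a pre-built aggregate answers faster than a full scan:
# the fast path is a lookup into a small in-memory table; the slow path
# is a pass over every underlying detail row. Illustrative only.
precomputed = {
    ("2021-01", "US"): 200.0,   # aggregate built in advance
    ("2021-01", "EU"): 50.0,
}

def sales(month, geo, detail_rows=()):
    key = (month, geo)
    if key in precomputed:              # fast path: aggregate in memory
        return precomputed[key]
    # slow path: aggregate the raw fact rows on demand
    return sum(a for m, g, a in detail_rows if (m, g) == key)
```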
But knowing which aggregates to create in advance isn’t obvious: It requires analyzing query patterns and doing a lot of query optimization to find out which aggregates are used frequently. Creating aggregations you don’t end up using is a waste of time and money. “Creating thousands, tens of thousands, hundreds of thousands of aggregations will take hours to process, use huge amounts of CPU time that you’re paying for as part of your license and be very uneconomic to maintain,” Netz warned.
To help with that, Microsoft turned to some rather vintage database technology dating back to when SQL Server Analysis Services relied on multidimensional cubes, before the switch to in-memory columnar stores. Netz originally joined Microsoft when it acquired his company for its clever techniques around creating collections of data aggregations.
“The whole multidimensional world was based on aggregates of data,” he said. “We had this very smart way to accelerate queries by creating a collection of aggregates. If you know what the user queries are, [you can] find the best collection of aggregates that will be efficient, so that you don’t need to create surplus aggregates that nobody’s going to use or that are not needed because some other aggregates can answer [the query]. For example, if I aggregate the data on a daily basis, I don’t need to aggregate on a monthly basis because I can answer the aggregates for months from the aggregates for the day.”
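Netz's daily-versus-monthly example is the classic rollup property: a finer-grained aggregate can answer any query at a coarser grain, so the coarser aggregate is redundant. A small sketch with invented figures:

```python
# Sketch of Netz's example: a daily aggregate can answer monthly queries,
# so a separate monthly aggregate is unnecessary. Data is illustrative.
from collections import defaultdict

daily_sales = {            # pre-built aggregate at day granularity
    "2021-01-05": 100.0,
    "2021-01-20": 150.0,
    "2021-02-03": 75.0,
}

monthly = defaultdict(float)
for day, total in daily_sales.items():
    monthly[day[:7]] += total   # roll each day up into its month ("YYYY-MM")
```

The same logic is why an aggregate-selection algorithm can prune candidates: any granularity derivable from one it already keeps adds no coverage.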
Netz said it’s key to find the unique collection of aggregates that’s “optimal for the usage pattern.” That way, you don’t create unnecessary aggregates.
Now those same techniques are being applied to the columnar store that Power BI uses, by collecting the queries generated by Power BI users, analyzing what level of aggregate data would be needed to answer each query and using machine learning to solve what turns out to be a classic AI optimization problem.
“We have these tens and hundreds of thousands of queries that users have been sending to the data set and the system has the statistics that 5% of the queries are at this level of granularity and another 7% are at this other level of granularity. It automatically analyses them using machine learning to say ‘what is the optimal set of aggregates to give you the best experience possible with a given set of resources?'”
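The optimization Netz sketches can be framed as: given statistics on how often each granularity is queried and what each aggregate costs to build and maintain, pick the set that covers the most query traffic within a resource budget. A hedged sketch using a simple greedy heuristic; the candidate list, frequencies and costs are invented, and Microsoft's actual machine-learning approach is certainly more sophisticated:

```python
# Greedy stand-in for the aggregate-selection optimization: take the
# candidates with the best query-coverage-per-cost ratio until the
# resource budget is spent. All numbers are illustrative.
candidates = {
    # granularity: (fraction of queries it answers, build/maintain cost)
    ("month", "geography"): (0.05, 2),
    ("day", "product"):     (0.07, 5),
    ("month", "product"):   (0.03, 3),
}

budget = 7
chosen, spent = [], 0
for gran, (freq, cost) in sorted(candidates.items(),
                                 key=lambda kv: kv[1][0] / kv[1][1],
                                 reverse=True):
    if spent + cost <= budget:   # only add aggregates the budget allows
        chosen.append(gran)
        spent += cost
```

A real selector would also apply the rollup pruning above, dropping candidates that a finer-grained choice already answers.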
“As users are using the system, the system is learning: what is the most common data set they are using, what are the most common queries being sent. We always try to anticipate what the user is going to try to do next and make sure that we have the data in the right place at the right time in the right structure, ahead of what they asked for, and even execute queries ahead of time for them. When they come in, their query is already laid out so they don’t have to wait for those queries to be executed. We can do predictive execution of those queries using AI and machine learning.”
The difference can be dramatic, as Microsoft demonstrated using the public dataset of New York taxi journeys stored as three billion rows of data in Azure Synapse. Without automatic aggregation, queries take around 30 seconds each; once the AI has optimized the collection of aggregates stored, they drop to just over a second. For one customer with a data warehouse of about 250 billion rows, turning the feature on improved median query time by a factor of 16. “These are big heavy duty queries that we can accelerate at 16x,” Netz told us.
Make your own trade-offs
If users start looking for different insights in the data and Power BI needs different aggregates to optimize them, it will retune the set of aggregates to match. That happens automatically because old queries age out of the system, although you can choose how often to redefine the aggregates if the way you use data changes frequently.
“The assumption is that the same query is being used again and again so we’ll see it in the newer window of time. But if the patterns have really changed, if people realize the reports are irrelevant and they really need to look at the data differently, the system will realize that those queries that were sent a month ago are not being used anymore.”
Using a rolling window for queries means someone experimenting with different queries won’t cause aggregations to be thrown away and then re-created. “It’s a gradual, not an abrupt, process of aging, because the system needs to know if this is a fleeting moment or really a pattern that is being established.”
When you turn on automatic aggregation in the dataset settings, Power BI will make its own decisions about how many resources to use for optimizing query performance.
“In a world where resources are infinite I could have created an aggregate for every possible query the system would ever imagine seeing, but the number of combinations isn’t based on the number of attributes and dimensions of the table that you have; it’s actually factorial. Your data is so rich, there are so many attributes to everything that’s not a possibility. The system has to make intelligent selections to make sure that it doesn’t go into infinite resources.”
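To see the blow-up Netz alludes to, consider only the choice of which subset of dimensions to group by, ignoring the hierarchy levels (day/month/year and so on) within each dimension, which multiply the count further. Even that simplified count doubles with every dimension added:

```python
# Sketch of the combinatorial explosion in candidate aggregates: counting
# just the non-empty subsets of dimensions to group by gives 2**n - 1
# candidates. The dimension names are illustrative.
from itertools import combinations

dims = ["date", "geography", "product", "customer", "channel", "store"]

candidates = sum(1 for r in range(1, len(dims) + 1)
                 for _ in combinations(dims, r))
print(candidates)  # 2**6 - 1 = 63 for just six dimensions
```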
But if you want to tune those trade-offs, you can drag a slider to cache more queries, at the cost of more storage space. A chart shows you what percentage of queries will run faster than the SLA you’ve set and how much more space that takes up. Going from caching 75% to 85% of queries might mean 90% of queries come in faster, but it might also mean maintaining 100 aggregations rather than 60 or 70. Go up to 100% of queries and you’ll need thousands of aggregations. “Every obscure query will be covered but you’re spending a lot of CPU maintaining those aggregates.”
The slider lets you make that choice. “Maybe the user says I’m willing to pay more resources because the value I put on performance is higher than the default of the system, so let me determine that.”
But users also like the feeling of being in control rather than seeing the optimization as a black box, even if they end up putting it back to the original default. “It helps them understand what’s going on behind the scenes,” Netz said—something that’s important for making people comfortable with AI tools.