
Posts

Power Query: Transforming YYYYMM dates (the quick way)

Accountants. Their unit of work seems to be the month, as if individual days don't exist, or somehow don't count. Nowhere is this better seen than in the notion of the accounting period, which all too often follows the form YYYYMM. Try converting this directly into a date and Power Query starts making excuses faster than a kid with his hand caught in the cookie jar. The quick solution is to understand what Power Query's Table.TransformColumns does, and then leverage that knowledge to transform your YYYYMM values into proper date-typed columns.

Table.TransformColumns

As its name suggests, this handy function allows you to convert the contents of a column from one type to another. The basic syntax is:

= Table.TransformColumns( #"Your Source Table", { a list of tuples, specifying the columns and the functions to apply to each value } )

Lists {denoted by curly braces} are something you need to get comfortable with if you
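For illustration, here is a minimal sketch of the technique in M. The step name #"Your Source Table" and the column name AccountingPeriod are placeholders for this example, not names taken from the original post:

```m
// A minimal sketch, assuming a column named "AccountingPeriod" holding
// YYYYMM values (e.g. 201703) as text or whole numbers.
let
    Source = #"Your Source Table",   // hypothetical previous step
    ToDate = Table.TransformColumns(
        Source,
        {
            // { column name, transform function, optional new column type }
            { "AccountingPeriod",
              each #date(
                  Number.FromText(Text.Start(Text.From(_), 4)),    // year  = first 4 characters
                  Number.FromText(Text.Range(Text.From(_), 4, 2)),  // month = next 2 characters
                  1                                                 // day   = 1st of the month
              ),
              type date }
        }
    )
in
    ToDate
```

Each inner list pairs a column name with the function applied to every value in that column, and the optional third element stamps the resulting column with a proper type.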
Recent posts

Power Query: Converting Monetary Values

Power Query has quickly become my favorite data transformation tool because of its fluid, interactive approach to describing data transformations. It isn't just easier to use than traditional ETL tools, it's also significantly more powerful and a lot more forgiving.

Perhaps too forgiving... Recently I worked on a solution for processing disparate spreadsheets that we receive from our partners. In theory it's simple enough: combine the data tables, clean up the data, spit out the result. The problem came in when we checked the resulting data and discovered that the monetary totals were way out of whack, yet Power Query hadn't reported any errors in the data it had processed. Consider the M query below, which imports the source data from a spreadsheet - take a close look at the ClaimAmount and VATOnClaimAmount columns. See anything wrong? At first the values appear to be normal monetary values, but on closer inspection we see that the value 131.58 is
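The excerpt cuts off before the fix, but the usual culprit behind silently skewed monetary imports is an implicit, locale-dependent type conversion. A sketch of forcing the conversion explicitly is shown below; the column names come from the post, while the step name and the "en-US" culture are assumptions:

```m
// A sketch only: convert the monetary columns explicitly with a stated
// culture instead of relying on Power Query's implicit typing.
let
    Source = #"Imported Spreadsheet Data",   // hypothetical previous step
    Typed  = Table.TransformColumnTypes(
        Source,
        {
            { "ClaimAmount",      Currency.Type },
            { "VATOnClaimAmount", Currency.Type }
        },
        "en-US"   // the culture controls how decimal separators are parsed
    ),
    // Surface any conversion failures rather than letting them slip through.
    BadRows = Table.SelectRowsWithErrors(Typed, {"ClaimAmount", "VATOnClaimAmount"})
in
    Typed
```

Inspecting the BadRows step during development makes it obvious when a source spreadsheet uses a different decimal convention than the query expects.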

SQL Server vs Azure Data Warehouse vs Netezza vs Redshift

The Great Database Shoot Out

In Jan '17, I teamed up with Marc van der Zon (Data Scientist) to test and compare several database technologies for our organization's future analytics and BI platform. The technologies that made the shortlist were:

SQL Server, because it is the organization's existing BI platform.
Azure Data Warehouse, because of its high similarity to SQL Server.
Amazon's Redshift, because of its attractive cost, scalability and performance.
Netezza, because it is an affordable on-prem appliance that performs well.

Considerations

We were primarily looking for the best bang-for-buck option, so price and performance were viewed as more important than how feature-complete any particular option was. That said, what we regarded as important included:

A shallow learning curve for BI developers (i.e. no need for expert skills in order to get good performance)
Ability to re-use existing code (i.e. ANSI-92 SQL compatibility)
Time-to-solution

Reading Zip files in PowerQuery / M

Being a fan of PowerBI, I recently looked for a way to read zip files directly into the Data Model, and found this blog which showed a usable technique. Inspired by the possibilities revealed in Ken's solution, but frustrated by slow performance, I set out to learn the M language and write a faster alternative.

UnzipContents

The result of these efforts is an M function - UnzipContents - that you can paste into any PowerBI / PowerQuery report. It takes the contents of a ZIP file and returns a list of the files contained therein, along with their decompressed data. If you're not sure how to make this function available in your document, simply:

Open up PowerQuery (either in Excel or in PowerBI).
Create a new Blank Query.
Open up the Advanced Editor (found on the View tab in PowerBI).
Copy-paste the above code into the editor, then close the editor.
In the properties window, rename the function to UnzipContents.

Usage

Using the function is fairly straightforward
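As a rough usage sketch (not the original post's example): assuming the function returns a table with a file name column and a binary content column, you could feed it a zip file and parse one of the decompressed entries. The path and the column names FileName/Content are assumptions:

```m
// A usage sketch, assuming UnzipContents has been created as described above.
let
    ZipBinary = File.Contents("C:\Data\archive.zip"),   // hypothetical path
    Files     = UnzipContents(ZipBinary),
    // Take the first file in the archive and parse its decompressed binary as CSV.
    FirstFile = Files{0}[Content],                       // column name is an assumption
    AsCsv     = Csv.Document(FirstFile)
in
    AsCsv
```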

Virtuous Designs for Tabular Data Modelling

Much of the beauty of Microsoft's Tabular model is the apparent ability to escape the weeks of star-schema modelling that are commonplace in OLAP cubes. While Tabular can be blazingly fast both to develop models for and to use, the performance of the Vertipaq engine varies massively depending on how you present your data to it. Below are several data modelling patterns you are likely to encounter:

The Monolithic table design involves joining all source tables together into a single denormalized representation. Tabular is able to group/aggregate and filter rows easily in this model, so while care needs to be taken when writing DAX expressions, the resulting cube will perform well.

Pros: Easy to get started. Performs well.
Cons: DAX expressions trickier to write. Cube loading times may suffer. Only similar-grained data can be accommodated.

When facts are derived from disparate sources, a monolithic design is not practical. In this case, multiple fact table

Cloning SQL tables

Plenty of folks have blogged about various techniques for cloning tables in SQL Server, and for good reason... during data loading and data processing it's very useful to be able to build one table while simultaneously reporting off of another. When the processing of the new table is completed, it can be switched in to replace the data of the old table. To simplify the creation of a build table, I've written a stored procedure which will take any table and clone it, along with its indexes:

Revisiting Earned Premium

In a previous post about earned premium, I outlined how you could use DAX to calculate a monetary value based on the period of time over which it was earned. Serendipitously, the next day a colleague forwarded Alberto Ferrari's paper on understanding DAX query plans, and after giving it a thorough read I fired up the query profiler and set out to optimize our calculated measure for earned premium. Alberto's paper details a performant solution to the classic events-in-progress problem, of which earned premium is a close cousin. My excitement at lazily shoplifting Alberto's work came to a grinding halt when I discovered that his 40ms Jedi solution only worked if data was queried at a specific granularity. This wasn't going to cut it... we needed an earned premium measure that works at any level of aggregation. Back to the drawing board. It turns out that much of Alberto's advice is (as always) really valuable. While I strongly recommend reading Alberto's paper