Every two weeks, the Data Planet brings you the latest and most fascinating news to watch in Data Analytics and BI, offering up quick snackable summaries so you can skip straight to the point.
This week’s edition of the Data Planet includes:
- Why Software is Eating the World
- The Modern Data Stack: Past, Present, and Future
- Software Spotlights:
  - Microsoft Power BI Best Practice Analyzer
  - Dynamic Data Masking with Snowflake
  - spaCy 3.0
  - H2O.ai
Why Software is Eating the World
Marc Andreessen originally wrote this article ten years ago. It has become very well known and spawned a cottage industry of “x is Eating y” articles. Its primary thesis is that anything that can become software eventually will.
Our key takeaways:
- The article’s predictions have held up well over the years, and we wonder whether we can expect the same from artificial intelligence (AI) and data.
- There has been a lot of hype around AI, but we have yet to see it have the same industry-rocking effects that software had.
- What do you think? Connect with us on social media and join the conversation.
Read Andreessen’s original article here
The Modern Data Stack: Past, Present, and Future
This article is written by Tristan Handy, CEO of Fishtown Analytics and one of the makers of dbt. He explores how what he calls the modern data stack has evolved over the last ten years.
Our key takeaways:
- The article posits that cloud data warehouses like Amazon Redshift and Snowflake have loosened constraints that plagued our industry for decades – we now have access to near-infinite compute and storage.
- The second part of the article discusses how, despite significant advances, there is still a large amount of work to be done in data warehouse tooling.
Check out the article for yourself here
Software Spotlight: Microsoft Power BI Best Practice Analyzer
The Best Practice Analyzer in Tabular Editor runs an automated check of a long series of rules against your Power BI tabular model, flagging issues that can hurt model performance.
Our key takeaways:
- The Best Practice Analyzer codifies Power BI and tabular modeling best practices in a single place, alerting you to modeling issues as you develop your model.
- Once you add a few rules, Tabular Editor can scan your entire model against each rule and list the objects that each rule flags.
- This can be done manually but takes far longer – you can save yourself loads of time with the predefined rules outlined in the Microsoft article below. You’ll also find instructions for loading the rules into Tabular Editor, checking details, and fixing issues.
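To make the scanning idea concrete, here is a minimal sketch in plain Python. It is not Tabular Editor's rule format or API: the `Column` and `Rule` classes, the two rules, and the sample model are all hypothetical, invented only to show how a set of rules can be checked against every object in a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Column:
    """A simplified, hypothetical stand-in for a column in a tabular model."""
    table: str
    name: str
    data_type: str
    is_hidden: bool = False


@dataclass
class Rule:
    """A named check; `flags` returns True for objects that break the practice."""
    name: str
    flags: Callable[[Column], bool]


def scan_model(columns: List[Column], rules: List[Rule]) -> Dict[str, List[str]]:
    """Check every column against every rule and list the flagged objects per rule."""
    return {
        rule.name: [f"{c.table}[{c.name}]" for c in columns if rule.flags(c)]
        for rule in rules
    }


# Hypothetical rules and model, invented for this example.
RULES = [
    Rule("Avoid floating-point columns", lambda c: c.data_type == "Double"),
    Rule("Hide key columns", lambda c: c.name.endswith("Key") and not c.is_hidden),
]

MODEL = [
    Column("Sales", "Amount", "Double"),
    Column("Sales", "CustomerKey", "Int64"),
    Column("Customer", "CustomerKey", "Int64", is_hidden=True),
]

report = scan_model(MODEL, RULES)
for rule_name, flagged in report.items():
    print(f"{rule_name}: {flagged}")
```

Keeping each rule as data (a name plus a condition) is what lets a tool run the whole rule set in one pass instead of relying on a manual review.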
Check out the announcement from Microsoft here
Software Spotlight: Dynamic Data Masking with Snowflake
Data masking hides sensitive information from unauthorized users without changing the data underneath. Organizations use it to ensure that data is properly protected to meet strict enterprise and legal requirements. Snowflake has introduced a new Dynamic Data Masking feature that is ideal for those organizations.
Our key takeaways:
- Masking policies can be applied to columns in tables and views, including VARIANT columns that hold semi-structured data in JSON format.
- Data masking on external tables and various additional standard file formats (like CSV, Avro, ORC, and Parquet) is also supported.
- How it works:
  - Masking happens at query runtime, so there’s no need for a second data source to store the masked data.
  - This is a column-level security feature that uses first-class policy objects to mask data selectively.
  - It offers a flexible and extensible policy framework that lets customers define their own authorization logic as declarative policies.
  - A policy can hash values on the fly, or use DECRYPT on data previously encrypted with either ENCRYPT or ENCRYPT_RAW.
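The idea of masking at query runtime can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Snowflake code: in Snowflake, policies are defined in SQL with `CREATE MASKING POLICY`. The role names, the `mask_email` policy, and the `run_query` helper below are hypothetical.

```python
import hashlib

# Hypothetical set of roles allowed to see unmasked values.
PRIVILEGED_ROLES = {"ANALYST_FULL"}


def mask_email(value: str, current_role: str) -> str:
    """Column policy: privileged roles see the value, everyone else sees a hash."""
    if current_role in PRIVILEGED_ROLES:
        return value
    # Hash on the fly; the stored value itself is never changed.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def run_query(rows, policies, current_role):
    """Return a copy of the rows with column policies applied at read time."""
    return [
        {
            col: policies[col](val, current_role) if col in policies else val
            for col, val in row.items()
        }
        for row in rows
    ]


# A single stored table; no second, pre-masked copy is needed.
TABLE = [{"id": 1, "email": "ada@example.com"}]
POLICIES = {"email": mask_email}

print(run_query(TABLE, POLICIES, "ANALYST_FULL"))  # original email
print(run_query(TABLE, POLICIES, "PUBLIC"))        # hashed email
```

Because the policy is evaluated per query and per role, the same table serves both audiences, which is the point of the first bullet above.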
Snowflake explains how to apply a masking policy on semi-structured data in this article
Software Spotlight: spaCy 3.0
spaCy is an open-source Python library for natural language processing. It recently launched version 3.0, which contains a wide variety of functions useful for analyzing text and prepping it for machine learning.
Our key takeaways:
- The main goal of the update was to make it easier to bring your own models into spaCy, especially state-of-the-art models like transformers.
- spaCy offers some powerful functions:
  - Part-of-speech (POS) Tagging: assigning word types to tokens, like verb or noun.
  - Lemmatization: assigning the base forms of words. For example, the lemma of was is be and the lemma of rats is rat.
  - Named Entity Recognition (NER): labelling named “real-world” objects like persons, companies, or locations.
  - Similarity: comparing words, text spans, and documents and how similar they are to each other.
- Use the new installation quickstart widget to find detailed installation instructions for your platform and setup.
If you haven’t done any work with natural language processing, then spaCy is a good place to start. It can be overwhelming at first, but there are plenty of guides and tutorials out there to get you started.
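As a starting point, here is a minimal sketch of how POS tags, lemmas, and named entities can be read from a spaCy document. It assumes spaCy and the small English pipeline `en_core_web_sm` are installed; the helper names `token_table`, `entity_table`, and `analyze` are our own, not part of spaCy.

```python
def token_table(doc):
    """Return (text, part-of-speech tag, lemma) for each token in a spaCy Doc."""
    return [(token.text, token.pos_, token.lemma_) for token in doc]


def entity_table(doc):
    """Return (text, entity label) for each named entity found in the Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def analyze(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and return its token and entity tables."""
    # Imported here so the helpers above can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)  # assumes this pipeline has been downloaded
    doc = nlp(text)
    return token_table(doc), entity_table(doc)
```

Calling `analyze` on a sentence of your own returns both tables. The exact tags and entities depend on the pipeline you load, so treat the output as model predictions rather than ground truth.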
Learn more here.
Software Spotlight: H2O.ai
H2O.ai is a machine learning platform company. They have many products, but their two flagships are the open-source H2O library and the Driverless AI package.
Our key takeaways:
- The H2O library is open-source and works with both Python and R.
- Driverless AI is a nifty automated machine learning solution, but it comes with a hefty price tag.
- Their biggest differentiator is that they can compile machine learning models into Java files which can be embedded into applications.
Learn more here