The latest post on Microsoft’s Cortana Intelligence and Machine Learning Blog announces the release of two new utilities intended to boost data science productivity: Interactive Data Exploration, Analysis and Reporting (IDEAR) and Automated Modeling and Reporting (AMAR). Both tools target routine exploration and modeling work that data scientists would otherwise have to code by hand.
IDEAR and AMAR Data Science Productivity tools
IDEAR and AMAR are built around the questions that data scientists typically have to answer whenever they start working with a new dataset:
- What does the data look like? What’s the schema?
- What’s the quality of the data? What’s the severity of missing data?
- How are individual variables distributed? Do I need to do variable transformation?
- How relevant is the data to the machine learning task? How difficult is the machine learning task itself?
- Which variables are most relevant to the machine learning target?
- Is there any specific clustering pattern in the data?
- How will ML models on the data perform? Which variables are significant in the models?
In practice, data scientists have to write code to answer each of these questions, a task that is both critical and time-consuming. The new data science productivity tools are designed to help answer them without that hand-written code.
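To give a sense of the hand-written code involved, here is a minimal sketch that answers just the first two questions (schema and severity of missing data) for a tiny dataset. It uses only the Python standard library; the dataset, the column names, and the `profile` helper are made up for illustration and are not part of IDEAR or AMAR.

```python
import csv
import io

# Hypothetical toy dataset; column names and values are invented for illustration.
RAW = """age,income,city
34,52000,Seattle
,61000,Boston
45,,Seattle
29,48000,
"""


def profile(csv_text):
    """Return, per column, the row count and the fraction of empty cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        # A cell counts as missing when it is empty or absent from the row.
        missing = sum(1 for row in rows if row[column] in ("", None))
        summary[column] = {
            "rows": len(rows),
            "missing_fraction": missing / len(rows),
        }
    return summary


report = profile(RAW)
for column, stats in report.items():
    print(column, stats)
```

Even this small check needs parsing, a definition of "missing", and a summary structure, and it covers only two of the questions listed above.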
What are the Data Science Productivity tools?
IDEAR: IDEAR stands for ‘Interactive Data Exploration, Analysis and Reporting’. The Cortana Intelligence and Machine Learning Blog team describes it as follows:
“IDEAR is one of the Data Science productivity tools that help data scientists explore, visualize and analyze data. The tool also helps provide insights into the data in an interactive manner.”
The features of IDEAR include:
- Automatic Variable Type Detection
- Variable Ranking and Target Leaker Identification
- Visualizing High-Dimensional Data
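As a rough illustration of what automatic variable type detection can involve, the sketch below guesses a column type from its raw string values. It is a simple heuristic written for this article, not IDEAR's actual method; the `detect_type` function and its `max_categories` threshold are assumptions.

```python
def detect_type(values, max_categories=10):
    """Guess 'numeric', 'categorical', 'text' or 'unknown' for raw string values."""
    present = [v for v in values if v != ""]  # ignore missing cells
    if not present:
        return "unknown"
    try:
        for v in present:
            float(v)  # raises ValueError on the first non-numeric value
        return "numeric"
    except ValueError:
        pass
    # Non-numeric columns with few distinct values are treated as categorical.
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"


print(detect_type(["34", "", "45.5", "29"]))
print(detect_type(["Seattle", "Boston", "Seattle"]))
```

A heuristic like this has obvious limits: a 0/1 flag stored as numbers would be labeled numeric even though it is categorical, which is why an interactive tool that lets the user review and correct the detected types is useful.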
AMAR: AMAR stands for ‘Automated Modeling and Reporting’. The team describes it as follows:
“AMAR is a customizable tool to train machine learning models with hyper-parameter sweeping, compare the accuracy of those models, and look at variable importance.”
A parameter input file is used to specify which models to run, what part of the data is to be used for training and testing, the parameter ranges to sweep over, and the strategy for best parameter selection. When the tool completes its run, a standard HTML model report is generated. The report consists of:
- A view of the top few rows of the dataset used for training.
- The training formula used to create the models.
- The accuracy of the various models (AUC, RMSE, etc.), with a comparison between them if multiple models are trained.
- Variable importance ranking.
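The workflow described above (a parameter specification, a train/test split, a sweep over parameter ranges, and selection of the best setting) can be sketched in a few lines. This is a toy illustration under stated assumptions, not AMAR's implementation or file format: the `PARAMS` dictionary stands in for the parameter input file, the data is invented, and the model is a one-feature ridge regression without an intercept.

```python
import math

# Hypothetical parameter spec, standing in for a parameter input file.
PARAMS = {
    "train_fraction": 0.75,                  # share of rows used for training
    "sweep": {"penalty": [0.0, 1.0, 10.0]},  # hyper-parameter values to try
    "selection_metric": "rmse",              # lower is better
}

# Invented toy data, roughly y = 2x.
DATA = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 11.9), (7.0, 14.2), (8.0, 15.8)]


def fit(train, penalty):
    """Closed-form ridge fit of y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + penalty)


def rmse(weight, rows):
    """Root mean squared error of the predictions weight * x."""
    return math.sqrt(sum((y - weight * x) ** 2 for x, y in rows) / len(rows))


METRICS = {"rmse": rmse}


def sweep(data, params):
    """Train one model per penalty value and pick the best on the test split."""
    cut = int(len(data) * params["train_fraction"])
    train, test = data[:cut], data[cut:]
    metric_name = params["selection_metric"]
    score = METRICS[metric_name]
    results = []
    for penalty in params["sweep"]["penalty"]:
        weight = fit(train, penalty)
        results.append({"penalty": penalty, "weight": weight,
                        metric_name: score(weight, test)})
    best = min(results, key=lambda r: r[metric_name])
    return results, best


results, best = sweep(DATA, PARAMS)
for row in results:
    print(row)
print("best:", best)
```

The printed rows play the role of the model comparison in the report. A real tool would add more model types, cross-validation, and variable importance on top of the same loop.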
If you want to try these data science productivity tools, you can clone the GitHub repository. Read more about the IDEAR and AMAR Data Science Productivity tools on Microsoft TechNet.