
Streamlining data science with open source: Data version control and continuous machine learning

03:25  04 April 2021  Source: zdnet.com

MLOps, short for machine learning operations, is the equivalent of DevOps for machine learning models: Taking them from development to production, and managing their lifecycle in terms of improvements, fixes, redeployments, and so on.

CML is an open source project that aims to help facilitate the machine learning workflow

Getting MLOps right is a major hurdle on the way to getting value out of machine learning and data science. Version control systems like Git and practices like continuous integration / continuous deployment (CI/CD) have helped operationalize software development.

What if those systems and practices could also be used for MLOps? Iterative.ai wants to address this question with open source projects Data Version Control and Continuous Machine Learning.

Bringing version control to machine learning

Data engineers and machine learning and data science practitioners work with a wide range of data. They need a workflow, and tools to support it, so they can keep track of their artifacts and versions, resolve issues, and collaborate across teams and systems.

Iterative.ai is an MLOps company dedicated to streamlining the workflow of data scientists. Today they announced the latest releases of Data Version Control (DVC) and Continuous Machine Learning (CML) open-source projects.

Iterative.ai claims DVC and CML remove the need for proprietary AI platforms by extending traditional software tools like Git and CI/CD to meet the needs of machine learning engineers. ZDNet connected with Dmitry Petrov, CEO and founder of Iterative.ai, to find out more about DVC and CML.

The goal of DVC is to bring agility, reproducibility, and collaboration into existing data science workflows. DVC provides users with a Git-like interface for versioning data and models, bringing version control to machine learning to address the challenges of reproducibility.

DVC is built on top of Git, allowing users to create lightweight metafiles and enabling the system to handle large files, rather than storing them in Git. It works with remote storage for large files in the cloud or on-premise network storage.
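The metafile idea can be illustrated with a minimal sketch. This is not DVC's implementation or file format: the function names, the `.meta.json` suffix, and the use of SHA-256 are assumptions made for illustration. The large file is copied into a content-addressed cache (standing in for remote storage), and only a small pointer file would be committed to Git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Store data_path in the cache under its hash and write a small metafile.

    Hypothetical helper: the metafile layout is invented for this sketch.
    """
    digest = file_digest(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copy2(data_path, cached)
    meta = {"path": data_path.name, "sha256": digest,
            "size": data_path.stat().st_size}
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def restore_file(meta_path: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its metafile from the cache."""
    meta = json.loads(meta_path.read_text())
    target = meta_path.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_bytes(b"a,b\n1,2\n")
        meta = track_file(data, root / "cache")
        data.unlink()  # simulate a fresh checkout that has only the metafile
        print(restore_file(meta, root / "cache").read_bytes())
```

Because the pointer file is a few lines of text, it diffs and merges like any other source file, while the data itself stays outside the repository.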

CML is an open-source library for implementing continuous integration and delivery (CI/CD) in machine learning projects. Users can automate parts of their development workflow, including model training and evaluation, comparing machine learning experiments across their project history, and monitoring changing datasets. CML will also auto-generate reports with metrics and plots in each Git pull request.
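As a rough illustration of what an auto-generated metrics report in a pull request could contain, the sketch below builds a Markdown table comparing two runs. It does not use CML's API: `metrics_report` and the metric names are hypothetical, and a real pipeline would read the metrics from files produced by the training job.

```python
from typing import Mapping


def metrics_report(baseline: Mapping[str, float],
                   candidate: Mapping[str, float]) -> str:
    """Render a Markdown table of metrics present in both runs."""
    lines = [
        "| metric | baseline | candidate | change |",
        "|---|---|---|---|",
    ]
    for name in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[name], candidate[name]
        lines.append(f"| {name} | {old:.4f} | {new:.4f} | {new - old:+.4f} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metrics for the main branch and for a pull request.
    main_run = {"accuracy": 0.90, "loss": 0.30}
    pr_run = {"accuracy": 0.92, "loss": 0.25}
    print(metrics_report(main_run, pr_run))
```

A CI job could post text like this as a comment on the pull request, so reviewers see the effect of a change on the model alongside the code diff.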

Projects and products

That sounds almost too good to be true: fully open source projects that deliver that kind of functionality and value? Great, but what's the catch, and for whom? Are the projects really open source, or maybe open core -- i.e., are there proprietary parts? And what is iterative.ai's business model?

A hosted service (SaaS offering) for DVC and CML looks improbable at first blush. As Petrov noted, there is no such thing as hosted DVC or CML because they are distributed and on-premise by design like Git or Terraform. The business model, Petrov went on to add, is similar to HashiCorp:

"We build open-source tools and give them to practitioners for free. We build DVC and CML while HashiCorp builds Terraform, Vault, and others. Monetization comes from enterprise scenarios (better data access control, security, integrations, team collaboration, etc). Those are separate products on top of DVC and CML."

DVC is an open source project that aims to help data engineers and machine learning practitioners use version control for their projects

The other thing that struck us about the combination of DVC and CML is that they seem to pack a lot of functionality, which is actually quite complex. Most software developers, for example, don't use Git through the command line, but rather via IDEs - visual tools for software development that integrate version control on top of Git.

It turns out there is an analogy here. Iterative.ai also offers DVC-Studio, which packs UI and collaboration features on top of DVC and CML. Petrov likened this to Git + GitHub. DVC-Studio is not open source, and not officially released yet either:

"Today people use DVC and CML as-is, and it's mostly a command-line experience. Without Studio, these two are still functional. Like Git and GitHub - you don't need GitHub or GitLab to use Git, but it is nice to have," said Petrov.

From a community to the enterprise

How many people actually use DVC and CML as-is today? Quite a lot, it would seem. Iterative.ai counts 400+ companies, 4,000+ community members, plus 200+ contributors and 7,000+ GitHub stars. Petrov also mentioned an additional 2,000+ users for DVC.

Petrov, a computer science Ph.D., is a data scientist himself, previously at Microsoft - Bing. DVC was his pet project when he started it in 2017 before he incorporated iterative.ai with co-founder and ex-colleague Ivan Shcheklein.

As for today's announcement, Petrov highlighted lightweight machine learning experiments as the major feature in DVC 2.0. DVC is great for making machine learning projects reproducible, but it creates some overhead, as a Git commit is needed for each step or experiment.

iterative.ai's product offering, based on DVC and CML

DVC 2.0 simplifies and automates this experience. Machine learning experiments can now be created in a single command and be fully reproducible, Petrov said. Another step toward experimentation is machine learning model checkpoints and live metrics or logs.

These two are important for deep learning scenarios when you need to track the machine learning training process and use not the latest model but one of the previous models (checkpoints), Petrov added.
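The point about using an earlier checkpoint rather than the latest one can be sketched as follows. This is a generic illustration, not DVC's checkpoint mechanism: the `Checkpoint` record, the file paths, and the choice of validation loss as the selection criterion are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """One saved model state plus the metric logged when it was saved."""
    epoch: int
    val_loss: float
    path: str  # hypothetical location of the saved weights


def best_checkpoint(history: List[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest validation loss."""
    if not history:
        raise ValueError("no checkpoints recorded")
    return min(history, key=lambda ckpt: ckpt.val_loss)


if __name__ == "__main__":
    # Made-up training history: the loss rises again after epoch 3,
    # so the latest checkpoint is not the best one.
    history = [
        Checkpoint(epoch=1, val_loss=0.90, path="ckpt/epoch1.pt"),
        Checkpoint(epoch=2, val_loss=0.50, path="ckpt/epoch2.pt"),
        Checkpoint(epoch=3, val_loss=0.40, path="ckpt/epoch3.pt"),
        Checkpoint(epoch=4, val_loss=0.60, path="ckpt/epoch4.pt"),
    ]
    print(best_checkpoint(history).path)
```

Logging a metric with every checkpoint is what makes this selection possible after the fact, which is why checkpoints and live metrics are mentioned together.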

Today DVC and CML's adoption is purely bottom-up and community-driven. Although we do not have more details on specific enterprise use cases or iterative.ai's venture backing at this point, Petrov mentioned that plans include growing the current headcount of 19 to 30+ in 2021.

DVC and CML seem like a reasonable idea, and adoption looks promising. It's worth keeping an eye on the projects, as well as iterative.ai, to see how traction translates to enterprise use and sustainability.
