Week 3 — Data Journey

Data Storage – Data Journey

  • Data transforms as it flows through the process
  • Interpreting model results requires understanding data transformation

Artifacts and the ML pipeline

  • Artifacts are created as the components of the ML pipeline execute.
  • Artifacts include all of the data and objects which are produced by the pipeline components.
  • This includes the data, in different stages of transformation, the schema, the model itself, metrics, etc.

Data provenance and lineage

  • The chain of transformations that led to the creation of a particular artifact
  • Important for debugging and reproducibility.
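
As a minimal sketch of the idea, lineage can be stored as a mapping from each artifact to the step that produced it and the artifacts that step consumed. Walking that mapping upstream recovers the chain of transformations. The artifact and step names below are hypothetical, chosen only for illustration.

```python
# Hypothetical lineage records: each artifact maps to the step that
# produced it and the artifacts that step consumed.
LINEAGE = {
    "raw.csv": None,  # source data, no producing step
    "clean.csv": ("clean", ["raw.csv"]),
    "schema.pbtxt": ("infer_schema", ["clean.csv"]),
    "model": ("train", ["clean.csv", "schema.pbtxt"]),
}


def provenance(artifact, lineage=LINEAGE):
    """Return the (step, inputs, output) records that led to `artifact`,
    ordered from the earliest transformation to the latest."""
    chain, seen = [], set()

    def visit(name):
        record = lineage.get(name)
        if name in seen or record is None:
            return
        seen.add(name)
        step, inputs = record
        for parent in inputs:
            visit(parent)  # record upstream steps first
        chain.append((step, tuple(inputs), name))

    visit(artifact)
    return chain


for step, inputs, output in provenance("model"):
    print(f"{step}: {', '.join(inputs)} -> {output}")
```

Keeping the records per artifact means a bad model can be traced back to the exact inputs and steps that produced it, which is what debugging and reproducibility need.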

Data lineage: data protection regulation

  • Organizations must closely track and organize personal data
  • Data lineage is extremely important for regulatory compliance

Data versioning

  • Data pipeline management is a major challenge
  • Machine learning requires reproducibility
  • Code versioning: GitHub and similar code repositories
  • Environment versioning: Docker, Terraform, and similar
  • Data versioning:
    • Version control of datasets
    • e.g., DVC, Git-LFS
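
One common way to version a dataset is to identify it by a hash of its contents, so that any change to the data yields a new version ID. The sketch below illustrates only that general idea. The function name `dataset_version` and the choice of SHA-256 are assumptions, and it does not reproduce the on-disk format of DVC or Git-LFS.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: identical content gives the same version ID; any edit changes it.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"  # hypothetical dataset file
    data.write_bytes(b"x,y\n1,2\n")
    v1 = dataset_version(data)
    data.write_bytes(b"x,y\n1,2\n3,4\n")
    v2 = dataset_version(data)

print(v1 != v2)
```

Recording the digest next to the code commit and environment version ties a training run to the exact data it used.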

Introduction to ML Metadata

  • Every run of a production ML pipeline generates metadata containing information about the various pipeline components, their executions and the resulting artifacts.

ML Metadata library

  • Tracks metadata flowing between components in the pipeline
  • Supports multiple storage backends

ML Metadata terminology

Artifact: An artifact is an elementary unit of data that is fed into the ML Metadata store, either consumed as input or generated as output by a component.

Execution: Each execution is a record of any component run during the ML pipeline workflow, along with its associated runtime parameters.

Context: A context may hold the metadata of the projects being run, the experiments being conducted, details about the pipeline, etc. It captures information shared within a group, for example the project name, changelist commit ID, or experiment annotations. Artifacts and executions can be clustered together for each type of component separately.

  • Each unit can be of several different types, each with its own properties stored in ML Metadata.
  • Relationships record which units are generated or consumed when units interact with one another.
    • For example, an Event is the record of a relationship between an artifact and an execution.
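
To make the terminology concrete, here is a toy in-memory model of the four units using plain dataclasses. These classes are hypothetical stand-ins written for illustration. They are not the ML Metadata library's own classes, and the event kinds are simplified to "INPUT" and "OUTPUT".

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    id: int
    type: str  # e.g. "DataSet", "Model"
    uri: str


@dataclass
class Execution:
    id: int
    type: str  # e.g. "Trainer"
    params: dict = field(default_factory=dict)  # runtime parameters


@dataclass
class Event:
    artifact_id: int
    execution_id: int
    kind: str  # "INPUT" or "OUTPUT"


@dataclass
class Context:
    name: str
    artifact_ids: set = field(default_factory=set)
    execution_ids: set = field(default_factory=set)


# One training run: it consumes a dataset and produces a model.
artifacts = [Artifact(1, "DataSet", "data/train"), Artifact(2, "Model", "models/v1")]
executions = [Execution(1, "Trainer", {"epochs": 5})]
events = [Event(1, 1, "INPUT"), Event(2, 1, "OUTPUT")]
experiment = Context("experiment-1", {1, 2}, {1})  # groups the run's units


def outputs_of(execution_id):
    """Artifacts that the given execution produced, found via OUTPUT events."""
    ids = {e.artifact_id for e in events
           if e.execution_id == execution_id and e.kind == "OUTPUT"}
    return [a for a in artifacts if a.id in ids]


print([a.uri for a in outputs_of(1)])
```

The events are what connect artifacts to executions, so questions such as "which run produced this model?" become lookups over the event records.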

Metadata Stored

Inside MetadataStore

ML Metadata in Action

!pip install ml-metadata
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

ML Metadata storage backend

  • ML Metadata registers metadata in a database called the Metadata Store
  • APIs to record and retrieve metadata to and from the storage backend
    • Fake database: in-memory for fast experimentation/prototyping
    • SQLite: in-memory and disk
    • MySQL: server based
    • Block storage: File system, storage area network, or cloud based

Fake database

connection_config = metadata_store_pb2.ConnectionConfig()
# Set an empty fake database proto
connection_config.fake_database.SetInParent()

store = metadata_store.MetadataStore(connection_config)


SQLite

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = '...'
connection_config.sqlite.connection_mode = 3 # READWRITE_OPENCREATE

store = metadata_store.MetadataStore(connection_config)


MySQL

connection_config = metadata_store_pb2.ConnectionConfig()

connection_config.mysql.host = '...'
connection_config.mysql.port = '...'
connection_config.mysql.database = '...'
connection_config.mysql.user = '...'
connection_config.mysql.password = '...'

store = metadata_store.MetadataStore(connection_config)

Schema Development

Schemas are relational objects summarizing the features in a given dataset or project. A schema includes:

  • Feature name
  • Type: float, int, string, etc.
  • Required or optional
  • Valency (feature with multiple values)
  • Domain: range, categories
  • Default values
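
The properties above can be sketched as a small schema and a checker in plain Python. This is a simplified illustration under assumed names (`SCHEMA`, `validate`, `with_defaults`, and the example features are all hypothetical). It is not the schema format used by any particular library.

```python
# Hypothetical schema: one entry per feature.
SCHEMA = {
    "age": {"type": int, "required": True, "domain": (0, 120)},  # numeric range
    "country": {"type": str, "required": True, "domain": {"US", "CA", "MX"}},
    "plan": {"type": str, "required": False, "domain": {"free", "pro"},
             "default": "free"},
    "tags": {"type": str, "required": False, "multi_valued": True},  # valency > 1
}


def validate(record, schema=SCHEMA):
    """Return a list of human-readable problems found in `record`."""
    problems = []
    for name, spec in schema.items():
        if name not in record:
            if spec["required"]:
                problems.append(f"{name}: missing required feature")
            continue
        value = record[name]
        values = value if spec.get("multi_valued") else [value]
        for v in values:
            if not isinstance(v, spec["type"]):
                problems.append(f"{name}: expected {spec['type'].__name__}")
            elif "domain" in spec:
                domain = spec["domain"]
                if isinstance(domain, tuple):  # (low, high) range
                    ok = domain[0] <= v <= domain[1]
                else:  # set of allowed categories
                    ok = v in domain
                if not ok:
                    problems.append(f"{name}: {v!r} outside domain")
    for name in record:
        if name not in schema:
            problems.append(f"{name}: not in schema")
    return problems


def with_defaults(record, schema=SCHEMA):
    """Fill in default values for optional features that are absent."""
    filled = dict(record)
    for name, spec in schema.items():
        if name not in filled and "default" in spec:
            filled[name] = spec["default"]
    return filled


print(validate({"age": 130, "country": "US"}))
```

Returning a list of problems rather than raising on the first one lets a pipeline report every schema violation in a batch at once.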

Reliability during data evolution

Your system and your development process must treat data errors as first-class citizens, just like code bugs.

Iteratively update and fine-tune the schema to adapt to evolving data.

The platform needs to be resilient to disruptions from:

  • inconsistent data
  • software-generated errors, which the pipeline must handle gracefully
  • user misconfigurations
  • uneven execution environments

Scalability during data evolution

Anomaly detection during data evolution

Schema inspection during data evolution

Schema Environments

  • You may have different schema versions for different environments, such as development and testing.

Maintaining varieties of schema

Inspect anomalies in serving dataset

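import tensorflow_data_validation as tfdv

stats_options = tfdv.StatsOptions(schema=schema)
eval_stats = tfdv.generate_statistics_from_csv(
    data_location='...',  # path to the serving dataset CSV
    stats_options=stats_options)

serving_anomalies = tfdv.validate_statistics(eval_stats, schema)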

Schema environments

  • Customize the schema for each environment
  • e.g., Add or remove label in schema based on type of dataset

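# Declare the environments; all features belong to both by default
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Exclude the label from the 'SERVING' environment, as it will not be present there
tfdv.get_feature(schema, 'Cover_Type').not_in_environment.append('SERVING')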

Inspecting anomalies for the serving environment now returns none:

serving_anomalies = tfdv.validate_statistics(
    eval_stats, schema, environment='SERVING')
# No anomalies found
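
The environment mechanism above can be mimicked in plain Python to show why the anomaly disappears. This is a library-free sketch: `SCHEMA` and `missing_features` are hypothetical names, and `Elevation` and `Slope` are illustrative feature names alongside the `Cover_Type` label.

```python
# Hypothetical schema: each feature lists the environments it is excluded from.
SCHEMA = {
    "Elevation": {"not_in_environment": set()},
    "Slope": {"not_in_environment": set()},
    "Cover_Type": {"not_in_environment": {"SERVING"}},  # label, absent when serving
}


def missing_features(columns, environment, schema=SCHEMA):
    """Features the schema expects in `environment` that `columns` lacks."""
    expected = {name for name, spec in schema.items()
                if environment not in spec["not_in_environment"]}
    return sorted(expected - set(columns))


serving_columns = ["Elevation", "Slope"]  # serving data carries no label
print(missing_features(serving_columns, "SERVING"))
print(missing_features(serving_columns, "TRAINING"))
```

The same columns pass in the serving environment and fail in the training environment, because only the training environment expects the label.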

Enterprise Data Storage — Feature Stores

A feature store is a central repository for storing documented, curated and access controlled features.

A feature store makes it easy for different teams to share, discover, and consume highly curated features. Many modelling problems use identical or similar features across an organization, so a feature store greatly reduces redundant work.

Key aspects

  • Manage feature data at any scale, from a single person to large enterprises.
  • Scalable and performant access to feature data in training and serving.
  • Provide consistent and point-in-time correct access to feature data.
  • Enable discovery, documentation, and insights into your features.
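
Point-in-time correct access is usually taken to mean that a lookup returns only the feature value that was known at the requested timestamp, so training examples never see values from their own future. A minimal sketch, assuming a hypothetical in-memory feature log keyed by entity and sorted by time:

```python
from bisect import bisect_right

# Hypothetical feature log: (timestamp, value) pairs per entity, sorted by time.
FEATURE_LOG = {
    "user_1": [(10, 0.2), (20, 0.5), (30, 0.9)],
}


def feature_as_of(entity, timestamp, log=FEATURE_LOG):
    """Latest feature value recorded at or before `timestamp`, else None."""
    history = log.get(entity, [])
    times = [t for t, _ in history]
    i = bisect_right(times, timestamp)  # number of entries with time <= timestamp
    return history[i - 1][1] if i else None


print(feature_as_of("user_1", 25))
```

Using the same lookup for training and serving is one way to keep the two consistent: both see the value that was current at the moment of the request.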

Data Warehouse

A data warehouse is a technology that aggregates data from one or more sources so that it can be processed and analyzed. It is optimized for long-running batch jobs and read operations. Data warehouses are not designed to be placed in a production environment.

Key features of data warehouse

  • Subject oriented: the information stored in a data warehouse revolves around a topic.
  • The data may be collected from various types of sources, such as RDBMSs, files, etc.
  • Non-volatile: previous versions of data are not erased when new data is added.
  • Time variant: it lets you go through timestamped data.


Advantages of data warehouse

  • Enhanced ability to analyze the data, without worrying about serving performance degradation.
  • Timely access to data
  • Enhanced data quality and consistency
  • High return on investment
  • Increased query and system performance

Comparison with databases

Data Lakes

A data lake is a system or repository of data stored in its natural, raw format.

  • A data lake, like a data warehouse, aggregates data from various enterprise data sources.
  • Data can be structured or unstructured.
  • Data is written without any prior processing, so a data lake has no predefined schema.