Understanding Metrics

A metric is a calculated measure that aggregates data in some meaningful way. Think revenue, profit margin, average order size, customer count—basically any aggregated value that helps you analyze your data.

In Cube Alchemy, metrics are defined once and stored within the cube object for later use:

  1. You define a metric using cube.define_metric(), providing at least a name, expression, and aggregation method

  2. The metric is stored in the cube object

  3. Later, you reference metrics by name when defining queries

Building a Metric

Every metric needs three essential components:

  1. Name: A clear, descriptive label for the metric (e.g., 'Revenue', 'Customer Count')

  2. Expression: The calculation formula, using column references inside square brackets (e.g., [qty] * [price])

  3. Aggregation: How to combine values—standard methods like sum, mean, count, or custom functions

# Step 1: Define your metrics
cube.define_metric(
    name='Revenue',
    expression='[qty] * [price]',
    aggregation='sum'
)

cube.define_metric(
    name='Average Order Value', 
    expression='[price]',
    aggregation='mean'
)

cube.define_metric(
    name='Number of Orders',
    expression='[order_id]',
    aggregation=lambda x: x.nunique()
)

# Step 2: Define a query that uses these metrics
cube.define_query(
    name="sales_performance",
    dimensions={'region', 'product_category'},
    metrics=['Revenue', 'Average Order Value', 'Number of Orders']
)

# Step 3: Execute the query by referencing its name
result = cube.query("sales_performance")

# The metrics are calculated and returned as columns in the result DataFrame
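
Since the result is a regular pandas DataFrame, you can work with it directly; for example (a sketch, assuming the query above has been executed):

# Inspect the top rows by revenue
print(result.sort_values('Revenue', ascending=False).head())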

Syntax Rules

  • Column References: Columns in metric expressions must be enclosed in square brackets: [qty], [price], [cost], etc.

  • Aggregation Methods: The aggregation parameter accepts:

    • Pandas group by strings: 'sum', 'mean', 'count', 'min', 'max', etc.

    • You can extend the pandas group by strings by registering custom functions, which will aggregate over the resulting pandas Series. See below.

    • Custom callable functions: lambda x: x.quantile(0.95) or any function that accepts a pandas Series.

NOTE: Avoid passing custom callables directly if you plan to persist these definitions to a YAML (or other source) catalog. Register the function on the cube first; once registered, you can reference it by name.

Derived Metrics

Derived metrics are calculated after aggregation has already occurred. While regular metrics aggregate over dimensions, derived metrics work with already aggregated results, letting you create ratios, percentages, and other derivative calculations.

# First define the metrics needed for the computation
cube.define_metric(
    name='Revenue',
    expression='[qty] * [price]',
    aggregation='sum'
)

cube.define_metric(
    name='Cost',
    expression='[qty] * [unit_cost]',
    aggregation='sum'
)

# Then define a derived metric that uses them
cube.define_derived_metric(
    name='Profit Margin %',
    expression='([Revenue] - [Cost]) / [Revenue] * 100'
)

# Use both regular and derived metrics in queries
cube.define_query(
    name="profitability_analysis",
    dimensions={'product_category', 'region'},
    metrics=['Revenue', 'Cost'],
    derived_metrics=['Profit Margin %']
)

# Execute the query
result = cube.query("profitability_analysis")

The workflow is:

  1. Define regular metrics that perform aggregation

  2. Define derived metrics that reference those aggregated metrics

  3. Include them in your queries (derived metrics are passed separately)

Advanced Features

For more sophisticated analysis, metrics support several powerful options:

  • Different context states: Calculate metrics in different filtering environments

  • Metric filters: Apply specific filters only for a particular metric

  • Row conditions: Pre-filter rows before calculating the metric

  • Ignore Dimensions: Control dimensional aggregation behavior

  • Ignore Context Filters (per metric): Let a metric ignore all or some of the active context filters

  • Nested Aggregations: Two-step aggregations (see below)

  • Custom functions: Use your own Python functions for complex logic

Each of these options lets you create highly specialized metrics that can answer more specific, sophisticated questions.

# Only count high-value orders
cube.define_metric(
    name='High Value Orders',
    expression='[order_id]',
    aggregation='count',
    row_condition_expression='[price] > 100'
)

# Revenue only from specific regions (metric-level filter)
cube.define_metric(
    name='Regional Revenue',
    expression='[qty] * [price]',
    aggregation='sum',
    metric_filters={'region': ['North', 'West']}
)

# Ignore all context filters only for this metric (computed over the Unfiltered state)
cube.define_metric(
    name='Revenue (All Context)',
    expression='[qty] * [price]',
    aggregation='sum',
    ignore_context_filters=True
)

# Ignore only some context filters (e.g., ignore the country filter, respect the rest)
cube.define_metric(
    name='Revenue (Ignoring Country Filter)',
    expression='[qty] * [price]',
    aggregation='sum',
    ignore_context_filters=['country']
)

# Create a new Context State
cube.set_context_state('My New Context')

# Apply different filters to it
cube.filter(
    {'date': ['2024-10-01', '2024-11-01', '2024-12-01']},
    context_state_name='My New Context'
)

# Define a metric computed in the new context state
cube.define_metric(
    name='High Value Orders (My New Context)',
    expression='[order_id]',
    aggregation='count',
    context_state_name='My New Context'
)

# Ignore all dimensions during aggregation
cube.define_metric(
    name='Total Revenue',
    expression='[qty] * [price]',
    aggregation='sum',
    ignore_dimensions=True  # Ignores all dimensions, returns same value for all rows
)

# Ignore specific dimensions from aggregation
cube.define_metric(
    name='Country Revenue',
    expression='[qty] * [price]',
    aggregation='sum',
    ignore_dimensions=['city', 'product', 'date']  # Ignore these dimensions when aggregating
)

# Define a query with these metrics
cube.define_query(
    name="advanced_analysis",
    dimensions={'region', 'product_category'},
    metrics=['High Value Orders', 'Regional Revenue', 'High Value Orders (My New Context)']
)

# Execute the query
result = cube.query("advanced_analysis")

Ignore Context Filters

When a metric specifies ignore_context_filters:

  • If set to True, the metric is evaluated against the Unfiltered context (equivalent to using context_state_name='Unfiltered' for that metric). Any metric_filters on the metric still apply.

  • If set to a list of dimensions, the metric is evaluated against its context state with those specific filters removed; then the metric's own metric_filters (if any) are applied on top.
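
Both behaviors compose with the metric's own filters. A minimal sketch (assuming a 'country' context filter is active and a 'region' column exists):

# Ignore the 'country' context filter for this metric, then apply its own filter on top
cube.define_metric(
    name='North Revenue (Any Country)',
    expression='[qty] * [price]',
    aggregation='sum',
    ignore_context_filters=['country'],   # drop this context filter for this metric only
    metric_filters={'region': ['North']}  # still applied on top
)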

Nested aggregation

Nested aggregation runs the metric in two stages so you can build staged calculations (for example: "average of per-product totals"). Use this when you need to aggregate at a finer grouping first, then combine those intermediate results into the final value.

How it works:

  1. Stage 1: compute the metric by grouping on the metric's effective dimensions plus the nested.dimensions. This produces one value per group defined by those combined dimensions.

  2. Stage 2: aggregate the stage-1 results by the metric's effective dimensions (usually the dimensions in your query). The metric's main aggregation setting is used in this final step.

Key points:

  • nested.aggregation is the aggregation used in stage 1; the metric's aggregation is applied in stage 2.

  • nested.dimensions only change how the first stage groups the rows.

  • The metric's ignore_dimensions parameter applies to both stages and defines the metric's effective dimensions.

Example:

cube.define_metric(
    name='Avg Product Revenue',
    expression='[qty] * [price]',
    aggregation='mean',                # final aggregation over product sums
    nested={'dimensions': ['product'], 'aggregation': 'sum'},
    ignore_dimensions=['store']         # compute at a higher level and broadcast to store rows
)

Explanation: this computes total revenue per product first (stage 1, sum by product), then takes the average of those product totals across the query dimensions (stage 2, mean).
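
For intuition only, the two stages correspond roughly to this plain-pandas pattern (a sketch with toy data; the real engine operates on the cube's filtered context, and the column names are illustrative):

import pandas as pd

# Toy sales data: one query dimension (region) and the nested dimension (product)
df = pd.DataFrame({
    'region':  ['N', 'N', 'N', 'S'],
    'product': ['A', 'A', 'B', 'A'],
    'qty':     [1, 2, 3, 4],
    'price':   [10.0, 10.0, 5.0, 10.0],
})

# Stage 1: total revenue per (effective dimensions + nested dimensions)
df['revenue'] = df['qty'] * df['price']
stage1 = df.groupby(['region', 'product'])['revenue'].sum()

# Stage 2: aggregate the per-product totals back to the effective dimensions,
# using the metric's main aggregation ('mean')
avg_product_revenue = stage1.groupby('region').mean()
print(avg_product_revenue)  # N: (30 + 15) / 2 = 22.5, S: 40.0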

Shared columns: counts and distincts

When multiple tables share a column (for example, customer_id), Cube Alchemy builds a link table containing the distinct values of that column across all participating tables. This has two practical implications:

  • Counting on the shared column name (e.g., customer_id) uses the link table and therefore reflects distinct values in the current filtered context across all tables that share it.

  • Counting on a per-table renamed column (e.g., customer_id <orders> or customer_id <customers>) uses that table’s own column values. The result can differ from the shared-column count because it’s scoped to that single table's values and is not the cross-table distinct set.

Example idea:

  • Count distinct customer_id (shared) → distinct customers across all linked tables.

  • Count distinct customer_id <orders> → distinct customers present in the Orders table specifically.

Choose the one that matches your analytical intent: cross-table distincts via the shared column, or table-specific distincts via the renamed columns. Note: per-table renamed columns are available only when rename_original_shared_columns=True at Hypercube initialization; set it to False to drop them and reduce memory and processing if you don't need that analysis.
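
A sketch of both variants as metric definitions (the column names are illustrative, and it assumes the renamed column can be referenced in brackets like any other column):

# Distinct customers across all linked tables (resolved via the link table)
cube.define_metric(
    name='Distinct Customers',
    expression='[customer_id]',
    aggregation=lambda x: x.nunique()
)

# Distinct customers present in the Orders table specifically
# (requires rename_original_shared_columns=True at Hypercube initialization)
cube.define_metric(
    name='Distinct Customers (Orders)',
    expression='[customer_id <orders>]',
    aggregation=lambda x: x.nunique()
)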

Custom Functions

When your analysis requires logic that goes beyond basic arithmetic, you can register and use custom Python functions:

  1. Define a Python function that performs your specialized calculation

  2. Register the function with your cube using cube.add_functions()

  3. Reference the function in your metric expressions using the @function_name syntax (Pandas and Numpy are already registered as pd and np).

This powerful feature allows you to implement virtually any calculation logic while keeping your metric definitions clean and readable.

import numpy as np

# Define and register a safe division function
def safe_division(numerator, denominator, default=0.0):
    """Safely divide two arrays, handling division by zero"""
    result = numerator / denominator
    return result.replace([np.inf, -np.inf], np.nan).fillna(default)

# Register the function with your hypercube
cube.add_functions(safe_division=safe_division)

# Use it in a metric definition
cube.define_metric(
    name='Profit Margin %',
    expression='@safe_division([revenue] - [cost], [revenue]) * 100',
    aggregation='mean'
)

# Another example: categorizing data
def categorize_revenue(revenue_values):
    """Categorize revenue into tiers"""
    conditions = [
        revenue_values < 1000,
        (revenue_values >= 1000) & (revenue_values < 5000),
        revenue_values >= 5000
    ]
    choices = ['Low', 'Medium', 'High']
    return np.select(conditions, choices, default='Unknown')

cube.add_functions(categorize_revenue=categorize_revenue)

# Use for conditional logic - count how many sales fall into each tier
cube.define_metric(
    name='Revenue Tier Count',
    expression='@categorize_revenue([qty] * [price])',
    aggregation=lambda x: len(x)
)

# Or get the most common revenue tier
cube.define_metric(
    name='Most Common Revenue Tier',
    expression='@categorize_revenue([qty] * [price])',
    aggregation=lambda x: x.value_counts().index[0]
)

# All pandas and numpy functions are already registered and can be used right away via @pd and @np

# For example, if you have this metric:
cube.define_metric(
    name='High Value Orders',
    expression='[order_id]',
    aggregation='count',
    row_condition_expression='[price] > 100'
)

# it can also be defined using numpy inside the expression:

cube.define_metric(
    name='High Value Orders',
    expression='@np.where([price] > 100, [order_id], @np.nan)',
    aggregation='count'
)

# Both definitions return exactly the same value

# Advanced handling of missing values with @ functions
# While the basic fillna parameter fills all metric columns with the same value:
cube.define_metric(
    name='Revenue',
    expression='[qty] * [price]',
    aggregation='sum',
    fillna=0  # Fills both [qty] and [price] with 0
)

# For column-specific NA handling, use @ functions directly in the expression:
cube.define_metric(
    name='Revenue with Custom NA Handling',
    expression='@pd.Series([qty]).fillna(1) * @pd.Series([price]).fillna(0)',
    aggregation='sum'  # Fills [qty] with 1 and [price] with 0
)

# Using np.where for conditional NA handling:
cube.define_metric(
    name='Revenue with Conditional Defaults',
    expression='@np.where(@pd.isnull([qty]), 0, [qty])',
    aggregation='sum'
)

# You can inspect available registered functions:
cube.function_registry

Note on Function Handling

Metric aggregation allows you to pass custom functions directly, while expression functions must be registered on the hypercube first and referenced with @function_name. This difference comes from their roles in the processing pipeline:

  • Expression functions operate on dataframe rows, before aggregation

  • Aggregation functions operate on the grouped data

So you can in fact pass even lambda functions as aggregations. Bear in mind, however, that unregistered functions cannot be persisted to model_catalog sources: if you define a metric using a lambda, <lambda> is stored in your YAML model catalog, and when it is read back Python will not know how to interpret it. If you work with an external model catalog, it is therefore good practice to register all the functions used in your hypercube.
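
For example, instead of passing a lambda you might register a named function and reference it by name, following the extension mechanism described under Syntax Rules (a sketch; the distinct_count name is an assumption for illustration):

# Catalog-friendly alternative to aggregation=lambda x: x.nunique()
def distinct_count(series):
    """Number of unique values in the aggregated pandas Series."""
    return series.nunique()

cube.add_functions(distinct_count=distinct_count)

cube.define_metric(
    name='Number of Orders',
    expression='[order_id]',
    aggregation='distinct_count'  # resolved from the registry; persists cleanly to YAML
)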