Using the measures framework🔗
Introduction🔗
The measures framework is used to calculate quotients (i.e. a numerator divided by a denominator) and to see how these vary over time and when broken down by different groupings.
The numerators, denominators and groups are all defined using ehrQL queries, just like we would use with a dataset definition. But the way those definitions are used is a bit different.
To explain the concepts involved we'll start with a basic but complete example and walk through each of the elements.
Basic example🔗
Suppose we want to know what proportion of the patients prescribed atorvastatin tablets in a given month were prescribed 80mg tablets, with the results broken down by sex and calculated for each of the first six months of 2022. Here's how we'd do that using the measures framework:
Code🔗
from ehrql import INTERVAL, case, create_measures, months, when
from ehrql.tables.core import medications, patients
# Every measure definitions file must include this line
measures = create_measures()
# Disable disclosure control for demonstration purposes.
# Values will neither be suppressed nor rounded.
measures.configure_disclosure_control(enabled=False)
# Small codelist for demonstration purposes; the real list would be longer
atorvastatin_tablets = [
"39733211000001101",
"39695411000001103",
"39733011000001106",
"39733111000001107",
]
atorvastatin_80mg_tablets = [
"39733211000001101",
]
# The use of the special INTERVAL placeholder below is the key part of
# any measure definition as it allows the definition to be evaluated
# over a range of different intervals, rather than a fixed pair of dates
rx_in_interval = medications.where(
medications.date.is_during(INTERVAL)
)
atorvastatin_rx = rx_in_interval.where(
medications.dmd_code.is_in(atorvastatin_tablets)
)
atorvastatin_80_rx = rx_in_interval.where(
medications.dmd_code.is_in(atorvastatin_80mg_tablets)
)
has_recorded_sex = patients.sex.is_in(["male", "female"])
measures.define_measure(
name="atorva_80",
numerator=atorvastatin_80_rx.exists_for_patient(),
denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
group_by={
"sex": patients.sex
},
intervals=months(6).starting_on("2022-01-01"),
)
Running the example🔗
You can save this file as measure_definition.py
and then run the generate-measures
command on it:
opensafely exec ehrql:v1 generate-measures measure_definition.py --output measures.csv
Results🔗
This should produce a file called measures.csv
whose contents look something like this:
measure | interval_start | interval_end | ratio | numerator | denominator | sex |
---|---|---|---|---|---|---|
atorva_80 | 2022-01-01 | 2022-01-31 | 0 | 0 | 5 | female |
atorva_80 | 2022-01-01 | 2022-01-31 | 0.2 | 2 | 10 | male |
atorva_80 | 2022-02-01 | 2022-02-28 | 0 | 0 | 3 | female |
atorva_80 | 2022-02-01 | 2022-02-28 | 0.2 | 1 | 5 | male |
atorva_80 | 2022-03-01 | 2022-03-31 | 0.5 | 3 | 6 | female |
atorva_80 | 2022-03-01 | 2022-03-31 | 0.2 | 2 | 10 | male |
atorva_80 | 2022-04-01 | 2022-04-30 | 0.444 | 4 | 9 | male |
atorva_80 | 2022-04-01 | 2022-04-30 | 0.125 | 1 | 8 | female |
atorva_80 | 2022-05-01 | 2022-05-31 | 0.286 | 2 | 7 | male |
atorva_80 | 2022-05-01 | 2022-05-31 | 0.2 | 1 | 5 | female |
atorva_80 | 2022-06-01 | 2022-06-30 | 0.25 | 2 | 8 | male |
atorva_80 | 2022-06-01 | 2022-06-30 | 0.5 | 5 | 10 | female |
Here the measure
column always has the same value (atorva_80
) because we only have a single measure defined, but if we had defined multiple measures then this would tell us which measure each row relates to.
The interval_start
and interval_end
columns show the date range (inclusive) covered by each row. In this case there are 12 rows because each of the six months appears twice: once where sex=male
and again where sex=female
.
The denominator
column gives the number of patients for each interval-sex combination which match our denominator definition. The numerator
column gives the number of patients which also match the numerator definition. That is, a patient must match both the denominator and numerator definitions to be included in the numerator.
The ratio
is simply numerator
divided by denominator
.
The sex
column gives the value of the expression we supplied when defining group_by
. Had we defined any other groupings (say age_band
) then they would appear as additional columns here.
Core concepts🔗
Defining a measure🔗
A measure definition always starts by creating a measures collection object:
measures = create_measures()
Each individual measure is then defined by calling measures.define_measure()
:
measures.define_measure(
name="atorva_80",
numerator=atorvastatin_80_rx.exists_for_patient(),
denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
group_by={
"sex": patients.sex
},
intervals=months(6).starting_on("2022-01-01"),
)
The name
argument is just used so we can identify our measure in
the output. It can be anything you like so long as it contains only
alphanumeric and underscore characters and starts with a letter.
The denominator
argument defines the condition that patients must
match to be included in the denominator i.e. it is a boolean patient
series. (It is also possible
to supply an integer here, but we'll cover this later.)
The numerator
arguments defines the additional condition that
patients must also meet to be included in the numerator. (Again, this
can also be an integer as we'll discuss later.)
The group_by
argument defines how we would like our results broken
down. It's optional – leaving it out means that we'll get a single row
for each time interval – but most measures define at least one set of
groups. It is supplied as a dictionary mapping group names to group
definitions.
As we saw in the example above, each group name defined here ends up as column in the results. Each group definition should be a "categorical" patient series, that is: a patient series which takes only a fixed set of values.
The interval
argument defines the time periods over which the
measure will be calculated. This is given as a list of start date/end
date pairs but, as typing these all out by hand would be laborious, we
provide several convenience functions for generating such lists:
years
,
months
and
weeks
.
The INTERVAL
placeholder🔗
The key difference between the ehrQL we'd write for a dataset definition and the ehrQL we write for a measure comes in this line here:
rx_in_interval = medications.where(
medications.date.is_during(INTERVAL)
)
This filters the medications table to include just those prescribed during "the current interval" without specifying exactly what the interval is.
In a dataset definition we would need to reference a specific pair of dates here e.g.:
rx_in_interval = medications.where(
medications.date.is_on_or_between("2022-01-01", "2022-01-31")
)
But the query we use in our measure definition needs to be evaluated over a range of date intervals, not just one. We can handle this by using the special INTERVAL
value as a placeholder for the start and end dates of the intervals:
rx_in_interval = medications.where(
medications.date.is_on_or_between(INTERVAL.start_date, INTERVAL.end_date)
)
However, it's a bit cumbersome to have to type is_on_or_between()
, start_date
and end_date
every time so we provide is_during()
as a convenient shorthand:
rx_in_interval = medications.where(
medications.date.is_during(INTERVAL)
)
Disclosure control🔗
By default, numerators and denominators are subject to disclosure control. First, values less than or equal to seven are replaced with zero (suppressed); then, values are rounded to the nearest five.
We disabled disclosure control with this line here:
measures.configure_disclosure_control(enabled=False)
However, we should carefully consider whether we wish to disable disclosure control when running inside of the secure environment. If we don't, then we should remove that line.
Additional notes🔗
Multiple measures🔗
We're not restricted to just one measure per file. If we wanted to add another measure using the same numerator and denominator but this time broken down by age band, then we would start by defining an age band variable like this:
age = patients.age_on(INTERVAL.start_date)
age_band = case(
when((age >= 0) & (age < 20)).then("0-19"),
when((age >= 20) & (age < 40)).then("20-39"),
when((age >= 40) & (age < 60)).then("40-59"),
when((age >= 60) & (age < 80)).then("60-79"),
when(age >= 80).then("80+"),
)
And then we can define a new measure exactly like the previous one but
with different name
and using age_band
in our group_by
argument:
measures.define_measure(
name="atorva_80_by_age",
numerator=atorvastatin_80_rx.exists_for_patient(),
denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
group_by={
"age_band": age_band
},
intervals=months(6).starting_on("2022-01-01"),
)
Removing duplication🔗
As our second measure shares so much in common with our first it seems redundant to have to specify the numerator, denominator and intervals all over again. Furthermore, if we ever wanted to make changes to these definitions we'd need to make the change in multiple places to keep them consistent.
To avoid this we can use
measures.define_defaults()
to set values which we know are going to be common between all the
measures in our file:
measures.define_defaults(
numerator=atorvastatin_80_rx.exists_for_patient(),
denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
intervals=months(6).starting_on("2022-01-01"),
)
And then we can define our two measures using:
measures.define_measure(
name="atorva_80",
group_by={
"sex": patients.sex
},
)
measures.define_measure(
name="atorva_80_by_age",
group_by={
"age_band": age_band
},
)
Grouping by multiple features🔗
The above example defines two separate measures, one grouping by sex and the other by age band. But it is also possible to define a single measure that groups by sex and age band simultaneously:
measures.define_measure(
name="atorva_80_by_age_and_sex",
group_by={
"sex": patients.sex,
"age_band": age_band,
},
)
This will produce a row of output for each possible combination of sex
and age band: male,0-19
female,0-19
, male,20-39
, female,20-39
and so on for all ten combinations.
As we defined our measure to cover monthly intervals over a six month period, this means that this single measure will produce 60 rows – ten for each month.
Dummy data🔗
When run outside of the secure environment ehrQL will generate dummy
data for measures just as it does with datasets. However, note that
generating meaningful results for measures defined over a large time
period and with many different groupings can require significantly more
dummy data than would be needed for a dataset definition. The measures
framework will make a crude attempt to guess how many dummy patients it
should generate, but you may need to adjust this number using the
measures.configure_dummy_data()
method.
For more information about using dummy data with measures please see our how-to guide.