Analysing DVLA Data for Vintage Comebacks

GitHub repo: https://github.com/jamesdeluk/data-projects/tree/main/dvla-vehicle-statistics

Intro

The DVLA website provides vehicle licensing data, updated every few months. As a car and data enthusiast, I wanted to have a poke around. I started with two primary questions:

What are the most popular vehicles in the UK?
Which old vehicles are having a comeback? → This might suggest they will appreciate in value!

Selecting the dataset

There are a few different datasets, available here: https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-files

The one(s) I decided to use were:

df_VEH0124_AM: Vehicles at the end of the year by licence status, body type, make (A to M), generic model, model, year of first use and year of manufacture: United Kingdom
Scope: All licensed vehicles in the United Kingdom with Make starting with A to M; annually from 2014
Schema: BodyType, Make, GenModel, Model, YearFirstUsed, YearManufacture, LicenceStatus, [number of vehicles; 1 column per year]

and

df_VEH0124_NZ: Vehicles at the end of the year by licence status, body type, make (N to Z), generic model, model, year of first use and year of manufacture: United Kingdom.
Scope: All licensed vehicles in the United Kingdom with Make starting with N to Z; annually from 2014
Schema: BodyType, Make, GenModel, Model, YearFirstUsed, YearManufacture, LicenceStatus, [number of vehicles; 1 column per year]

This includes everything I need, although it only goes back a decade. Another dataset does go back to 1994:

df_VEH0120_GB: Vehicles at the end of the quarter by licence status, body type, make, generic model and model: Great Britain
Scope: All registered vehicles in Great Britain; from 1994 Quarter 4 (end December)
Schema: BodyType, Make, GenModel, Model, Fuel, LicenceStatus, [number of vehicles; 1 column per quarter]

This has two main drawbacks, relative to the 124 dataset:

Rather than annual data, this has quarterly data. In theory, this sounds better, as it’s more precise; however, given the cyclicity of car buying habits across the year, it means the numbers are more like sine waves rather than smooth lines, making any forecasting or trend-detection more difficult. For my questions, there’s no benefit for having the more granular data.
This doesn’t include year of manufacture. This means all BMW M3s are grouped together - so an E30 M3 registered in 2023 will be grouped with a brand-new G80. One is a collectable; one is not.

Pre-processing the data

The first step is to simply join the two CSVs:

data_am = pd.read_csv("df_VEH0124_AM.csv")
data_nz = pd.read_csv("df_VEH0124_NZ.csv")
data = pd.concat([data_am, data_nz], axis=0)

This is the first row:

BodyType	Make	GenModel	Modela	YearFirstUsed	YearManufacture	LicenceStatus	2023	2022	2021	2020	2019	2018	2017	2016	2015	2014
Cars	ABARTH	ABARTH 124	124 GT MULTIAIR	2019	2019	Licensed	5	5	5	6	6	0	0	0	0	0

What does this tell us? In 2023, five Abarth 124 GT MultiAirs built in 2019 were licensed. Same in 2022 and 2021 (presumably the same vehicles), and 6 in 2019 and 2020. Not the most popular car!

I needed to do some minor data modifications to enable future analysis:

data.loc[:,'YearFirstUsed'] = data['YearFirstUsed'].replace('[x]',0)
data['YearFirstUsed'] = data['YearFirstUsed'].astype(int)
data.loc[:,'YearManufacture'] = data['YearManufacture'].replace('[x]',0)
data['YearManufacture'] = data['YearManufacture'].astype(int)
years = ['2023','2022','2021','2020','2019','2018','2017','2016','2015','2014']
data[years] = data[years].replace('[z]',0).astype(int)

Both FirstYearUsed and YearManufactured have some values of [x] where the data was unavailable; given I’ll be doing some date-based filtering later, I need these columns to be numerical, and hence replaced the [x] with 0. Similarly, some of the year columns had values of [z], which means the data was not applicable; again, I replaced with 0.

Melting and pivoting

Now to transform the data. First, melt it, so all the years become a single column:

data_melt = data.melt(
    id_vars=['BodyType','Make','GenModel','Model','YearFirstUsed','YearManufacture','LicenceStatus'],
    var_name='Year',
    value_name='Count'
)
data_melt['Year'] = data_melt['Year'].astype(int)

This makes the Year column an object; I need it as an integer, hence the astype.

Now the data looks like:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	LicenceStatus	Year	Count
Cars	ABARTH	ABARTH 124	124 GT MULTIAIR	2019	2019	Licensed	2023	5

Next, pivot based on the LicenseStatus, so Licensed and SORN become separate columns:

data_pivot = data_melt.pivot_table(
    index=['BodyType','Make','GenModel','Model','YearFirstUsed','YearManufacture','Year'],
    columns='LicenceStatus',
    values='Count',
    aggfunc='sum'
).reset_index()
data_pivot['Licensed'] = data_pivot['Licensed'].fillna(0)
data_pivot['SORN'] = data_pivot['SORN'].fillna(0)

Without reset_index(), the index= columns become indexes, making life more complicated. A new NaNs appeared, for when a vehicle had data for Licensed or SORN but not both - it makes sense to, at least for this anlysis, assume there were none. Now I have:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN
Buses and coaches	AIXAM	AIXAM MODEL MISSING	MISSING	2007	2007	2014	6	0

Nope, I have no idea what an Aixam is either. Apparently it’s a French microcar manufacturer. A microcar bus/coach? Maybe one of these (pic from Wikipedia):

aixam

Kinda cute. In 2014, there were six licensed, and zero SORN, of the model manufactured in 2007.

It’s also worth noting that some of the Model data is MISSING. In fact, there are 842,360 (out of 5,442,280) MISSINGs, so about 15%. 816 out of 877 Makes have at least one MISSING. So this data is not complete - although there’s not much I can do about this right now. I was curious to see the most-registered MISSING:

data_pivot[data_pivot['Model'] == 'MISSING'].sort_values('Licensed', ascending=False)

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN
Other vehicles	OTHER BRITISH	OTHER BRITISH MODEL MISSING	MISSING	0	0	2022	53515	3540

Very curious.

Initial stats

What are the most-Licensed and most-SORN vehicles? The code is quite simple (just a sort_values()), so I won’t include it here, but it’s in the repo.

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN
Cars	FORD	FORD FIESTA	FIESTA ZETEC	2014	2014	2014	62883	131

Not a big surprise; you see them everywhere. What about for 2023 (the latest year with data)?

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN
Cars	FORD	FORD FIESTA	FIESTA ZETEC	2014	2014	2023	59497	466

The exact same car - which kinda makes sense. Over the 9 years, about 2700 have disappeared (presumably scrapped), and 300 have been SORN.

The most SORN one overall was a MISSING (Honda motorcycle, if you’re curious). Ignoring the MISSINGs:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN
Light goods vehicles	FORD	FORD TRANSIT	TRANSIT 350 LWB	2005	2005	2023	3595	3490

Good old Transit, the white van man’s favourite. Fun fact, I once spent a few months living in a Transit (yes, out of choice).

Searching the data

Searching is easily done through str.contains. Decide the column to look in, and the phrase to look for. This will show all the unique Models including that phrase, with the count:

search_col = 'Model'
search_phrase = r'FIESTA'

model_search_data = data[data[search_col].str.contains(search_phrase, case=False, na=False)]
results = model_search_data[search_col].unique()

print(model_search_data[search_col].nunique())
for i, c in enumerate(results):
    print(f"{i}: {c}")

Okay, there are 373 Fiestas. Wow. Changing the search to FIESTA ST returns “only” 58, including my old car:

7: FIESTA ST-3 TURBO

I can use this ID to get the data:

choice = 7
model_data = data[data[search_col] == results[choice]]
model_data_pivot = data_pivot[data_pivot[search_col] == results[choice]]

If I take the head() of this data:

BodyType	Make	GenModel	Model	YearManufacture	Year
Cars	FORD	FORD FIESTA	FIESTA ST-3 TURBO	2016	2014
Cars	FORD	FORD FIESTA	FIESTA ST-3 TURBO	2016	2015
Cars	FORD	FORD FIESTA	FIESTA ST-3 TURBO	2016	2016
Cars	FORD	FORD FIESTA	FIESTA ST-3 TURBO	2016	2017
Cars	FORD	FORD FIESTA	FIESTA ST-3 TURBO	2016	2018

I might use this data later.

Some manufactures have the same model name (especially if it’s just a number); as such, it can be better to use GenModel instead of Model for the search_col. Of course, in some cases, to get the exact data you need, you may need to search by make and model.

Licensed to SORN ratio

Moving on, I wanted to see the ratio between Licensed and SORN. A high ratio means far more vehicles are on the road than off, implying it’s popular; a low ratio, with more SORN than licensed vehicles, implies a high number of (likely older) vehicles that haven’t been destroyed, but aren’t on the road.

data_pivot["LS_Ratio"] = data_pivot["Licensed"] / data_pivot["SORN"]
data_pivot["LS_Ratio"] = data_pivot["LS_Ratio"].replace([np.inf, -np.inf], np.nan)

The second line is needed to overcome if there are no SORN counts, as a number divided by 0 is infinite. Sorting to find the max and min of this ratio gives us:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio
Cars	FORD	FORD FOCUS	FOCUS ZETEC EDITION	2017	2017	2017	7985	1	7985

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio
Motorcycles	DUCATI	DUCATI 900 SS	900 SS	0	1989	2015	0	1	0

OK, this should have been expected. Any vehicle with a single SORN will have a high ratio, and any with none licensed with have a ratio of 0 (from the infinite result of the calculation). Picking an arbitrary value of over 1000 Licensed for the low ratios:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio
Motorcycles	HONDA	HONDA MODEL MISSING	MISSING	1981	1981	2019	1006	4883	0.206021

Let’s remove the MISSINGs:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio
Cars	LAND ROVER	LAND ROVER DISCOVERY	DISCOVERY TDI	1995	1995	2018	1223	3125	0.39136

Probably a bunch of farmers with some rusty Landies.

And over 1000 SORNs for the high ratio:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio
Cars	FORD	FORD FIESTA	FIESTA ZETEC CLIMATE	2007	2007	2019	25694	1031	24.92144

Still Ford Zetecs, but the baby brother.

I wanted to calculate the LS ratio over a multi-year period. The best way to do this was the group the data (to ensure it was only the exact same vehicle, as ungrouped the 2023 data for one vehicle would be next to the 2014 data for a different vehicle), shift the column (so the LS_ratio_-1 column for year 2023 was the LS_ratio for 2022), then take the 3- and 5-year mean.

data_pivot_grouped = data_pivot.groupby(['Make','GenModel','Model','YearManufacture'])

for n in [1,2,3,4]:
    data_pivot[f'LS_Ratio_-{n}'] = data_pivot_grouped['LS_Ratio'].shift(n)
    data_pivot[f'LS_Ratio_-{n}'] = data_pivot[f'LS_Ratio_-{n}'].fillna(0)

data_pivot['LS_Ratio_Mean_5yr'] = data_pivot[['LS_Ratio', 'LS_Ratio_-1', 'LS_Ratio_-2', 'LS_Ratio_-3', 'LS_Ratio_-4']].mean(axis=1)
data_pivot['LS_Ratio_Mean_3yr'] = data_pivot[['LS_Ratio', 'LS_Ratio_-1', 'LS_Ratio_-2']].mean(axis=1)

The first result is the Aixam again (alphabetical), but that’s mostly zeros, so let’s look at a Fiesta Zetec:

BodyType	Make	GenModel	Model	YearFirstUsed	YearManufacture	Year	Licensed	SORN	LS_Ratio	LS_Ratio_-1	LS_Ratio_-2	LS_Ratio_-3	LS_Ratio_-4	LS_Ratio_Mean_5yr	LS_Ratio_Mean_3yr
Cars	FORD	FORD FIESTA	FIESTA ZETEC	2014	2014	2023	59497	466	127.676	156.426	178.7353	197.1974	247.8992	181.5868	154.2791

Looks like the LS ratio is decreasing over time, and is currently below the 3- and 5-year mean, suggesting more are being SORN (or, if SORN is constant, fewer are being Licensed).

Plots

If you’ve seen any of my other posts, you know I like plots, so I built a function that plots the key values over time. The input df is best as a single car, such as from the search above. The function creates two plots. The first one is a line chart, with the Licensed and SORN counts over time on one axis, and the LS ratio on a second:

def plots(df):
    car_data_grouped_by_year = df[['Year', 'Licensed', 'SORN']]
    car_data_grouped_by_year = car_data_grouped_by_year.groupby('Year').sum()

    fig, ax1 = plt.subplots(figsize=(10,6))
    ax2 = ax1.twinx()
    plt.title("Count of licensed and SORN vehicles, and ratio, per year")

    sns.lineplot(data=car_data_grouped_by_year, x="Year", y="Licensed", ax=ax1, label="Licensed", color="b", errorbar=None, estimator=None)
    sns.lineplot(data=car_data_grouped_by_year, x="Year", y="SORN", ax=ax1, label="SORN", color="g", errorbar=None, estimator=None)
    ax1.set_ylabel("Licensed & SORN", color="black")
    ax1.set_xlabel("Year")

    car_data_grouped_by_year['LS_Ratio_Group'] = car_data_grouped_by_year['Licensed'] / car_data_grouped_by_year['SORN']
    
    sns.lineplot(data=car_data_grouped_by_year, x="Year", y="LS_Ratio_Group", ax=ax2, label="Ratio", color="r", errorbar=None)
    ax2.axhline(1, color='black', linewidth=0.8, linestyle='--')
    ax2.set_ylabel("Licensed to SORN Ratio", color="red")

    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    lines = lines1 + lines2
    labels = labels1 + labels2
    plt.legend(lines, labels, loc='best')

    plt.show()

The data needs to be grouped to get the sum by year (regardless of year of manufacture, for example). As taking the sum of the LS ratio doesn’t make sense - not any other form of aggregation, really - I recalculated it with the summed values. The legends for both axes are merged into one.

The second plot, in the same function, is a bar plot giving the number of licensed vehicles by manufacture year (again, requiring a groupby and sum):

    car_data_grouped_by_manu_year = df[df['Year'] == 2023][['YearManufacture','Licensed']].groupby('YearManufacture').sum()

    plt.figure(figsize=(11,6))
    plt.title("Count of licensed cars per year of manufacture")
    sns.barplot(car_data_grouped_by_manu_year, x=car_data_grouped_by_manu_year.index, y='Licensed')
    plt.xlabel("Year of Manufacture")
    plt.xticks(rotation=45)
    plt.show()

Using the ST-3 search example above:

plots(model_data_pivot)

The first chart shows a steady increase licensed STs, and a slow increase in SORNs. The ratio drops dramatically at first, as expected, as a small increase in SORN results in a large percentage change (the curse of percentages).

The second chart shows how 2019 was the most popular year of manufacture, with huge drops in 2020 and 2021 (thanks COVID). Side note, a new generation was released in 2017; I wonder if that could be a cause of the drop in 2018 (for example, waiting to see if the new car is good). There’s no way of separating generation in this dataset.

By adding display() into the plots() function, we can see the data used to build the plots:

Year	Licensed	SORN	LS_Ratio_Group
2014	2242	2	1121
2015	5440	7	777.1429
2016	9106	33	275.9394
2017	12281	78	157.4487
2018	13461	98	137.3571
2019	16454	138	119.2319
2020	17443	174	100.2471
2021	17093	219	78.05023
2022	19102	304	62.83553
2023	20959	455	46.06374

YearManufacture	Licensed
2014	1759
2015	2647
2016	3154
2017	2803
2018	1392
2019	3239
2020	1215
2021	276
2022	2055
2023	2419

Older vehicles

Time for a specific question. What 20th century vehicles are still on the road?

display(
    data_pivot[(data_pivot['BodyType'] != 'Motorcycles') & (data_pivot['BodyType'] != 'Light goods vehicles') & (data_pivot['BodyType'] != 'Other vehicles') & (data_pivot['YearManufacture'] < 2000) & (data_pivot['Model'] != 'MISSING') & (data_pivot['Licensed'] > 100) & (data_pivot['Year'] == 2023)]
    .drop(['Year','LS_Ratio_-1','LS_Ratio_-2','LS_Ratio_-3','LS_Ratio_-4'], axis=1)
    .sort_values('LS_Ratio_Mean_5yr', ascending=False)
    .head(5)
)

This will give cars manufactured before 2000, excluding the MISSINGs, with more than 100 licensed in 2023. Drop the unnecessary columns, and sort by the 5-year mean LS ratio:

Make	GenModel	Model	YearFirstUsed	YearManufacture	Licensed	SORN	LS_Ratio	LS_Ratio_Mean_5yr	LS_Ratio_Mean_3yr
MG	MG MGB	B	1981	1980	159	5	31.8	48.96	80.6
MG	MG MGB	B GT	1981	1980	164	5	32.8	33.52667	55.54444
MG	MG MGB	B	1974	1973	101	8	12.625	15.925	15.65278
MG	MG MGB	B	1966	1966	113	13	8.692308	9.127772	9.193473
MG	MG MGA	A	1959	0	171	22	7.772727	8.135448	7.689065

All MGs! Even now, 45 years later, there are 30 times as many MGBs on the road than are SORN. However, the 2023 LS ratio is lower than the 3- and 5-year means, suggesting more are going off the road (that said, given the SORN figure is 5, a jump from 4 to 5 would dramatically change the ratio, so the change in ratio may not be as significant as it looks).

Let’s group by the model, regardless of the year of manufacture (as long as it’s pre-2000). When grouping, you have to define how to aggregate the other columns; I’ll take the sum of Licensed and SORN (giving all the vehicles), and the mean of the ratios (as summing ratios is nonsensical):

display(
    data_pivot[(data_pivot['BodyType'] != 'Motorcycles') & (data_pivot['BodyType'] != 'Light goods vehicles') & (data_pivot['BodyType'] != 'Other vehicles') & (data_pivot['YearManufacture'] < 2000) & (data_pivot['Model'] != 'MISSING') & (data_pivot['Licensed'] > 100) & (data_pivot['Year'] == 2023)]
    .drop(['GenModel','YearFirstUsed','YearManufacture','Year','LS_Ratio_-1','LS_Ratio_-2','LS_Ratio_-3','LS_Ratio_-4'], axis=1)
    .groupby('Model', as_index=False)
    .agg({'BodyType':'first', 'Make':'first', 'Licensed':'sum', 'SORN':'sum', 'LS_Ratio':'mean', 'LS_Ratio_Mean_5yr':'mean', 'LS_Ratio_Mean_3yr':'mean'})
    .sort_values('LS_Ratio_Mean_5yr', ascending=False)
    .head(5)
)

Note how taking the mean LS ratio means it’s not the same as the sum of Licensed divided by sum of SORN - it’s typically close, but sometimes but always.

Model	Make	Licensed	SORN	LS_Ratio	LS_Ratio_Mean_5yr	LS_Ratio_Mean_3yr
MINI MINOR	MORRIS	108	15	7.2	6.695833	6.986111
TR2	TRIUMPH	136	22	6.181818	6.093074	6.139971
A	MG	1042	192	5.757499	5.948549	5.850808
MUSTANG GT	FORD	210	35	6.238095	5.926916	6.240947
B	MG	12297	4302	4.823928	5.338588	6.451258

The MGs are still there, but now a Triumph joins the pack. Also, a Mustang and a Mini. For those who’ve never seen the classic Mustang-vs-Mini races, watch this:

Next, the same search, but grouping by the GenModel, giving a slightly broader view - in this query, for example, all the MGBs in the table two above would be grouped:

display(
    data_pivot[(data_pivot['BodyType'] != 'Motorcycles') & (data_pivot['BodyType'] != 'Light goods vehicles') & (data_pivot['BodyType'] != 'Other vehicles') & (data_pivot['YearManufacture'] < 2000) & (data_pivot['Model'] != 'MISSING') & (data_pivot['Licensed'] > 100) & (data_pivot['Year'] == 2023)]
    .drop(['BodyType','Model','YearFirstUsed','YearManufacture','Year','LS_Ratio_-1','LS_Ratio_-2','LS_Ratio_-3','LS_Ratio_-4'], axis=1)
    .groupby('GenModel', as_index=False)
    .agg({'Make':'first', 'Licensed':'sum', 'SORN':'sum', 'LS_Ratio_Mean_5yr':'mean', 'LS_Ratio_Mean_3yr':'mean'})
    .sort_values('LS_Ratio_Mean_5yr', ascending=False)
    .head(5)
)

GenModel	Make	Licensed	SORN	LS_Ratio_Mean_5yr	LS_Ratio_Mean_3yr
TRIUMPH TR2	TRIUMPH	136	22	6.093074	6.139971
MG MGA	MG	1042	192	5.948549	5.850808
FORD MUSTANG	FORD	325	64	5.644761	5.678685
TRIUMPH TR5	TRIUMPH	298	55	5.267664	5.602357
TRIUMPH TR3	TRIUMPH	122	28	5.016905	4.644841

More Triumphs join the mix!

What have we learnt so far? In the UK, there are a surprising number of classic Triumphs and MGs from the 60s, 70s, and 80s still on the road - at least, relative to the ones which are parked up and not roadgoing. This doesn’t particularly surprise me - I regularly see old MGs and Triumphs. However, I didn’t expect them to be the most popular of the vintage classics.

Phoenixes

Now to focus on the second question - which old cars are having a comeback? I decided to call these phoenixes - those which died, but have come back.

Again, I was only interested in cars from the 20th century:

phoenix_data = data_pivot[(data_pivot['YearManufacture'] < 2000) & (data_pivot['Model'] != 'MISSING')].sort_values('Make', ascending=True)

And again, I got rid of MISSINGs.

How to find the phoenixes within this data? My idea uses the LS ratio. An LS ratio going from less than 1 (more SORN than Licensed) to greater than 1 (more Licensed than SORN) suggests that these SORN vehicles are being relicensed.

I started by looping through all the unique vehicles in the data:

for (make, genmodel, model, yearmanufacture), model_data in phoenix_data[['Make','GenModel','Model','YearManufacture','Year','Licensed','SORN']].groupby(['Make','GenModel','Model','YearManufacture']):

This takes the columns I care about, groups them, then iterates through. I included grouping by year of manufacture to see if a specific year is particularly coming back.

First thing within the loop is regrouping the data.

    model_data_grouped = model_data.groupby(['Make','GenModel','Model','YearManufacture','Year'], as_index=False).sum()

Then, as above, recreate the LS Ratio:

    model_data_grouped["LS_Ratio_Group"] = model_data_grouped["Licensed"] / model_data_grouped["SORN"]
    model_data_grouped["LS_Ratio_Group"] = model_data_grouped["LS_Ratio_Group"].replace([np.inf, -np.inf], np.nan)

Now to create the LS ratio pass-through-1 described above, which I called Switch:

    model_data_grouped['Switch'] = (model_data_grouped['LS_Ratio_Group'].transform(lambda x: (x > 1) & (x.shift(1) < 1)))

If the current LS Ratio is > 1, and the previous one is < 1, the Switch column is True; otherwise, it is False.

With the individual vehicle group data ready, I can check if it satisfies defined Phoenix criteria:

    if ((model_data_grouped['Switch'].sum() >= switch) 
        & (model_data_grouped['Licensed'] <= min_licensed_lt).any() 
        & (model_data_grouped[model_data_grouped['Year']==2023]['Licensed'] >= now_licensed_gt).any() 
        & (model_data_grouped[model_data_grouped['Year']==2023]['Licensed'] <= now_licensed_lt).any() 
        & (model_data_grouped['LS_Ratio_Group'][9] > (ls_ratio_multiplier * model_data_grouped['LS_Ratio_Group'].mean()))
        & (model_data_grouped['LS_Ratio_Group'] <= ls_ratio_lt).any()
        & (model_data_grouped['LS_Ratio_Group'] >= ls_ratio_gt).any()
        & (model_data_grouped['LS_Ratio_Group'][9] >= ls_ratio_2023_gt).any()
        ):

These are all controlled by the variables:

switch = 1  # set to 1 to enable, set to 0 to disable
min_licensed_lt = 10  # set very high to "deactivate"
now_licensed_gt = 50  # set to 0 to "deactivate"
now_licensed_lt = 999999  # set very high to "deactivate"
ls_ratio_multiplier = 1  # set to 0 to "deactivate"
ls_ratio_lt = 0.5  # set very high to "deactivate"
ls_ratio_gt = 0  # set to 0 to "deactivate"
ls_ratio_2023_gt = 9999  # set to 0 to "deactivate"

Going through these tests one by one:

Did the LS ratio pass through 1? → A True is equivalent to 1, so, when summing up the Switch column, any value greater than or equal to 1 implies there was at least one switch.
Were fewer than # ever licensed at any time? → Suggesting there was a period when they “died”
Were more than # licensed in 2023? → Suggesting they’re back
Were fewer than # licensed in 2023? → To avoid super-popular vehicles
Is the LS ratio for 2023 (relating to 9) more than (# times) the average LS ratio for that vehicle? → Suggesting a comeback
Was the LS ratio ever below a certain value? → Suggesting they “died”
Was the LS ratio ever above a certain value? → Suggesting they are/were popular
Was the LS ratio above a certain value in 2023? → Suggesting they are popular

Finally, print findings, and add the data to a list for easy checking later.

            print(f"🐦‍🔥? {make} / {genmodel} / {model} / {yearmanufacture}")
            phoenixes.append({f"{make} / {genmodel} / {model} / {yearmanufacture}" : model_data_grouped})

Based on the variables above, there were a number of potentials - here’s just a few:

🐦‍🔥? CHEVROLET GMC / CHEVROLET GMC CORVETTE / CORVETTE / 1975
🐦‍🔥? NORTON / NORTON COMMANDO / COMMANDO 850 / 1974
🐦‍🔥? TALBOT / TALBOT SUNBEAM / SUNBEAM LOTUS / 1981

Yeah, there are motorbikes in this data too! A quick search online and Sunbeams are going for £40k, so it does seem that this is turning up some old-but-expensive cars, as I was hoping to find.

If I want to see and/ot plot the data for an individual car I can get it from the phoenixes list. The key is the same as the printed statement, for easy copypaste. For example, the Talbot:

search_phrase = r'TALBOT / TALBOT SUNBEAM / SUNBEAM LOTUS / 1981'
phoenix_car = next((item for item in phoenixes if search_phrase in item), None)[search_phrase]
display(phoenix_car)
plots(phoenix_car)

Make	Model	YearManufacture	Year	Licensed	SORN	LS_Ratio_Group	Switch
TALBOT	SUNBEAM LOTUS	1981	2014	13	52	0.25	FALSE
TALBOT	SUNBEAM LOTUS	1981	2015	14	51	0.27451	FALSE
TALBOT	SUNBEAM LOTUS	1981	2016	14	58	0.241379	FALSE
TALBOT	SUNBEAM LOTUS	1981	2017	13	65	0.2	FALSE
TALBOT	SUNBEAM LOTUS	1981	2018	9	68	0.132353	FALSE
TALBOT	SUNBEAM LOTUS	1981	2019	11	70	0.157143	FALSE
TALBOT	SUNBEAM LOTUS	1981	2020	9	71	0.126761	FALSE
TALBOT	SUNBEAM LOTUS	1981	2021	11	73	0.150685	FALSE
TALBOT	SUNBEAM LOTUS	1981	2022	48	59	0.813559	FALSE
TALBOT	SUNBEAM LOTUS	1981	2023	60	54	1.111111	TRUE

The numbers aren’t huge, but the number of Licensed has risen from ~10 to ~60 in only a couple of years, and SORN numbers have dropped by almost 20. Quite a comeback! I am curious where that other ~20 came from - perhaps forgotten in a barn (so neither Licensed nor SORN), or perhaps they came from the MISSINGs. You can also see where the LS ratio crossed the magical 1.0 line, the same time the Licensed and SORN lines crossed.

For whose who don’t know, this is the Sunbeam:

talbot sunbeam

Image from https://classicsworld.co.uk/cars/talbot-sunbeam-lotus-road-test/

Different phoenixes

Of course, tweaks could be made to this search to get different results. For example:

Adjusting the variables
Add new checks, such as max number of SORN
Searching all manufacture years (by removing YearManufacture from the initial grouping)
Changing the cut-off year in the original data (for example, “modern classics”, vehicles pre 2010)

For example, by searching for cars with the LS ratio changeover and between 500 and 5000 Licensed in 2023 I found vehicles such as:

🐦‍🔥? LAND ROVER / LAND ROVER 88 / 88-4 CYL / 0
🐦‍🔥? TRIUMPH / TRIUMPH SPITFIRE / SPITFIRE / 0
🐦‍🔥? VOLKSWAGEN / VOLKSWAGEN 1300 / 1300 / 0

Searching for a 2023 LS ratio of over double the mean (vs 1x times in the original search), with over 500 Licensed vehicles in 2023, but at one time fewer than 500 included:

🐦‍🔥? MG / MG MGB / B / 1980
🐦‍🔥? MG / MG MGB / B GT / 1980

Vehicles where at one time the LS ratio was ≤0.5 (i.e. twice as many SORN as Licensed), but in 2023 the ratio is ≥2 (i.e. twice as many Licensed and SORN), with at least 20 currently Licensed (to avoid 0/1 → 1/0 etc), included a lot of classics, including:

🐦‍🔥? ALFA ROMEO / ALFA ROMEO 2000 / 2000 GT VELOCE / 1974
🐦‍🔥? JENSEN / JENSEN JENSEN-HEALEY / JENSEN-HEALEY / 1975
🐦‍🔥? LANCIA / LANCIA FULVIA / FULVIA / 1975
🐦‍🔥? PONTIAC / PONTIAC FIREBIRD / FIREBIRD / 1977
🐦‍🔥? RELIANT / RELIANT SCIMITAR / SCIMITAR GTC AUTO / 1980

You get the idea! As I’ve said, my script is available on GitHub (link at the top of this page), and the data is all on the DVLA website (again at the top of this page), so have a play yourself.

Some other classics

Given I had the data already processed, I also wanted to see the data for a few cars I know are classics. These didn’t show up in the searches for various reasons, mostly because the LS ratio never changed over - some have always had more Licensed than SORN; most have always had more SORN than Licensed (at least since 2014). A couple - the Supra, the Elise - did cross over, albeit in the wrong direction, so it would have failed the LS ratio > mean test. Regardless, some interesting trends:

Jaguar E-Type

Aston Martin DB5

Lotus Elise Series 1

Peugeot 205 GTi

Ford Escort RS Cosworth

Mercedes-Benz 190E (Cosworth)

BMW E30 M3

VW Golf GTI Mk1

Porsche 911 (pre-1998)

Toyota Supra

Mazda RX-7

Conclusion

This was fun! Using publicly-accessible data, transforming it, calculating new columns, plotting to observe trends, and then a custom search for vehicles fulfilling specific criteria.

Out of everything I’ve discovered, the old MGs and Triumphs really stood out. Several models have gone from more-SORN to more-Licensed, and they exist in high enough volumes that it’s actually feasible to buy one. Others - such as the Jensen or Lancia or Alfa - only had one or two for sale, or fewer, and they’re not cheap. Looking on Auto Trader, MGB GTs vary from £4k to £20, depending on condition; Triumph Stags are £10k to £30k. I’d be curious to see how the prices of these change over the coming years, especially with the move towards EVs - maybe there will be more demand for these small, retro British cars.

I went on to create a dashboard for this data using Power BI. You can see the post here: https://www.jamesgibbins.com/dvla-data-dashboard/

Analysing DVLA Data for Vintage Comebacks

Table of Contents