Laravel Upserts: How to batch import data efficiently

Released on 13 October 2020, Laravel 8.10 came with the addition of the new upsert() method on the database query builder (and therefore Eloquent based queries too). This allows you to perform "insert or update" queries far more efficiently - which will help when importing large batches of data.

A practical example of using upsert()

Up until now, the Laravel database query builder supported the insertion or updating of multiple rows (batches of data) at once via the insert() and update() methods. There was also a lesser known insertOrIgnore() method which would insert a batch of data, ignoring any errors (such as a unique key already existing).

We also had the updateOrCreate() method on Eloquent models which given a set of attributes, updated an existing model, or if it didn't exist, it created it. However, this could only be used on a single row / model at a time - if we had 125,000 rows to insert or update, we'd have to call that Eloquent method 125,00 times. Behind the scenes, this actually executes two queries in the database - one to check for the existing entry and another to perform the INSERT or UPDATE - so that would mean running 250,000 queries. That could be slow.

So let's imagine we have a CSV file containing data for 125,000 products that we needed to import into our database as efficiently as possible. No problem, we can use the insert() method on the query builder to insert multiple rows at the same time.

Fast forward to 4 weeks later, we have a new CSV file of product data - the details of the 125,000 products have been updated and 25,000 new products have been added. In our database we need to update the existing products and insert the new ones. To do this, we need to know which products we already have, and which are new so we can perform either an INSERT or an UPDATE query.

Rather than using the updateOrCreate() 150,000 times for each row in the CSV file, let's start by breaking down the CSV file into smaller batches, of say 5,000 products. That gives us 30 batches.

We can then pass each batch of product data to the the new upsert() method on the Illuminate\Database\Query\Builderclass, which for each row of data in the CSV file will either update an existing product in our database, or create a new one. This way we will only execute 30 queries rather than 300,000 if we'd used the updateOrCreate()method (remember, it's actually 2 queries for each use of it).

So how does upsert() work?

I'm going to assume you're using MySQL, although it does work with other database engines.

Following on from our example of a products database, simplified down for a batch of just 3 products, this is the SQL that would be run:

INSERT INTO products
(item_ref, title, price)
VALUES
(1, 'Large cake', 15.00),
(1, 'Medium cake', 10.00),
(1, 'Small cake', 5.00)
ON DUPLICATE KEY
UPDATE
item_ref = values(item_ref),
title = values(title),
price = values(price);

As you can see, the query starts with a normal INSERT statement and it will attempt to insert the first set of data values into the table. If that's successful, it will remove on to the next set. However, if that fails because the unique identifier (in our example, that's "item_ref") already exists (i.e the row of data is already in the table), then it will be caught by the ON DUPLCIATE KEY part of the query. The SQL following that, in this case an UPDATE statement, will then get run. This will update the existing database row with the set of values. Once complete, it will move onto the next set of data values.

What about the performance of upsert()?

To test the performance of the new upsert() method, I started a new Laravel project and created a simple artisan command that performs 3 tests - one using the updateOrCreate method, one using the upsert() method for singular rows, and one for using upsert() method with batches of rows (5,000 at a time).

For each test, an array of 125,000 "products" are written to an empty table - measuring the time it takes to complete. Every item in the original array of product data is then updated with a new price, and a further 25,000 products added. This updated raw data is then also written to the database whilst being timed.

I spun up a new server on Linode(with PHP 7.4 & MySQL 8) and ran the artisan command 5 times. I averaged out the times from each test, which ended up as follows:

Test 1 (updateOrCreate() method)

INSERT: 280.84 seconds
INSERT / UPDATE: 398.76 seconds

Test 2 (upsert() method - singular rows)

INSERT: 203.99 seconds
INSERT / UPDATE: 270.39 seconds

Test 3 (upsert() method - batched rows)

INSERT: 15.38 seconds
INSERT / UPDATE: 17.97 seconds

As expected the first tests with updateOrCreate() method was slowest. The second test, usingupsert() for each product one-at-a-time was a little quicker as it only executed 1 query for each product, rather than 2. Finally, the third test, using upsert() for batches (5,000) of products was vastly quicker than the other two tests. In my tests, it works out at ~95% quicker for bothINSERT queries and INSERT / UPDATE queries when compared to updateOrCreate().

That's great, how can I use upsert()?

The method is available of the Laravel database query builder, and the Eloquent query builder. It accepts 3 arguments:

upsert(array $values, $uniqueBy, $update = null)

$values

The first argument is an array of arrays containing the data to be inserted / updated - one child array for each Eloquent model / database row. For example:

$values = [
    [
        'item_ref'  => 1,
        'title'     => 'Large cake',
        'price'     => 15.00,
    ],
    [
        'item_ref'  => 2,
        'title'     => 'Medium cake',
        'price'     => 10.00,
    ],
    [
        'item_ref'  => 3,
        'title'     => 'Small cake',
        'price'     => 5.00,
    ]
];

If you're using upsert() on an Eloquent model that uses timestamps, the created_at and updated_at attributes will automatically be handled.

$uniqueBy

The second argument describes how to identify a unique row.

This could be the name of a single column (passed as a string, or an array containing a single string), or if you have a composite key (two or more columns used to uniquely identify a row), the names of multiple columns passed as an array. For example:

$uniqueBy = 'item_ref';
// OR
$uniqueBy = ['item_ref'];
// OR
$uniqueBy = ['item_type', 'item_size'];

All database engines except SQL Server require the columns in the $uniqueBy argument to have a "primary" or "unique" index.

$update

The third and final argument, which is optional, defines the columns that should be updated if a matching row already exists in the database. This allows you to include data in your $values argument that will be written only on an INSERT. By default all columns in your $values are updated, but here is an example if you do want to provide this argument:

$update = ['title', 'price'];

Wrapping up

I hope you've found this useful - if you do want to know more you can read the docs (although I've covered everything and more in this blog post) or you can review the original PRs on GitHub: #34698, #34712.

I'm @gbuckingham89 on Twitter if you have any questions about using upsert() or anything else Laravel related.