Process billions of records with Async SOQL

Christmas is getting closer, but I would like to deliver this last post before the end of the year.

A few months ago, I wrote an entry about BigObjects, a feature that became Generally Available in Winter ’18, and explained it through a use case. Now I would like to follow that post with a new way to create records: Async SOQL.

Async SOQL is partly GA as of Winter ’18 and partly still in Pilot.

That use case was about moving Code Review custom object records into Code Review History big object records in order to free up custom object storage. We want to archive records, and Async SOQL will help us deal with large amounts of data.

What is Async SOQL?

Basically, Async SOQL allows you to run SOQL in the background, so you get the response after a period of time, but in exchange it lets you deal with millions or even billions of records without hitting timeouts or governor limits.

How can I execute Async SOQL?

Async SOQL is implemented as a RESTful API, and in order to execute a SOQL query in the background we need to make a POST request:

[Screenshot: POST request to the Async SOQL REST endpoint]
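As a rough reference (a sketch, not the exact call from the screenshot; the API version in the path will vary with your org), the request looks like this:

    POST /services/data/v41.0/async-queries/
    Host: yourInstance.salesforce.com
    Authorization: Bearer <session Id>
    Content-Type: application/json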

And provide a body in JSON format in order to make the call. Below is a really simple example that creates Vendor__c custom object records based on Accounts.
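A body along these lines does the job (a sketch based on the keys explained below; double-check the exact key names against the Async SOQL documentation for your API version):

    {
      "query": "SELECT Name FROM Account",
      "operation": "insert",
      "targetObject": "Vendor__c",
      "targetFieldMap": { "Name": "VendorName__c" },
      "targetValueMap": { "Country__c": "Spain" }
    }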

[Screenshot: JSON body of the Async SOQL request]

What can we highlight?

query: allows you to define the SOQL query. In our case, the object I'm going to read is Account and the field I want to retrieve is Name.

operation: defines what we want to do, insert or upsert (please check the BigObjects entry to understand the primary key and upsert, as it works the same way here).

Basically, that is the main difference between standard SOQL and Async SOQL: with this new feature we can read and create or update records in one go.

targetObject: the object where the new records will be created. In our case, Vendor__c.

Then we need to specify where the record field values come from and where we want to store them. For that we have two more keys in the JSON body.

targetFieldMap: allows you to map source fields to target fields

and

targetValueMap: allows you to map a target field to a literal value; for instance, we want all records to have Spain as the Country__c field value.

What does the response look like?

When we make the POST call, we get a response similar to this one:

[Screenshot: JSON response of the Async SOQL POST call]
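Roughly, the response body looks like this (a sketch; the jobId value is made up for illustration):

    {
      "jobId": "08PD700000000IVMAY",
      "message": "",
      "status": "New",
      "query": "SELECT Name FROM Account",
      "targetObject": "Vendor__c",
      "operation": "insert"
    }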

It is also in JSON format and basically echoes the information you passed in the POST call. Three keys are worth highlighting:

jobId: remember this is an asynchronous call, so there is an Id for the background job. You can also reference it in the POST call via the global variable $Job_Id as part of a value in targetValueMap.

status: tells you where the job is, moving from New to Running, Complete, Failed, etc.

message: before executing the query, the request is analyzed. If a possible issue is detected, the job doesn't start and an error message is returned in this field. For instance, in the message below, my VendorName__c field was too short to store all Account Names.

[Screenshot: error message returned because the VendorName__c field was too short]

How can I stop the execution?

Async SOQL doesn't provide a UI in Salesforce the way other background jobs like Batch Apex or Queueable do, but we can make an HTTP DELETE call, passing the jobId as part of the URL.

[Screenshot: HTTP DELETE call to cancel the job]
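Something along these lines (a sketch, using the made-up jobId from before):

    DELETE /services/data/v41.0/async-queries/08PD700000000IVMAY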

How can I check the progress?

As before, since we do not have a UI, we need another way to get this information. For that, we have two options:

1st – Make a GET call: similar to the cancel action, if we pass the jobId as part of the URL and make a GET call, the response shows information about the background execution that is running in the system.

[Screenshot: HTTP GET call returning the job information]
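Again as a sketch, the call is simply:

    GET /services/data/v41.0/async-queries/08PD700000000IVMAY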

2nd – Run a SOQL query against the BackgroundOperation object: this is a new object where we can see extra information about the job execution.

[Screenshot: SOQL query against the BackgroundOperation object]
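For example, a minimal query could be (check any field beyond Id, Name and Status against the object reference for your org):

    SELECT Id, Name, Status FROM BackgroundOperation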

But this is not the only new object in the system. We also have BackgroundOperationResult, which helps you identify any issue during the execution if the result is not the expected one. For instance, the image below shows an issue related to a field that is required while the source value is empty. In that case, a new error is logged in this table, but the execution doesn't stop; it continues to the end.

Just keep in mind that this information is removed after 7 days.

[Screenshot: error rows logged in BackgroundOperationResult]
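A minimal sketch of such a query (Message and ParentId are assumed field names here; verify them against the object reference for your org):

    SELECT ParentId, Message FROM BackgroundOperationResult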

And what about Archiving?

Yes, you are right. I promised to show you how to archive Code Review records, but first I wanted to explain the whole functionality with a really simple use case. Now it's time to move to the Code Review History use case.

As I mentioned before, we would like to create Code Review History records by reading Code Review records. As before, we make a POST call, but this time the body looks like this one:

[Screenshot: JSON body to create Code Review History records from Code Review records]

Note: in the BigObjects post, the object was called Test1CodeReviewHistory__b instead of CodeReviewHistory__b.
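The body follows the same pattern as the Vendor__c example; a sketch could look like this (the CodeReview__c and CodeReviewHistory__b field names here are hypothetical, just for illustration):

    {
      "query": "SELECT Name, Reviewer__c, Score__c FROM CodeReview__c",
      "operation": "insert",
      "targetObject": "CodeReviewHistory__b",
      "targetFieldMap": {
        "Name": "Name__c",
        "Reviewer__c": "Reviewer__c",
        "Score__c": "Score__c"
      }
    }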

And … that’s all. Simple, isn’t it?

Can I integrate this functionality into Apex?

Yes, of course. In the end we are just making HTTP calls, so we only need to set the proper URL, add it to Remote Site Settings and create a string in JSON format for the body.

[Screenshot: Apex code making the Async SOQL HTTP call]
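As a sketch (the endpoint version and field names are the same assumptions as above), the callout could look like this:

    // Minimal sketch of an Async SOQL call from Apex.
    // Assumes the instance URL has been added to Remote Site Settings.
    HttpRequest req = new HttpRequest();
    req.setEndpoint(URL.getSalesforceBaseUrl().toExternalForm()
        + '/services/data/v41.0/async-queries/');
    req.setMethod('POST');
    req.setHeader('Authorization', 'Bearer ' + UserInfo.getSessionId());
    req.setHeader('Content-Type', 'application/json');
    req.setBody('{"query": "SELECT Name FROM Account", ' +
        '"operation": "insert", "targetObject": "Vendor__c", ' +
        '"targetFieldMap": {"Name": "VendorName__c"}, ' +
        '"targetValueMap": {"Country__c": "Spain"}}');
    HttpResponse res = new Http().send(req);
    System.debug(res.getBody()); // contains the jobId and status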

Summary

Finally, I would like to sum up some of the key concepts we have talked about.

  1. Async SOQL allows you to run SOQL in the background
  2. It takes some time, but it allows you to process millions or even billions of records
  3. You do not need to worry about governor or timeout limitations
  4. It is implemented as a RESTful API
  5. You can make as many calls as you want per day, but only one runs at a time
  6. You can read and create records in one go. Delete is out of scope
  7. This feature is part GA and part still in Pilot:
    1. Reading standard or custom objects and creating standard, custom or big object records is in Pilot
    2. Reading big objects and creating standard, custom or big object records is GA

[Screenshot: summary of the Async SOQL GA and Pilot scope]

Data Pipeline and Salesforce – Do not break the rules


Salesforce is working on a new feature called Data Pipeline that integrates Apache Pig into Salesforce. The previous link gives you plenty of information about what Apache Pig is, but basically it is an open source technology that provides a mechanism for the parallel processing of MapReduce jobs in a Hadoop cluster.

Here we can find 2 main keywords:

  • MapReduce: a software framework for writing programs that process large amounts of unstructured data in parallel.
  • Hadoop cluster: a special type of cluster that helps you store and analyze large amounts of unstructured data.

In short, we can say that:

Data Pipeline will help you execute processes over large amounts of data in parallel inside Salesforce.

OK, I know what you are thinking right now: it sounds interesting, but I can get the same with other Salesforce features like @future, Queueable or Batch Apex. And that's right. The main point here is that performance is much better: instead of having to wait maybe an hour for an asynchronous execution to complete, Data Pipeline provides results in a few minutes.

And how can we use it in Salesforce? We only need to create a Pig script using the Pig Latin language (forget Apex) via the Developer Console, and once we have it, click the Submit Data Pipeline button to execute the process.

[Screenshot: Developer Console with the Submit Data Pipeline button]

If you already know Apache Pig, you probably know how to use Pig Latin and this pilot will be quite simple for you. If that is not your case, I would like to start with two examples.

The first one shows how to insert records while avoiding unexpected results caused by a validation rule.

The second one creates records related by a Master-Detail relationship.

1st Use Case – How to create records without breaking Salesforce rules

I have a new custom object called Header__c and I need to create some records using Opportunity records as the source. The Pig script looks like the one below: something simple, where I retrieve some fields from the Opportunity object, filter by Opportunity Name and store the result in new Header__c records.

[Screenshot: Pig script that reads Opportunity fields and stores them in Header__c]
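A rough Pig Latin sketch of that script (the ForceStorage load/store function, the force:// URIs and the field names are assumptions based on the pilot, so treat them as illustrative only):

    -- load some Opportunity fields, filter by name and store them as Header__c records
    opps = LOAD 'force://soql/SELECT Name, Amount, CloseDate FROM Opportunity'
           USING gridforce.hadoop.pig.loadstore.func.ForceStorage()
           AS (Name:chararray, Amount:double, CloseDate:chararray);
    filtered = FILTER opps BY Name == 'Opportunity1';
    STORE filtered INTO 'force://entity/Header__c'
          USING gridforce.hadoop.pig.loadstore.func.ForceStorage();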

As you can see, the result is successful.

[Screenshot: successful execution result in the Developer Console]

And this is the new record in the list view

[Screenshot: the new Header__c record in the list view]

But what happens if the Header object has a validation rule that prevents any insertion when the amount is not greater than 100? This is the validation rule:

[Screenshot: validation rule requiring the amount to be greater than 100]

And if we run the same script, but using Opportunity2 as the source, this is what the Developer Console returns.

[Screenshot: Developer Console output for the second execution]

Ok, what is going on? The process finishes successfully. And what about the list view?

[Screenshot: the Header__c list view, still showing a single record]

Same as before, just a single row. So how can we explain this?

We need to remember that Data Pipeline processes records in batches, so if there is a failure during the script execution and a record cannot be inserted, the process continues with the rest and finishes successfully. We can compare it with Batch Apex, where the Apex Jobs screen shows that an execution finished without errors even if some records were not inserted because of validations we added in our code.

So how can we avoid this situation and not break the rules? Imagine this use case: this time we have two opportunities with the same name, but one of them has an amount below 100.

[Screenshot: two Opportunities with the same name, one with an amount below 100]

If I execute the above script, filtering only by the new Opportunity Names, just one of them will be inserted.

[Screenshot: execution result, only one record inserted]

But is there a nicer and cleaner way to do it? We can also filter by amount, so we avoid hitting the validation rule:

[Screenshot: Pig script filtering by amount]
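In the sketch above, that just means extending the filter; for instance (the opportunity name is made up):

    filtered = FILTER opps BY Name == 'Opportunity3' AND Amount > 100;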

Getting the result below:

[Screenshot: execution result after filtering by amount]

2nd Use Case – Create a Master-Detail relationship

Another common situation is the use case where we need to create two records related by a Master-Detail relationship.

With the above code examples we might think this is not a big deal, but we have to keep in mind that Pig scripts don't provide error handling in the current release, and if for any reason a header is not created and we try to create lines related to it, the process can fail without our knowledge (as we showed before).

So how can we solve this? Salesforce's advice is to create two scripts, so after running the one that creates the Headers, we run another one that creates records of my new custom object called Line__c.

[Screenshot: Pig script that joins Header__c and OpportunityLineItem and stores Line__c records]

What can we find in the above script?

  • First of all, we load all the Headers that exist in the organization right now.
  • Then we load all the Opportunity Line Items in the organization as well.
  • We execute a join by Opportunity Id, as both Header__c and OpportunityLineItem have this field.
  • Then we trim the joined result, because we don't need all the fields for the new Line__c records.
  • And finally we Store (save) the information in the custom object (see the sketch after this list).
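A rough Pig Latin sketch of this second script (again, the loader, the force:// URIs and the field names are assumptions for illustration):

    -- load the headers and the opportunity line items
    headers = LOAD 'force://soql/SELECT Id, Opportunity__c FROM Header__c'
              USING gridforce.hadoop.pig.loadstore.func.ForceStorage()
              AS (HeaderId:chararray, OppId:chararray);
    lines   = LOAD 'force://soql/SELECT OpportunityId, Quantity, UnitPrice FROM OpportunityLineItem'
              USING gridforce.hadoop.pig.loadstore.func.ForceStorage()
              AS (OppId:chararray, Quantity:double, UnitPrice:double);
    -- join both relations by Opportunity Id
    joined  = JOIN headers BY OppId, lines BY OppId;
    -- keep only the fields Line__c needs, in the same order as the target fields
    result  = FOREACH joined GENERATE headers::HeaderId, lines::Quantity, lines::UnitPrice;
    STORE result INTO 'force://entity/Line__c'
          USING gridforce.hadoop.pig.loadstore.func.ForceStorage();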

And this is the result:

[Screenshot: the new Line__c records]

Yes, that simple. Something to keep in mind? List the fields on the Store target in the same order as they are defined in the variable used for the population.

Looking at this code, we might ask: why not do it all in a single script? Remember that the process is asynchronous, so by the time we query the Header__c records, they may not have been inserted in the organization yet.

Anything to highlight?

  • Salesforce Data Pipeline is still in Pilot, so in order to use it you need to ask Salesforce to give you access.
  • When will it be GA? The latest I heard is Summer ’16, but as always, Safe Harbor.
  • Cost? Yes, it will have an extra cost on top of the licenses, but I don't have the information to tell you how much.
  • Advantages?
    • Salesforce states that these executions will be faster than other asynchronous processes like Batch Apex or @future.
    • Running Apache Pig inside Salesforce lets us execute in a multi-tenant environment, whereas if we run a process in Apache Pig outside of Salesforce, executions are for a single user.
    • As we are not using Apex, we will not hit governor limits.
    • We will be able to add these scripts into packages so our customers can also run these processes.
  • Disadvantages?
    • We cannot create scripts with more than 20 operators, 20 loads and 10 stores.
    • Although we will not hit governor limits, we need to keep in mind that Store (insert/update) is done via the Bulk API, so its restrictions still apply.
    • Unfortunately we cannot call Pig scripts from Apex, only from the Developer Console and, from API 33.0 onwards, during deployment.
    • It doesn't offer a debugging tool, so if we hit the first use case and a record is not inserted, it is not easy to find the reason quickly.
    • You cannot handle errors easily; that is, you cannot add anything like try/catch to your code.

But this is not all. If you want to know more about Data Pipeline, you can also watch the presentation that Carolina Ruíz and I gave at Dreamforce ’15; look for the video Data Pipelines: Big Data Meets Salesforce.