Salesforce is working on a new feature called Data Pipeline that lets us integrate Apache Pig into Salesforce. The previous link gives you a lot of information about what Apache Pig is, but basically it is an open source technology that gives you a mechanism for processing MapReduce jobs in parallel on a Hadoop cluster.
There are 2 main keywords here:
- MapReduce: a software framework for writing programs that process large amounts of unstructured data in parallel.
- Hadoop Cluster: a special type of cluster that helps you store and analyze large amounts of unstructured data.
So, in the end, we can say that:
Data Pipeline will help you run processes over large amounts of data in parallel in Salesforce.
OK, I know what you are thinking right now. It sounds interesting, but I can get the same with other Salesforce features like @future, Queueable or Batch Apex. And that’s right. But the main point here is that performance is much better: instead of waiting maybe an hour for an asynchronous execution to complete, Data Pipeline provides results in a few minutes.
And how can we use it in Salesforce? We only need to create a single Pig script, written in the Pig Latin language (forget Apex), in the Developer Console. Once we have it, we click the Submit Data Pipeline button to execute the process.
If you already know Apache Pig, you probably know how to use Pig Latin, and this pilot will be quite simple for you. If that is not your case, I would like to start with 2 examples.
The first one will show you how to insert records without running into an unexpected situation caused by a validation rule.
The second one will help us create records related by a Master-Detail relationship.
1st Use Case – How to create records without breaking Salesforce rules
I have a new custom object called Header__c and I need to create some records using Opportunity records as the source. The Pig script will look like this one. It is something simple: I retrieve some fields from the Opportunity object and, after filtering by Opportunity Name, store the result as a new Header custom object record.
As you can see, the result is successful.
And this is the new record in the list view.
But what happens if the Header object has a validation rule that blocks any insert when the amount is not greater than 100? This is the validation rule:
And if we run the same script, but using Opportunity2 as the source, this is what the Developer Console returns.
Ok, what is going on? The process finishes successfully. And what about the list view?
Same as before. Just a single line. So how can we explain this?
We need to remember that Data Pipeline processes records in batches, so if a record cannot be inserted during the script execution, the process continues with the rest and still finishes successfully. We could compare it with Batch Apex, where the Apex Jobs screen shows that an execution has finished without errors even if some records were not inserted because of validations that we have added in our code.
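To make that behaviour easier to picture, here is a minimal sketch in plain Python. It is not Pig Latin and it does not call any Salesforce API; it only models the idea of a batch that keeps going when a record is rejected. The `JobResult` type, the function names, the `Amount` key and the sample amounts are all assumptions made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JobResult:
    """Outcome of a simulated batch run (hypothetical type, for illustration)."""
    status: str = "Completed"
    inserted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def passes_validation(record):
    # Stand-in for the Header__c validation rule: amount must be greater than 100.
    return record["Amount"] > 100


def run_batch_insert(records):
    """Try every record; a rejected record never changes the job status."""
    result = JobResult()
    for record in records:
        if passes_validation(record):
            result.inserted.append(record)
        else:
            # The failure is set aside and the loop moves on to the next record.
            result.rejected.append(record)
    return result


# Made-up sample data: only the first record satisfies the rule.
opportunities = [
    {"Name": "Opportunity1", "Amount": 500},
    {"Name": "Opportunity2", "Amount": 50},
]

result = run_batch_insert(opportunities)
print(result.status, len(result.inserted), len(result.rejected))
```

The job status stays at "Completed" even though one record was rejected, which is the same thing we saw in the Developer Console: a successful run and no new line in the list view.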
And how can we avoid this situation and not break the rules? Imagine this use case. This time we have 2 opportunities with the same name, but one of them has an amount of less than 100.
If I execute the script above, filtering only by the new Opportunity Name, just one of them will be inserted.
But do we have a nicer and cleaner way to do it? We can also filter by amount, so we avoid hitting the validation rule:
This gives the result below.
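The idea of filtering before storing can be sketched in plain Python as well. Again, this is only a model of the logic, not the Pig script itself: `select_for_store`, the `Name` and `Amount` keys and the sample records are hypothetical, and the threshold of 100 is taken from the validation rule described above.

```python
def select_for_store(opportunities, name, min_amount=100):
    """Keep only records with the given name whose amount passes the rule."""
    return [
        opp for opp in opportunities
        if opp["Name"] == name and opp["Amount"] > min_amount
    ]


# Made-up sample data: same name, one amount below the threshold.
opportunities = [
    {"Name": "Opportunity3", "Amount": 250},
    {"Name": "Opportunity3", "Amount": 80},
    {"Name": "Other", "Amount": 900},
]

to_store = select_for_store(opportunities, "Opportunity3")
print(to_store)
```

Because the record that would break the rule never reaches the store step, nothing is rejected silently: what we send is exactly what gets inserted.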
2nd Use Case – Create a Master-Detail relationship
Another common situation is the use case where we need to create two records related by a Master-Detail relationship.
After the code examples above we might think that this is not a big deal, but we have to keep in mind that Pig scripts do not provide error handling in the current release. If for any reason the header is not created and we try to create lines related to it, the process can fail without our knowledge (as we showed before).
So how can we solve this situation? Salesforce’s advice is to create 2 scripts: after running the one that creates Headers, we run another one that creates records of my new custom object called Line__c.
What can we find in the script above?
- First of all, we look for all the Headers that we have in the organization right now.
- We look for all the Opportunity Lines we have in the organization as well.
- We execute a Join by Opportunity Id, as both Header__c and OpportunityLineItem have this field.
- Then I narrow down the joined result, because we don’t need all the fields for my new Line__c records.
- Finally, we Store (save) the information in the custom object.
And this is the result:
Yes, it is that simple. Something to keep in mind? List the fields on Store in the same order as they are defined in the variable that we will use to populate them.
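If it helps, the steps of the second script and the field-order rule can be modelled in plain Python. This is a sketch of the data flow only, not Pig Latin and not a Salesforce API. The `Opportunity__c` lookup on Header__c, the Line__c field names in `STORE_FIELDS` and the sample Ids are assumptions for the example; `OpportunityId`, `Quantity` and `UnitPrice` are standard OpportunityLineItem fields.

```python
# Made-up sample data standing in for the two queries.
headers = [
    {"Id": "H1", "Opportunity__c": "O1"},
    {"Id": "H2", "Opportunity__c": "O2"},
]
line_items = [
    {"OpportunityId": "O1", "Quantity": 2, "UnitPrice": 10.0},
    {"OpportunityId": "O1", "Quantity": 1, "UnitPrice": 99.0},
    {"OpportunityId": "O3", "Quantity": 5, "UnitPrice": 1.0},  # no Header for O3
]


def join_by_opportunity(headers, line_items):
    """Inner join: pair each line item with the Headers of its Opportunity."""
    headers_by_opp = {}
    for header in headers:
        headers_by_opp.setdefault(header["Opportunity__c"], []).append(header)
    joined = []
    for item in line_items:
        for header in headers_by_opp.get(item["OpportunityId"], []):
            joined.append((header, item))
    return joined


def project(joined):
    """Keep only the values needed for Line__c, in a fixed order."""
    return [(h["Id"], item["Quantity"], item["UnitPrice"]) for h, item in joined]


# Hypothetical Line__c fields, listed in the same order as the projected values.
STORE_FIELDS = ("Header__c", "Quantity__c", "UnitPrice__c")

rows = project(join_by_opportunity(headers, line_items))
# Values are matched to field names by position, so the two orders must agree.
lines = [dict(zip(STORE_FIELDS, row)) for row in rows]
print(lines)
```

Swapping two names in `STORE_FIELDS` without swapping the projected values would quietly put quantities into the price field, which is why the order matters.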
But looking at this code, we might think: why not do it all in the same script? Remember that the process is asynchronous, so by the time we query Header__c records, they may not have been inserted in the organization yet.
Anything to highlight?
- Salesforce Data Pipeline is still in Pilot, so in order to use it you need to ask Salesforce to give you access.
- When will it be GA? The latest news I have is Summer ’16 but, as always, Safe Harbor.
- Cost? Price? Yes, it will have an extra cost on top of the licenses, but I don’t have the information to tell you how much.
- Salesforce says that these executions will be faster than other asynchronous processes like Batch Apex or @future.
- Running Apache Pig inside Salesforce lets us do this execution in a multi-tenant environment. However, if we run a process in Apache Pig outside of Salesforce, executions are for a single user.
- As we are not using Apex, we will not hit governor limits.
- We will be able to add these scripts to packages so our customers can also run these processes.
- We cannot create scripts with more than 20 operators, 20 loads and 10 stores.
- Although we will not hit governor limits, we need to keep in mind that Store (Insert / Update) is done via the Bulk API, so its restrictions still apply.
- Unfortunately, we cannot call Pig scripts from Apex. We can only run them from the Developer Console and, from API 33.0 onwards, during deployment.
- It doesn’t offer a debug tool, so if, as in the first use case, a record is not inserted, it will not be easy to find the reason quickly.
- You cannot handle errors easily; that is, you cannot add something like try / catch to your code.
But this is not all. If you want to know more about Data Pipeline, you can also look at the presentation that Carolina Ruíz and I gave at Dreamforce ’15. Look for the video Data Pipelines: Big Data Meets Salesforce.