This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (
sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way - he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah - this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
We start with a recap about how we pass/retrieve data for external scripts. In Microsoft SQL Server R Services - sp_execute_external_script - I we discussed, among other things, the
@input_data_1 parameter and how it specifies the input data used by the external script in the form of a Transact-SQL query:
Code Snippet 1: Using @input_data_1 with straight SELECT
In Code Snippet 1 we see how the
@input_data_1 parameter contains a
SELECT statement against a table, and the statement executes during the
sp_execute_external_script execution. The dataset generated by the query is referred to in the script as
InputDataSet. The query has to be based on a
SELECT, but it does not have to be against a table, it can be against a view or a user-defined function as well.
NOTE: The code in Code Snippet 1 above, and Code Snippet 2 below, uses tables and data from the Microsoft SQL Server R Services - sp_execute_external_script - I post.
That is all well and good, but what if you want to push in multiple datasets? Seeing that the parameter we use to define the data ends with a
1, it would be logical to think there are
3, and so on - but no, that is not the case. To process more than one dataset in the external script, we have to pull the dataset(s) from inside the script. To do this, we can use
ODBC (in R it is the
RODBC package, in Python
pyodbc) or the highly optimized Microsoft
RevoScaleR package (in Python
revoscalepy). In the following example I use
Code Snippet 2: Multiple Data Sets using RevoScaleR
The code in Code Snippet 2 shows how we read in data within the script by the use of
RevoScaleR functionality together with the data from the
@input_data_1 parameter, and how we subsequently calculate
mean on the two datasets.
So above we see how we can use multiple datasets, but what if we for some or another do not want to / cannot connect from the script to the database, as per the question above? Well, that is what we cover in this post.
What we want to do is to push data from three completely different tables into the external script, and then in the script do something. We use
@input_data_1 to push one of the datasets, and we look at two different ways to push in the data from the other two tables. The tables we read data from are three system tables, this way we do not need to create separate tables and load data etc.:
What we want to do in the script is to use R/Python functionality to print out the number of rows that we push in from the separate tables, so we start with some code like this:
Code Snippet 3: Pushing Data via @input_data_1 in R
There is nothing strange in Code Snippet 3, we see how we do the
@input_data_1, and how we in the script:
- Assign the
InputDataSetto a variable:
- Use the R function
nrowto get the number of rows in the data frame (
- Print it out to the console.
When we execute the code in Code Snippet 3, the result is like so:
Figure 1: Input Data 1 Result
From Figure 1 we see that the dataset contains 1070 rows.
If we want to do this in Python the code looks like this:
Code Snippet 4: Pushing Data via @input_data_1 in Python
Nothing strange in Code Snippet 4 either. The one noteworthy thing is the use of
len(df.index), instead of
shape. I use
len(df.index) as “people in the know” says it performs better. When we execute the code in Code Snippet 4, we get the same result as we see in Figure 1.
Ok, so the code in the two code snippets above, (3 and 4), is the base code for what we want to do. Now we need to figure out how to push two more datasets into the scripts, and there are two ways to do that:
- Binary serialization.
Before we look at those two ways, let us discuss briefly a requirement for us to be able to do what we want, and that is the use of the
@params parameter in
@params parameter is an optional parameter, and when defined it contains a comma separated list of the definitions of all parameters embedded in the values for the
@input_data_1 and the
@script parameters. The string must be either a Unicode constant or a Unicode variable. Each parameter definition consists of a parameter name and a data type. The parameters defined in the
@params list need to be added as named parameters to the stored procedure, and in the case of parameters for the external script; the script references the parameters by name but without the
So, when we push data into the script, either by using JSON or binary serialization we define a parameter for the data which we then reference in the script.
NOTE: If you want to get the inner workings of the
@paramsparameter, have a look at my blog post: Microsoft SQL Server R Services - sp_execute_external_script - II.
Enough of this preamble, let us get going.
With the release of SQL Server 2016, Microsoft added support for JSON text processing. Microsoft added JSON functions to SQL Server, which enable you to analyze and query JSON data, transform JSON to relational format, and export SQL query results as JSON text. A typical query producing JSON text can look like this:
Code Snippet 5: Retrieve JSON Data
We see how I use the
FOR JSON AUTO syntax in Code Snippet 5 to indicate to SQL Server I want JSON formatted data as the resultset. You do not necessarily need to use
AUTO, as there are other options. To see more of this look here: Format Query Results as JSON with FOR JSON (SQL Server).
I limit the columns I select to get a more readable output. When I execute and click on the result I see:
Figure 2: JSON Result
So in Figure 2 we see how my
SELECT query results in JSON data.
To solve our problem how to push in multiple datasets, we can now use the JSON formatted data together with the
@params parameter to define two parameters containing JSON. For example,
Code Snippet 6: Python Push JSON
In Code Snippet 6 we see how we:
- Declare two variables:
@sysDbs, both of type
nvarchar(max), and how we load data into them.
- Declare two parameters in the
- Define the two parameters and assign
If we were to execute, all would work - but we have not done anything with the parameters in the script. What we need to do is to parse the incoming JSON text into a data frame somehow. To do this in Python, we use the
pandas package as it has various functions for parsing JSON.
pandaspackage makes it easy to work with relational data. To find out more about it, go to here.
Anyway, the function from the
pandas package we use is
Code Snippet 7: Python Parse JSON
The code in Code Snippet 7 shows how we:
- Import the
pandaspackage and give it an alias:
- Assign the
InputDataSetparameter to the
dfvariable as before.
We then do the “heavy lifting” (or rather
pandas does), where we transform the JSON text to data frames, by the use of
read_json. From the three data frames, we finally print out the number of rows per table. The result when we execute looks like so:
Figure 3: Result Python read_json
It all works! Oh, it just so happens that I am on an almost new SQL Server instance, and that is why the number of rows in
sys.databases is the same.
If R is your “poison of choice”, then we can, for example, use the
jsonlite package. You can read more about it here. What we use from
jsonlite is the
Code Snippet 8: Parse JSON in R
The code in Code Snippet 8, looks very similar to what is in Code Snippet 7, and in Code Snippet 8 we:
- Load the
- Assign the
InputDataSetparameter to the
dfvariable as before.
- Parse the JSON into data frames using
We have now seen how we, both in Python as well as R, can use JSON to push multiple datasets into external scripts, and - as I mentioned above - JSON is one way of doing it. Now onto the next.
When we use the binary serialization method of pushing multiple datasets, we use R and Python’s built-in functionality for serialization and deserialization. In Python, it is with the help of the
pickle module, and in R the
Initially, binary serialization looks somewhat more complicated than JSON (and it might be), especially since the deserialization happens against a binary parameter serialized with R or Python. In other words, we need to make a roundtrip to R/Python to get the binary representation of the data, so having a helper procedure to do this sounds like a good idea to me:
Code Snippet 9: R Serialization Helper
In Code Snippet 9 we see an R serialization helper procedure:
- It takes a query as an input parameter (
- It has an output parameter which is the serialized result set.
The body of the procedure is a call to
sp_execute_external_script, where the
@query parameter acts as
@input_data_1, and we have a defined output parameter
@serRes. In the script, we call the R
serialize function on the pushed in dataset, and assigns it to the output parameter. The flow:
- We pass in a query statement.
- During the call to
sp_execute_external_scriptthe query is run, and the resulting data set passed into the external script.
- The external script serializes the dataset and passes it back out as an output parameter.
The equivalent Python serialization helper looks like so:
Code Snippet 10: Python Serialization Helper
We see in Code Snippet 10 how we bring in the
pickle module, and then serialize the dataset with the
The equivalent of Code Snippet 7 using the Python serialization helper looks like so:
Code Snippet 11: Implementation of Python Serialization
In Code Snippet 11 we see how we:
- Declare the variables for the queries as well as the serialized results of the queries.
- Execute the helper procedure for the two queries we want the results serialized for.
- Have defined two
varbinary(max)parameters in the
- Assign the serialized values to those two parameters.
sp_execute_external_scriptand send in the two serialized results as well as
sp_execute_external_scriptwe deserialize the results using
pickle.loads, or rather
pis the alias for
The code for an R implementation looks almost the same except that we call the R
unserialize function instead of
pickle.loads, as per the code below:
Code Snippet 12: Implementation of R Serialization
We see in Code Snippet 12 how we push in one resultset, and we do not use
@input_data_1. Instead, we serialize the resultset from the query with the R helper procedure and then deserialize with the
So, we have now seen two ways of pushing in datasets to an external script: JSON and Binary Serialization. Which should you choose - I mean, JSON seems a lot easier? The answer comes down to performance.
To look at performance let us create a database, a table and some data:
Code Snippet 13: Database with Table and Data
We see above, in Code Snippet 13, how we create a database, with a table and we load one million records into the table. The data is entirely random and fairly useless, but it serves its purposes.
NOTE: If you code along and use the code above, please also create
dbo.pr_SerializeHelperPy, which you see in Code Snippet 10, in the
Having created the database and the objects, let us look at the code we use to compare performance. The code below is for JSON:
Code Snippet 14: JSON Performance Code
And here is the binary serialization code:
Code Snippet 15: Binary Serialization Performance Code
The code in snippets 14, and 15 is a variant of what we have seen so far. In these two snippets, we only push in one table, and we look at the time it takes to do it. We see how we
dbo.tb_Rand1M, and initially, we do a
When I highlight both code snippets and run them a couple of times to not incur “startup” costs the results are:
Figure 4: Performance 100 Rows
That’s quite interesting; we see in Figure 4 how JSON serialization is around twice as fast as binary serialization. Ok, what if we did it on 1000 rows? With 1000 rows JSON is still about twice as fast. However, when we tun it against the full table (one million rows), the results are different:
Figure 5: Performance a Million Rows
In Figure 5 we see how with a million rows, the binary serialization is about 3.5 times faster than JSON. There are a couple of reasons why binary serialization performs better that JSON with larger datasets:
- Binary serialization is more compact, less data to transfer.
- By using the
@input_data_1parameter to push in the data to the serialization we get a better performing transport. Read more about it in my post: Microsoft SQL Server R Services - Internals XIV.
So, for very small datasets, use the JSON method, but for larger datasets, the binary serialization is always preferred. Another thing to keep in mind is if you use both
@input_data_1 as well as pushing in serialized data (like in all our code snippets where we used three tables), try to use
@input_data_1 for the biggest dataset. That way you get the better performing transport and also, potentially, parallel execution of the query.
We have now seen how we can push in multiple datasets to an external script, without having to connect from the script back to the database. We can use two methods:
- Binary serialization.
When we use JSON we utilize SQL Server’s JSON capabilities to execute a query and receive the result as JSON formatted text:
SELECT ... FROM ... FOR JSON AUTO. In the external script we then deserialize the JSON using:
- In R the
fromJSONfunction in the
- In Python the
read_jsonfunction in the
When we do binary serialization we use R/Python’s capabilities to both serialize as well as deserialize data. This means we need to do a roundtrip to R/Python to serialize the data. The tools we use:
- In R we call
- In Python we use functions from the
picklepackage. To serialize we call
dumpsand to deserialize we use
What method to use (JSON or binary serialization) comes down to, in my mind, performance. For very small datasets JSON is faster, but as soon as the dataset gets bigger binary serialization outperforms JSON by order of magnitude.
If you have comments, questions etc., please comment on this post or ping me.
comments powered by Disqus