This is the 22nd post in a series about Microsoft SQL Server R Services. This journey started about a year ago, when I thought it would be a good idea to write a couple of blog posts about what then went under the name Microsoft SQL Server 2016 R Services. Never in my wildest dreams did I think I would write over 20 posts.
NOTE: Since the introduction of Microsoft SQL Server R Services, it has been renamed to Microsoft SQL Server Machine Learning Services.
To see other posts (including this) in the series, go to SQL Server R Services.
This series began with a discussion about how to install Microsoft SQL Server R Services and continued with the internals of how it works in SQL Server, together with the launchpad service. Throughout the series, we have used sp_execute_external_script when investigating what goes on. So I thought that a couple of posts looking at sp_execute_external_script would be a good way to finish this series off, and this post is the first of those three.
Before we “dive” into today’s topics, let us look at the code and the tools we use today. This section is here for those of you who want to follow along with what we are doing in the post.
To help us figure out the things we want, we use Process Monitor, and WireShark:
- Process Monitor is used to filter TCP traffic. If you want to refresh your memory about how to do that, we covered it in Internals - X as well as in Internals - XIV.
- WireShark is used for network packet sniffing. If you want a refresher on WireShark, we covered the setup and so forth in Internals - X.
The code below sets up the database and creates some tables with some data. The data is based on the Iris dataset:
Code Snippet 1: Creating Database Objects and Generating Data
The code in Code Snippet 1 should be fairly self-explanatory. The reason we split up the full Iris dataset is so that we can handle multiple datasets later in this post.
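As a rough sketch, setup code along these lines would create the tables used throughout the post. The table names match those referenced later in the post; the database name, column names and the sample rows are assumptions:

```sql
-- Sketch only: the table names are those used later in the post;
-- the database name, column names and sample values are assumptions.
CREATE DATABASE IrisDb;
GO
USE IrisDb;
GO
CREATE TABLE dbo.tb_irisdata_full (
  SepalLength float, SepalWidth float,
  PetalLength float, PetalWidth float,
  Species varchar(50)
);
-- the tables used later on when we look at multiple datasets
CREATE TABLE dbo.tb_irisdata_even (
  SepalLength float, SepalWidth float,
  PetalLength float, PetalWidth float,
  Species varchar(50)
);
CREATE TABLE dbo.tb_irisdata_uneven (
  SepalLength float, SepalWidth float,
  PetalLength float, PetalWidth float,
  Species varchar(50)
);
-- a couple of sample rows from the Iris dataset
INSERT INTO dbo.tb_irisdata_full
VALUES (5.1, 3.5, 1.4, 0.2, 'setosa'),
       (7.0, 3.2, 4.7, 1.4, 'versicolor');
```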
The first thing to do is to remind ourselves why we need sp_execute_external_script at all. Well, as opposed to SQLCLR, the external script engine is not embedded in SQL Server; it sits outside of SQL Server and runs in its own process - it is, (wait for it…), external :). So, since the external engine runs outside of SQL Server, there needs to be a bridge between SQL Server and the engine - and sp_execute_external_script is that bridge.
In most of the blog posts in this series we have used sp_execute_external_script something like so (or with variants thereof):
Code Snippet 2: Use of sp_execute_external_script
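A typical call, along the lines of what the snippet shows, might look like this (the R body is illustrative):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    iris_dataset <- InputDataSet
    print(head(iris_dataset))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```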
When we compare Code Snippet 2 with the full signature of sp_execute_external_script, as in Code Snippet 3, we see that we have used sp_execute_external_script in a somewhat simplistic way:
Code Snippet 3: Full Signature of sp_execute_external_script
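For reference, the full signature as documented at the time looks roughly like this:

```sql
sp_execute_external_script
    @language = N'language',
    @script = N'script'
    [ , @input_data_1 = N'input_data_1' ]
    [ , @input_data_1_name = N'input_data_1_name' ]
    [ , @output_data_1_name = N'output_data_1_name' ]
    [ , @parallel = 0 | 1 ]
    [ , @params = N'@parameter_name data_type [ OUT | OUTPUT ] [ ,...n ]' ]
    [ , @parameter1 = 'value1' [ OUT | OUTPUT ] [ ,...n ] ]
```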
In a majority of the blog posts, the only parameters used are @script and sometimes @input_data_1, but as we see in Code Snippet 3, there are more:
So, what are these parameters' intended uses? Let us find out.
By now we should know what the @language parameter is for: it tells the launchpad service which launcher dll to initialise, which in turn targets the correct external engine. At the time of writing, there are two external engines we can use, R and Python, and I would expect Microsoft to add more engines over time. For my part, I would love to be able to call into .NET outside of the SQL engine.
Figure 1: External Script and Language
Figure 1 illustrates how the launchpad service chooses which launcher dll to launch based on the @language parameter.

NOTE: Microsoft SQL Server R Services - Internals II covers more about the launchpad service and the different types of launchers.

The @language parameter is a required parameter.
The @script parameter defines the code the external engine executes, and it is also a required parameter:
Code Snippet 4: Example of @script Parameter
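A sketch of what the snippet could look like (the species value follows the Iris dataset; the exact column names are assumptions):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    # filter out the setosa rows and calculate the mean sepal width
    setosa <- InputDataSet[InputDataSet$Species == "setosa", ]
    print(mean(setosa$SepalWidth))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```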
In Code Snippet 4 we use the Iris dataset, and we calculate the mean of the sepal width for the setosa species. We pass the script as a literal in the @script parameter. Passing the script as a literal is quite common, and you see it in most examples of sp_execute_external_script. It is, however, not required; the script can be passed in as a variable as well as loaded from a source file. Code snippets 5 and 7 show examples of this:
Code Snippet 5: Script Passed in via a Variable
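A sketch of the variable-based variant, using the @scriptParam name referenced in the text (the R body is illustrative):

```sql
DECLARE @scriptParam nvarchar(max);
SET @scriptParam = N'
  iris_dataset <- InputDataSet
  print(mean(iris_dataset$SepalWidth))';

EXEC sp_execute_external_script
  @language = N'R',
  @script = @scriptParam,
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```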
So, in Code Snippet 5 we see how the script is passed in as a variable (@scriptParam). A use case for this could be that you have scripts stored in a table in a database, and when you want to execute a script, you load that particular script into a variable.
On to loading a script via a source file. Let us assume we have an R source file named iris_r.r with code like so:
Code Snippet 6: R Source File Content
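The file contents could look something like this (the calculation is an illustrative assumption):

```r
# iris_r.r
iris_dataset <- InputDataSet
print(mean(iris_dataset$SepalWidth))
```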
The file is located at c:\rscripts. Executing sp_execute_external_script based on this looks like so:
Code Snippet 7: Script Passed in Through Source File
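Based on the source call quoted further down in the post, the execution looks something like this:

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'source("c:/rscripts/iris_r.r")',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```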
In Code Snippet 7 we see how we use the R command source to load the source code. The source command causes R to accept its input from the named file: input is read and parsed from that file until the end of the file is reached, and then the parsed expressions are evaluated sequentially in the chosen environment. When you use the source command, as in Code Snippet 7, the file path needs to be accessible by the external engine.
You may note how, in Code Snippet 7, the file path is defined the way you would do it in Linux: c:/rscripts/iris_r.r. Doing it this way is not required; you can use Windows path definitions, but then you have to escape the \, like this:
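A sketch of the escaped Windows-path variant:

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'source("c:\\rscripts\\iris_r.r")',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```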
Oh, something to keep in mind: the parameter values all have to be explicit nvarchar. What I mean by that is that when you assign a value to a particular parameter, it has to be DECLARE:ed as an nvarchar. An example of this is Code Snippet 5 and the @scriptParam parameter. If the parameter value is not DECLARE:ed as an nvarchar, it has to be prepended with N, as we see for example in Code Snippet 7: @language = N'R' and @script = N'source("c:/rscripts/iris_r.r")'. If you do not DECLARE or prepend with N, you get an error like this:
Figure 2: Error from a non-nvarchar Parameter Value
One final thing about the @script parameter. Remember how we, back in Internals - X, discussed how SQL Server sends the actual script to the SqlSatellite via a socket connection? When we supply the script through a file name, the data sent is: source("c:/rscripts/iris_r.r"). It may be obvious, but I thought I would mention it regardless.
NOTE: Later on in this post we talk some more about what SQL Server sends to the SqlSatellite.
The @input_data_1 parameter, which is optional, specifies the input data used by the external script. You can only use a T-SQL SELECT statement to generate the data; so no stored procedure calls, but a SELECT against views or table-valued functions works fine:
Code Snippet 8: Using @input_data_1 with straight SELECT
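A sketch of such a call (the iris_dataset <- InputDataSet line is the one referenced later in the post; the calculation is illustrative):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    iris_dataset <- InputDataSet
    print(mean(iris_dataset$SepalWidth))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full';
```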
In Code Snippet 8 we see how the @input_data_1 parameter contains a SELECT statement against the dbo.tb_irisdata_full table, and how the statement executes during the sp_execute_external_script execution (in Internals - XIII we discussed when the statement executes). The dataset generated by the query is referred to in the script as InputDataSet; we discuss the naming later in this post. The SELECT does not have to be against a table; it can, as I mentioned above, be against a view or a user-defined function as well.
This is all straightforward so far, but what if you need multiple datasets loaded into the external engine? My first thought when I saw sp_execute_external_script and the @input_data_1 parameter was that if there is an @input_data_1 parameter, then there surely must be @input_data_n parameters as well. Well, there are not - bummer - so what do we do?
One thought would be to try to execute multiple statements from the same parameter, and then parse out the different datasets in the script, something like this: @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_even; SELECT * FROM dbo.tb_irisdata_uneven'. Good suggestion, but no cigar, because when you try, this happens:
Figure 3: Multi Statement Error
The error we see in Figure 3 is not very descriptive, but what it means is that you cannot have multiple SELECT statements generating multiple datasets. What you can do, however, is use UNION ALL:
Code Snippet 9: SELECT with UNION ALL
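A sketch combining the two tables with UNION ALL (the R body is illustrative):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    iris_dataset <- InputDataSet
    print(nrow(iris_dataset))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_even
                    UNION ALL
                    SELECT * FROM dbo.tb_irisdata_uneven';
```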
The code in Code Snippet 9 uses UNION ALL to combine the results of two or more queries into a single dataset. However, as useful as that is, sometimes it does not work: what if the two queries do not contain the same data types, or you need two (or more) individual datasets? In that case, the solution is to retrieve data from inside the script (in addition to the input data from @input_data_1).
To retrieve data from within the external script, we can use ODBC (in R, the RODBC package) or the highly optimised Microsoft RevoScaleR package. In the following example I use RevoScaleR:
Code Snippet 10: Multiple Data Sets using RevoScaleR
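A sketch of reading a second dataset from inside the script with the RevoScaleR functions RxSqlServerData and rxImport (the connection string, database name and column names are assumptions):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    # the connection string is an assumption for this sketch
    connStr <- "Driver=SQL Server;Server=.;Database=IrisDb;Trusted_Connection=yes"
    sqlData <- RxSqlServerData(sqlQuery = "SELECT * FROM dbo.tb_irisdata_even",
                               connectionString = connStr)
    iris_even <- rxImport(sqlData)
    iris_uneven <- InputDataSet
    print(mean(iris_even$SepalWidth))
    print(mean(iris_uneven$SepalWidth))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_uneven';
```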
The code in Code Snippet 10 shows how we read data within the script using RevoScaleR functionality, together with the data from the @input_data_1 parameter, and how we subsequently calculate the mean of each of the two datasets.
Cool, we have now seen how we pass data into the external script via the @input_data_1 parameter, and how multiple datasets can be used by querying data from inside the script. What about returning data to the caller, you know - like we quite often do from stored procedures?
You may now wonder what this section has to do with returning data to the caller. Don't worry; we get there soon!
So, there are some parameters in sp_execute_external_script that we have not used yet, and @input_data_1_name and @output_data_1_name, which are both optional, are two of them.
We use the @input_data_1_name parameter to explicitly name the variable in the script that contains the data we pass into the external engine. But, hang on, we have not used the @input_data_1_name parameter in the various examples in this post where we pass data to the engine, and it still works? Ah, look in code snippets 8, 9 and 10, and you see a line of code: iris_dataset <- InputDataSet. Here InputDataSet defines the input data, and it so happens that InputDataSet is the default value of @input_data_1_name.
So, let us see what happens if we change some things. Copy the code in Code Snippet 8 and change it to this:
Code Snippet 11: Invalid Input Name
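A sketch of the changed code: we name the input dataset MyDataSet but, incorrectly, still reference InputDataSet in the script:

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    iris_dataset <- InputDataSet  # still refers to the default name
    print(mean(iris_dataset$SepalWidth))',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full',
  @input_data_1_name = N'MyDataSet';
```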
As you see in Code Snippet 11, we assign a name via the @input_data_1_name parameter, but in the script we still use InputDataSet. When we execute, the following error happens:
Figure 4: Non Existent Input Data Object
So the error is that the script cannot find the InputDataSet object. Change the line iris_dataset <- InputDataSet to iris_dataset <- MyDataSet, and execute again. Now all is OK.
So how does the script know what the input data is, i.e. how does it know in Code Snippet 11 that the data belongs to the MyDataSet variable? Well, this comes back, at least indirectly, to what we discussed in Internals - X and how SQL Server sends the script, plus other information, to the SqlSatellite.
Just for fun, let us revisit some of what we did in Internals - X:
- Run Process Monitor as admin and load the filter we used in Internals - X, where we filtered on TCP traffic.
- Execute the code in Code Snippet 8 and look at the output from Process Monitor to see what packets SQL Server sends:
Figure 4: Process Monitor Output
The outlined packet in Figure 4 is the packet containing the script SQL Server sent to the SqlSatellite. Notice the size of the packet, and also make a note of the Path column and its last value (in the figure it is 50887). That value is the port on which SQL Server communicates with the SqlSatellite. We use it now in WireShark to try and see if we can get any more information about what is happening.
So, let us switch over to WireShark:
- Run WireShark as admin.
- Choose the network adapter to “sniff” on. See Internals - X for discussion around loop-back adapters etc.
- Set a display filter on the port SQL Server listens on (the port you saw in the Path column). In this case, we want to sniff outgoing packets, and - if we use what we saw in Figure 4 - the filter should subsequently be:
- Apply the filter.
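Based on the port in Figure 4, a display filter for outgoing packets might look something like this (the exact port number varies from run to run, so use whatever your Process Monitor output shows):

```
tcp.dstport == 50887
```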
Execute the code in Code Snippet 8 again, and look at the output from WireShark. In the output there is a packet with the same length as the highlighted packet in Figure 4. This is the packet SQL Server sends to the SqlSatellite with the script (as discussed above). When we look at the data part of that packet as a hex dump, the first 64 bytes look like this:
Code Snippet 12: WireShark Output I
In Code Snippet 8 we did not define a name for the input dataset, but in the script we used InputDataSet as the variable name, as that is the default. Looking at the hex dump in Code Snippet 12, we see the name InputDataSet 38 bytes into the packet. So it looks like the name we give the dataset is part of the script data SQL Server sends to the SqlSatellite. We can test that theory by executing the code in Code Snippet 11 (remember to change the code that refers to the dataset to iris_dataset <- MyDataSet) and seeing what WireShark outputs:
Code Snippet 13: WireShark Output II
Ok, so when we look at Code Snippet 13, we see how the input dataset name we assigned through the @input_data_1_name parameter is part of the packet.
So we can answer the question of how the script knows which variable contains the input data: SQL Server passes the name of the input data variable as part of the script. Worth noticing is how, in the two packets above, 36 bytes into the packet, the size of the variable name (in double bytes) plus two bytes is defined. In Code Snippet 12 it is hex 1a (decimal 26), and in Code Snippet 13 it is hex 14 (decimal 20).
Finally, we get to how the script knows to return resultsets to the caller. Let us assume that you want to do some calculations on the data passed in through the MyDataSet variable in Code Snippet 11, and then return that data. Maybe something like so:
Code Snippet 14: Return a Result Set
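A sketch of such a script: it takes the input in MyDataSet, does a calculation (a multiplication here, as an assumption) and assigns the result to OutputDataSet:

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    # the multiplication is just an illustrative calculation
    OutputDataSet <- data.frame(MyDataSet$SepalLength * 5)',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full',
  @input_data_1_name = N'MyDataSet';
```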
When you execute the code in Code Snippet 14, part of the resultset looks like this:
Figure 5: Result Set
In Figure 5 we see how a resultset comes back, and, looking at the code in Code Snippet 14 and knowing about InputDataSet, we assume that the reason a resultset comes back is the OutputDataSet variable in the script. It turns out that assumption is correct.
Similar to the input data variable and its default name, the output data variable has a default name: OutputDataSet. You can change this name by setting the @output_data_1_name parameter to a name of your choice and referencing that name in the script:
Code Snippet 15: Customised Result Set Name
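A sketch using the SepalLengthMult name referenced in the text (the calculation is illustrative):

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    SepalLengthMult <- data.frame(MyDataSet$SepalLength * 5)',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full',
  @input_data_1_name = N'MyDataSet',
  @output_data_1_name = N'SepalLengthMult';
```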
In Code Snippet 15 we changed the name of the output data variable to SepalLengthMult and referenced it in the script.
The script engine (and script) knows about the variable name in the same way as it does for the input data variable - it is part of the script packet that SQL Server sends to the SqlSatellite. Below is a partial hex dump of the packet SQL Server sends when we execute the code in Code Snippet 15:
Code Snippet 16: WireShark Output OutputData
In the hex dump in Code Snippet 16, we see the name of the output dataset variable starting at byte 60. We can also see, at byte 58, how the length of the output data variable (in double bytes) plus two bytes is defined: hex 20 (decimal 32).

The @output_data_1_name parameter specifies the name of the variable in the external script that contains the data to be returned to SQL Server when the stored procedure completes. There is an implicit expectation about the data type of the output: for R the expectation is a data frame, and for Python it is a pandas data frame.
Oh, one last thing about the output data. When you look at Figure 5, you see that the resultset has no column names. By default, sp_execute_external_script returns resultsets with unnamed columns, as column names within a script are local to the script. If you want to name the columns, you use the WITH RESULT SETS clause of the EXECUTE statement:
Code Snippet 17: Define Column Name
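A sketch with WITH RESULT SETS, using the column name and data type discussed in the text:

```sql
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    OutputDataSet <- data.frame(MyDataSet$SepalLength * 5)',
  @input_data_1 = N'SELECT * FROM dbo.tb_irisdata_full',
  @input_data_1_name = N'MyDataSet'
WITH RESULT SETS ((LengthMultiplied float));
```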
In Code Snippet 17 we define that the column coming back is named LengthMultiplied and that the column has a data type of float. When we execute the code, we see this:
Figure 6: Result Set
Looking at Figure 6, we see how we received a result set consisting of one column named LengthMultiplied.

WITH RESULT SETS is not something specific to sp_execute_external_script but an option on the EXECUTE statement. If you are interested in what happens when you execute sp_execute_external_script with the WITH RESULT SETS option, I covered it in Internals - XVIII.
In this post we have looked at some of the parameters of sp_execute_external_script:

- @language - tells the launchpad service which external engine to use. At the moment, R and Python are supported.
- @script - defines the script the external engine executes. The script can be passed in as a literal value or a variable, or loaded via the R source command.
- @input_data_1 - specifies the input data used by the external script, in the form of a Transact-SQL query (SELECT only, no procedure calls). The query can only generate one dataset; if more datasets are required, they have to be retrieved from inside the script.
- @input_data_1_name - specifies the name of the variable used to represent the query defined by @input_data_1. The parameter is optional and defaults to InputDataSet.
- @output_data_1_name - specifies the name of the variable in the external script that contains the data to be returned to SQL Server when the stored procedure completes. The parameter is optional and defaults to OutputDataSet.

All of the above parameter values have to be of the nvarchar data type.

In the next blog post we cover some of the remaining parameters of sp_execute_external_script.
If you have comments, questions etc., please comment on this post or ping me.