This is the 22:nd post in a series about Microsoft SQL Server R Services. This journey started about a year ago when I thought it would be a good idea to write a couple of blog posts about, what then went under the name of, Microsoft SQL Server 2016 R Services. Never in my wildest fantasies did I think I would write over 20 posts.
NOTE: Since the introduction of Microsoft SQL Server R Services, it has been renamed to Microsoft SQL Server Machine Learning Services.
To see other posts (including this) in the series, go to SQL Server R Services.
This series began with a discussion about how to install Microsoft SQL Server R Services and continued with the internals of how it works internally in SQL Server and the launchpad service. Throughout the series, we have used sp_execute_external_script
when investigating what goes on. So, I thought that having a couple of posts looking at sp_execute_external_script
would be a good way to finish this series off, and this post is the first of three about sp_execute_external_script
.
Housekeeping
Before we “dive” into today’s topics let us look at the code and the tools we use today. This section is here for those of who want to follow along in what we are doing in the post.
Helper Tools
To help us figure out the things we want, we use Process Monitor, and WireShark:
- Process Monitor, is used it to filter TCP traffic. If you want to refresh your memory about how to do it, we covered that in Internals - X as well as in Internals - XIV.
- I use WireShark for network packet sniffing. If you want a refresher about WireShark, we covered the setup and so forth in Internals - X.
Code
The code below sets up the database and creates some tables with some data. The data is based on the Iris dataset:
|
|
Code Snippet 1: Creating Database Objects and Generating Data
The code in Code Snippet 1 should be fairly self explanatory. The reason why we split up the full Iris dataset is to later on in this post handle multiple datasets.
sp_execute_external_script
The first thing to do is to remind ourselves why we need sp_execute_external_script
at all? Well, as opposed to SQLCLR, the external script engine is not embedded in SQL Server, but sits outside of SQL Server and runs in its own process - it is, (wait for it…), external :). So, if the external engine runs outside of SQL Server there needs to be a bridge between SQL and the engine - and sp_execute_external_script
is that bridge.
In most of the blog posts in this series we have used sp_execute_external_script
something like so (or with variants thereof):
|
|
Code Snippet 2: Use of sp_execute_external_script
When we compare Code Snippet 2 with what the full signature of sp_execute_external_script
, as in Code Snippet 3, we see that we have used sp_execute_external_script
in a somewhat simplistic way:
|
|
Code Snippet 3: Full Signature of sp_execute_external_script
In a majority of the blog posts, the only parameters used are: @language
, @script
and sometimes @input_data_1
, but as we see in Code Snippet 3, there are more:
@language
@script
@input_data_1
@input_data_1_name
@output_data_1_name
@parallel
@params
@parameter1
So, what are these parameters intended use? Let us find out.
@language
By now we should know what the @language
parameter is for. It is to tell the launchpad service which launcher dll to initialise, which in turn targets the correct external engine. At the time of writing, there are two external engines which we can use; R and Python, and I would expect Microsoft to add other engines over time. From my side, I would love to see being able to call into .NET - outside of the SQL Engine.
Figure 1: External Script and Language
Figure 1 illustrates how the launchpad service chooses which launcher dll to launch based on the @language
parameter.
NOTE: Microsoft SQL Server R Services - Internals II covers more about the launchpad service and different types of launchers.
The @language
parameter is a required parameter.
@script
The @script
parameter defines the code the external engine executes, and is a required parameter:
|
|
Code Snippet 4: Example of @script Parameter
In Code Snippet 4 we use the
Iris dataset, and we calculate the mean of the
sepal width for the setosa species. We pass the script as a literal in the @script
parameter. Passing the script as a literal is quite common, and you can see it in most examples of sp_execute_external_script
. It is however not required to pass the script as a literal; it can be passed as a variable as well as loaded from a source file. Code snippets 5 and 7 show examples of this:
|
|
Code Snippet 5: Script Passed in via a Variable
So, in Code Snippet 5 we see how the script is passed in as a variable (@scriptParam
). A use case for this can be that you have scripts stored in a table in a database, and when you want to execute the script you load the particular script into a variable.
Onto loading a script via a source file; let us assume we have an R source file named iris_r.r
with code like so:
|
|
Code Snippet 6: R Source File Content
The file is at: c:\rscripts
. The way to execute sp_execute_external_script
based on this looks like so:
|
|
Code Snippet 7: Script Passed in Through Source File
We see in Code Snippet 7 how we use the R command source
to load the source code. The source
command causes R to accept its input from the named file. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment. When you use the source
command, like in Code Snippet 7, the file path needs to be accessible by the external engine.
You may note how, in Code Snippet 7, the file path is defined like you would do in Linux: c:/rscripts/iris_r.r
. To do it this way is not required, you can use Windows path definitions, but you have to escape the \
, like this: c:\\rscripts\\iris_r.r
.
Oh, something to keep in mind is that the values for @language
, @script
, @input_data_1
, @input_data_1_name
, @output_data_1_name
and @params
all have to be explicit nvarchar
. What I mean with that is when you assign a value to the particular parameter, it has to be DECLARE
:ed as an nvarchar
. An example of this is Code Snippet 5 and the @scriptParam
parameter. If the parameter value is not DECLARE
:ed as an nvarchar
, it has to prepended with N
as we see for example in Code Snippet 7: @language = N'R'
and @script = N'source("c:/rscripts/iris_r.r")
. If you do not DECLARE
or prepend with N
you get an error like this:
Figure 2: Error non nvarchar Parameter Value
One final thing about the @script
parameter. Remember how we, back in
Internals - X, discussed how SQL Server sends the actual script to the SqlSatellite via a socket connection? When we supply the script through a file name, the data sent is: source("c:/rscripts/iris_r.r")
. It may be obvious, but I thought I would mention it regardless.
NOTE: Later on in this post we talk some more about what SQL Server sends to the SqlSatellite.
@input_data_1
The @input_data_1
parameter, which is optional, specifies the input data used by the external script. You can only use a T-SQL SELECT
statement to generate the data; so no stored procedure calls, but SELECT
against views or table-valued functions work fine:
|
|
Code Snippet 8: Using @input_data_1 with straight SELECT
In Code Snippet 8 wee set how the @input_data_1
parameter contains a SELECT
statement against the dbo.tb_irisdata_full
table, and the statement executes during the sp_execute_external_script
execution (in
Internals - XIII we discussed when the statement was executed). The dataset generated by the query is referred to in the script as InputDataSet
, and we discuss the naming later in this post. The SELECT
does not have to be against a table, it can be, as I mentioned above, against a view or a user-defined function as well.
So this is straight-forward so far, but what about if you need multiple datasets loaded into the external engine? My first thought when I saw sp_execute_external_script
and the @input_data_1
parameter, was that if there is an @input_data_1
parameter, then there surely must be @input_data_2
, @input_data_n
parameters as well. Well, there are not, bummer, so what do we do?
A thought would be to try and execute multiple statements from the same parameter and then in the script, parse out the different datasets, something like this: `@input_data_1 = N’SELECT * FROM dbo.tb_irisdata_even; SELECT * FROM dbo.tb_irisdata_uneven’. Good suggestion but no cigar, because when you try - this happens:
Figure 3: Multi Statement Error
The error we see in Figure 3 is not very descriptive, but what it means is that you cannot have multiple SELECT
statements generating multiple datasets. What you can do however is to use UNION
or UNION ALL
:
|
|
Code Snippet 9: SELECT with UNION ALL
The code in Code Snippet 9 uses the UNION ALL
to combine the results of two or more queries into a single dataset. However, as useful as that is, sometimes that does not work: what if the two queries do not contain the same data types, or if you need two (or more) individual data-sets? In that case, the solution is to retrieve data from inside the script (in addition to the input data from @input_data_1
).
To retrieve data from within the external script, we can use ODBC
(in R it is the RODBC
package) or the highly optimized Microsoft
RevoScaleR
package. In the following example I use RevoScaleR
:
|
|
Code Snippet 10: Multiple Data Sets using RevoScaleR
The code in Code Snippet 10 shows how we read in data within the script by the use of RevoScaleR
functionality together with the data from the @input_data_1
parameter, and how we subsequently calculate mean
on the two data-sets.
Cool, we have now seen how we pass data into the external script via the @input_data_1
parameter, and how multiple data-sets can be used by querying data from inside the script. What about returning data to the caller, you know - like we quite often do from stored procedures?
@input_data_1_name
& @output_data_1_name
You may now wonder what this section has to do with returning data to the caller? Don’t worry; we get there soon!
So, there are some parameters in sp_execute_external_script
that we have not used yet, and @input_data_1_name
and @output_data_1_name
, which both are optional, are two of them.
@input_data_1_name
We use this parameter to explicitly name the variable in the script that contains the data we pass into the external engine. But, hang on, we have not used the @input_data_1_name
parameter in the various examples in this post where we pass data to the engine, and it still works? Ah, look in code snippets 8, 9 and 10, and you see a line of code: iris_dataset <- InputDataSet
. Here InputDataSet
defines the input data, and it so happens that the default value of @input_data_1_name
is InputDataSet
.
So, let us see what happens if we change some things. Copy the code in Code Snippet 8 and change it to this:
|
|
Code Snippet 11: Invalid Input Name
As you see in Code Snippet 11 we assign a name to the @input_data_1_name
parameter, but in the script we still use InputDataSet
. When we execute the following error happens:
Figure 4: Non Existent Input Data Object
So the error is that the script cannot find the InputDataSet
object. Change the line: iris_dataset <- InputDataSet
to: iris_dataset <- MyDataSet
, and execute again. Now all is OK.
How does the script know what the input data is, e.g. how does it know in Code Snippet 11 that the data belongs to the MyDataSet
variable? Well this comes back to, at least indirectly, what we discussed in
Internals - X and how SQL Server sent the script, plus other information, to the SqlSatellite.
Just for fun, let us revisit some of what we did in Internals - X:
- Run Process Monitor as admin and load the filter we used in
Internals - X where we filtered for
TCP Receive
forBxlServer.exe
. - Execute the code in Code Snippet 8 and look at the output from Process Monitor to see what packets SQL Server sends:
Figure 4: Process Monitor Output
The outlined packet in Figure 4 is the packet containing the script SQL Server sent to the SqlSatallite. Notice the size of the packet and also make a note of the Path
column and the last value (in the figure it is 50887). That value is the port with which SQL Server communicates with the SqlSatellite. We use this now in WireShark to try and see if we can get any more information about what is happening.
So, let us switch over to WireShark:
- Run WireShark as admin.
- Choose the network adapter to “sniff” on. See Internals - X for discussion around loop-back adapters etc.
- Set a display filter on the port SQL Server listens on (the port you saw in the
Path
column). In this case, we want to sniff outgoing packets, and - if we used what we saw in Figure 4 - the filter should subsequently be:tcp.srcport==50887
. - Apply the filter.
Execute the code in Code Snippet 8 again, and look at the output from WireShark. In the output is a packet with the same length as the highlighted packet in Figure 4. This packet is the packet SQL Server sends to the SqlSatellite with the script (as discussed above). When we look at the data part of that packet as a hex dump, the 64 first bytes look like this:
|
|
Code Snippet 12: WireShark Output I
In Code Snippet 8 we did not define a name for the input data set, but in the script, we used InputDataSet
as a variable name - as that is the default. When looking at the hex dump in Code Snippet 12 we see the name InputDataSet
38 bytes into the packet. So it looks like the name we give the data set is being part of the script data SQL Server sends to the SqlSatellite. We can potentially prove that theory by executing the code in Code Snippet 11 (remember to change the code that refers to the data set to: iris_dataset <- MyDataSet
) and see what WireShark outputs:
|
|
Code Snippet 13: WireShark Output II
Ok, so when we look at the code in Code Snippet 13 we see how the input data set name that we assigned through the @input_data_1_name
parameter is part of the packet.
So we can answer the question how the script knows what variable contains input data by pointing to how SQL Server passes the name of the input data variable as part of the script. Worth noticing is how, in the two packets above, at 36 bytes into the packet, the size of the variable name (in double bytes) plus two bytes are defined. In Code Snippet 12 it is hex: 1a
(decimal 26) , and in Code Snippet 13 it is hex: 14
(decimal 20).
@output_data_1_name
Finally, we get to how the script knows to return resultsets to the caller. Let us assume that you want to do some calculations on the data passed in through the MyDataSet
variable in Code Snippet 11 and then return that data. Maybe something like so:
|
|
Code Snippet 14: Return a Result Set
When you execute the code in Code Snippet 14, part of the resultset looks like this:
Figure 5: Result Set
In Figure 5 we see how a resultset comes back, and looking at the code in Code Snippet 14, and knowing about InputDataSet
, we assume that the reason a resultset comes back is that of the OutputDataSet
variable in the script. It turns out that assumption is correct.
Similar to the input data variable and its default name, the output data variable has a default name: OutputDataSet
. You can change this name by setting the @output_data_1_name
variable to a name of your choice, and reference that name in the script:
|
|
Code Snippet 15: Customised Result Set Name
In Code Snippet 15 we changed the name of the output data variable to SepalLengthMult
and referenced it in the script.
The script engine (and script) knows about the variable name in the same way as it does for the input data variable - it is part of the script packet that SQL Server sends to the SqlSatellite. Below is a partial hex dump of the packet SQL Server sends when we execute the code in Code Snippet 15:
|
|
Code Snippet 16: WireShark Output OutputData
In the hex dump in Code Snippet 16 we see the name of the output dataset variable starting at byte 60. We can also see at byte 58 how the length of the output data variable is defined, plus two bytes: hex 20
.
So, @output_data_1_name
specifies the name of the variable in the external script that contains the data to be returned to SQL Server when the stored procedure completes. There is an implicit expectation of the data type of the output; for R the expectation is a data frame, and for Python, it is a pandas data frame.
Oh, a last thing about the output data. When you look at Figure 5 you see that the resultset has no column names. By default sp_execute_external_script
returns resultsets with unnamed columns as column names within a script are local. If you want to name columns you use the WITH RESULTS SET
clause of the EXECUTE
:
|
|
Code Snippet 17: Define Column Name
In Code Snippet 17 we defined the column coming back with a name of LengthMultiplied
and that the column should have a data type of float
. When we execute the code we see this:
Figure 6: Result Set
Looking at Figure 6 we see how we received a result set consisting of one column named LengthMultiplied
. WITH RESULT SETS
is not something specific for sp_execute_external_script
but an option on the EXECUTE
statement. If you are interested in what happens when you execute sp_execute_external_script
with the WITH RESULT SETS
option, I covered it in
Internals - XVIII.
Summary
In this post we have looked at some of the parameters sp_execute_external_script
expects:
@language
- tells the launchpad service what external engine to use. At the moment R and Python are supported.@script
- defines the script which the external engine executes. The script can be loaded as a literal value, a parameter or through the Rsource(file_name)
command.@input_data_1
- specifies the input data used by the external script in the form of a Transact-SQL query (SELECT
only, no procedure calls). The query can only generate one dataset. If more datasets are required, they have to be retrieved from inside the script.@input_data_1_name
- specifies the name of the variable used to represent the query defined by @input_data_1. The parameter is optional, and defaults toInputDataset
.@output_data_1_name
- specifies the name of the variable in the external script that contains the data to be returned to SQL Server when the stored procedure completes. The parameter is optional, and defaults toOutputDataSet
.
All of the above parameter values have to be of the NVARCHAR
data type.
In next blog post we cover some of the remaining parameters for sp_execute_external_script
.
~ Finally
If you have comments, questions etc., please comment on this post or ping me.
comments powered by Disqus