For this month’s T-SQL Tuesday, hosted by Sebastian Meine, we are asked to talk about joins.

Joins in T-SQL are really easy and efficient compared to joins in most other languages.  My goal here is to take a simple example and show how such a join could be represented in SQL and several other languages, to help you appreciate how easy it is to get a highly optimized join query using SQL.

Example case:

We will be building a query to return the name of each product and the name of its associated Subcategory from the Product and ProductSubcategory tables in the AdventureWorks sample database.

For the purposes of this example, we will treat these tables either as tables or as tab-delimited CSV files (with a header row, except where noted for Pig), as appropriate for the language.

SQL

SELECT P.Name AS ProductName, PSC.Name AS SubcategoryName
FROM Product AS P
LEFT JOIN ProductSubcategory AS PSC
ON P.ProductSubcategoryID = PSC.ProductSubcategoryID

Python

import csv
product = []
with open("Product.csv") as fh:
	reader = csv.DictReader(fh,delimiter="\t")
	for line in reader:
		product.append(line)
productsubcategory = []
with open("ProductSubcategory.csv") as fh:
	reader = csv.DictReader(fh,delimiter="\t")
	for line in reader:
		productsubcategory.append(line)
results = []
for prod in product:
	prodname = prod.get("Name")
	subcategoryid = prod.get("ProductSubcategoryID")
	subcatname = None
	if subcategoryid != '':
		for subcat in productsubcategory:
			if subcategoryid == subcat.get("ProductSubcategoryID"):
				subcatname = subcat.get("Name")
				break
	results.append({"ProductName":prodname,"SubcategoryName":subcatname})
print results

R

product<-read.csv('Product.csv',sep="\t",header=TRUE)
productsubcategory<-read.csv('ProductSubcategory.csv',sep="\t",header=TRUE)
results.full<-merge(product,productsubcategory,by.x="ProductSubcategoryID",by.y="ProductSubcategoryID",all.x=TRUE)
results.full.subset<-subset(results.full,select=c("Name.x","Name.y"))
names(results.full.subset)<-c("ProductName","SubcategoryName")
print(results.full.subset)

Powershell

$product = import-csv "Product.csv" -Delimiter "`t"
$productsubcategory = import-csv "ProductSubcategory.csv" -Delimiter "`t"
$product | foreach-object {
    $prod = $_
    if ($prod.ProductSubcategoryID -ne '') {
        $productsubcategory | 
            where-object {$_.ProductSubcategoryID -eq $prod.ProductSubcategoryID} |
            select-object @{Name='ProductName';Expression={$prod.name}},@{Name='SubcategoryName';Expression={$_.name}}
    } else {
        $prod | select-object @{Name='ProductName';Expression={$_.name}},@{Name='SubcategoryName';Expression={$null}}
    }
}

Pig

--For this example, we will treat the file as if the headers are removed and the newline character is converted to \n rather than \r\n
product = LOAD 'Product.csv' USING PigStorage('\t');
subcategory = LOAD 'ProductSubcategory.csv' USING PigStorage('\t');
product_project = FOREACH product GENERATE $1 AS ProductName, $18 AS ProductSubcategoryID;
subcategory_project = FOREACH subcategory GENERATE $0 AS ProductSubcategoryID,$2 AS SubcategoryName;
joined = JOIN product_project BY ProductSubcategoryID LEFT, subcategory_project BY ProductSubcategoryID;
joined_project = FOREACH joined GENERATE ProductName,SubcategoryName;
--Executing the below line will explain the logical execution plan to get to the final results
--EXPLAIN joined_project;
DUMP joined_project;

Conclusion:

Of the demonstrated languages, SQL is the only language that is fully optimized to handle relational algebra, such as joins.  In this space, SQL offers three major advantages over the other languages:

  • SQL requires the least code
  • SQL can take advantage of indexes and other metadata
  • SQL does not require you to care about the execution plan

Of the remaining languages, both R and Pig offer capabilities that abstract away parts of the execution plan, but not the entire plan.  Python and Powershell require you to manually define the entire execution plan.  Fortunately, with the exception of Pig, all of these languages have libraries available that allow you to interact with a database using SQL.
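For instance, from Python you can hand the entire join off to a database engine instead of writing the nested loops above. Here is a minimal sketch using the standard-library sqlite3 module (the database file name is an assumption for illustration; against a SQL Server database you would use a driver such as pyodbc with essentially the same pattern):

import sqlite3

# Assumes the Product and ProductSubcategory tables have already been loaded
# into a local SQLite database file (illustrative only)
conn = sqlite3.connect("adventureworks.db")
cursor = conn.cursor()
cursor.execute("""
SELECT P.Name AS ProductName, PSC.Name AS SubcategoryName
FROM Product AS P
LEFT JOIN ProductSubcategory AS PSC
ON P.ProductSubcategoryID = PSC.ProductSubcategoryID
""")
results = cursor.fetchall()
conn.close()
print results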

After teaching myself Python and watching others stumble through the same process, it seemed prudent to assemble a set of resources that I used, or wish I had known about, as I learned.

Start with Python 2.7

Python 3.x may be the future, but Python 2.7 still has much more comprehensive library support.  Spend your time learning Python itself rather than fighting with incompatible libraries: start with version 2.7.
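If you are not sure which version a given machine is running, a quick check from the interpreter looks like this (a minimal sketch; the exact values will differ on your machine):

import sys

# On a 2.7 interpreter this reports major=2, minor=7.
# Note that the Python 2 print statement used here (and in the join example
# earlier) is itself a syntax error on Python 3.x.
print sys.version_info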

Get a package manager

Particularly when you are starting with Python, you will frequently find yourself installing, uninstalling and searching for libraries.  By default Python comes with easy_install to help with installing new libraries, but it provides little to no support for finding new packages, or figuring out what you already have installed.

Pip is an excellent, cross-platform replacement that can be used to search for and install new libraries and to keep track of what you already have installed.

Find a good IDE

A good IDE will help smooth out some of the roadblocks you are sure to encounter while learning a new language.  Some of the key features you should look for in an IDE are:

  • A command line
  • A script editor
  • Integrated help
  • A workspace view

Together these features make it very easy to build a script incrementally, exploring and learning as you go without having to jump around between multiple tools.

So far, my favorite IDE for Python is IEP because it has all of the features noted above.

Go through a tutorial

It is always easier to learn a language if you can start with a grounding in the basic commands before tackling harder problems.  I neglected this step and it cost me a lot of time and energy.

Learn basic exploratory commands

There are three commands that are indispensable when learning python:

  • type()
  • dir()
  • help()

Between these functions, you can determine the type of an object, most of the methods you can use on the object and how to use those methods.  Of particular note is that you can pass object.<method> to help to get details for an individual method.
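For example, here is a minimal sketch run against a plain list in the interactive interpreter:

numbers = [1, 2, 3]

print type(numbers)    # <type 'list'>
print dir(numbers)     # includes methods such as append, sort and index
help(numbers.append)   # prints the documentation for list.append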

Remember that Python is supposed to be easy

Whenever you feel that a task is harder than it should be, you are probably right.  Take a step back and think about what you are trying to accomplish.  There probably is a method or library out there that will make your task easy.  Look at the documentation for the object(s) you are working with and/or search for a relevant library.  Even if you cannot find a method or library to solve the problem directly, someone on stackoverflow.com has probably asked a similar question and gotten a good answer for how to approach the problem.

While attempting to write out an algorithm to find all triangles (a->b->c->a) from an adjacency matrix, I stumbled across functionality in the igraph library for R that makes this process trivial to implement for any pattern.

library(igraph)
#create an arbitrary graph
adjacency.graph <- erdos.renyi.game(100,0.1,directed=TRUE)
#create the pattern to match
triangle <- graph.ring(3,directed=TRUE)
#find all matches
adjacency.triangles <- graph.get.subisomorphisms.vf2(adjacency.graph,triangle)
#define default formatting
adjacency.graph.colored<-adjacency.graph
V(adjacency.graph.colored)$shape = "none"
V(adjacency.graph.colored)$label.font = 0.5
E(adjacency.graph.colored)$color="black"
#map matches back to the original graph
for (vec in adjacency.triangles) {
  V(adjacency.graph.colored)[c(vec)+1]$label.color="dark red"
  E(adjacency.graph.colored,path=c(vec,vec[1])+1)$color="red"
}
plot(adjacency.graph.colored,edge.arrow.size=0.2)

[sample plot]

Of particular note is the fact that the heavy lifting of this script is all done using a single command (graph.get.subisomorphisms.vf2()).  Everything else in the script functions as a wrapper to pull data in or present the results.

Be cautious when executing the above script for large data sets.  Since we are interested in triangles, the complexity of this algorithm scales with O(n^3) where n is the number of edges.  On my machine, ~1000 edges took less than a second to process, but ~50000 edges took 35 seconds to process.  Looking for more complex shapes would cause the complexity to further increase.

During the orientation for the new University of Washington data science certificate program, we were prompted to answer several questions about our background and what we were looking for out of the program.  After answering these questions and hearing the responses of others, I felt that the question of our background in data science deserved a more detailed response and discussion than was practical during the orientation.

What is data science?

Before tackling the question of my background in data science, I want to make clear what I see data science to be.

I feel very strongly that data science is finding patterns in data.  It is more than just a collection of concepts like “big data” or technologies like Hadoop. It is a way of approaching the world, identifying data sets that you encounter and the patterns contained within.  Just like the scientific method, it has formalized components, but can be practiced informally in all parts of your life.

My background in data science

While I have not had any formal training in data science (it being a relatively new field), I see data science everywhere I look in my background.

While studying to be a biochemist, I was drawn to classes that involved programming, data processing and statistics. Analytical chemistry in particular was a great match for me because it was entirely focused on how to collect data and extract meaning from it.  I didn’t know it at the time, but I was already becoming a data scientist.

While doing research after graduation, I was frequently the go-to person for coming up with ways to analyze data and for troubleshooting problems where the data we were collecting was not what we expected.  Again, I was practicing data science, but doing so with rather crude tools.

When I worked as a lab technician, I was much less useful at executing a set procedure than at identifying patterns in the systematic errors in our tests and figuring out how we could fix them.

Switching career paths from Chemistry to database administration seemed odd to many of my potential employers, but felt totally natural to me.  In both career paths, I focused on collecting data, finding patterns, and correcting systematic problems.  The only differences were the technologies I used and the kind of data I was looking at.

For the first Meme Monday, Thomas LaRock (blog | twitter) has asked us to write a blog post in 11 words or less.

I was tagged by Dev Nambi (blog | twitter) and Will Sisney (blog | twitter) to participate, so here is my entry:

Twitter and #sqlhelp: There is no better SQL resource

On occasion, I have encountered folders or files getting flagged with incorrect attributes.  Some of these are easy to address, such as read only.  The more problematic changes tend to be when something gets flagged as system.

Fortunately, powershell offers a very concise and easy way to reset the attributes of a folder or file back to their default values.

$path = 'F:\stuff'
$dir = get-item $path -force
$dir.attributes = 'Normal'

This script will get an item that may otherwise be inaccessible and reset its attributes back to whatever is normal for that item type.  Just replace the value for $path with the path to the item you are concerned with.

Script Disclaimer, for people who need to be told this sort of thing:

Never trust any script, including those that you find here, until you understand exactly what it does and how it will act on your systems. Always check the script on a test system or Virtual Machine, not a production system. Yes, there are always multiple ways to do things, and this script may not work in every situation, for everything. It’s just a script, people. All scripts on this site are performed by a professional stunt driver on a closed course. Your mileage may vary. Void where prohibited. Offer good for a limited time only. Keep out of reach of small children. Do not operate heavy machinery while using this script. If you experience blurry vision, indigestion or diarrhea during the operation of this script, see a physician immediately.

Credit for this fantastic disclaimer goes to Buck Woody (Blog | Twitter).

Going through your databases to find and clean up users with no associated server logins can be a pain.  It is fairly easy to find scripts that will go out and find those users, and even some will drop those users automatically.  Unfortunately, sometimes it is not so simple to just drop a user from a database because they own one or more schema.

To drop these users from a database, any schemas owned by them must first be altered to change the owner and remove the dependency.

Thus I set to work on a powershell script to identify users with no associated login, identify the schemas that need to be altered before those users can be dropped, and create a document logging the changes to be made.

Here is the script I came up with:

$servername = 'localhost' #change to the name of the server you are interested in
$outputfile = 'C:\localhost modifications.csv'
$targetowner = 'dbo'
 
[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SMO") | Out-Null
 
$s = new-object ("Microsoft.SqlServer.Management.SMO.Server") $servername
 
#Uncomment this section if you want to use SQL Authentication
 
#$s.connectioncontext.loginsecure = $false
#$cred = get-credential
#$s.connectioncontext.set_Login($($cred.username.replace("\","")))
#$s.connectioncontext.set_securepassword($cred.password)
 
$dbs = $s.databases | ?{-not $_.IsSystemObject}
 
#explicitly declare the collection variables as arrays
[Microsoft.SqlServer.Management.Smo.User[]]$users = $Null
[Microsoft.SqlServer.Management.Smo.Schema[]]$schemas = $Null
[PSObject[]]$output = $Null
 
#Get the users to drop and all schema owned by those users
$dbs | %{
    $db = $_
    [Microsoft.SqlServer.Management.Smo.User[]]$dbusers = $db.users | ?{$_.Login -eq '' -and -not $_.IsSystemObject}
    $users += $dbusers
    [string[]]$OrphanSchema = $Null
    If ($dbusers.count -gt 0) {
        $dbusers | %{$OrphanSchema += $_.EnumOwnedObjects()}
        $schemas += $db.Schemas | ?{$OrphanSchema -contains $_.urn}
    }
}
If ($schemas.count -gt 0) {
    $output += $schemas | SELECT @{Name='Database';Expression={$_.Parent}}, Name, @{Name='Type';Expression={'Schema'}}, @{Name='Action';Expression={"Change owner from $($_.Owner) to $targetowner"}}
}
If ($Users.count -gt 0) {
    $output += $Users | SELECT @{Name='Database';Expression={$_.Parent}}, Name, @{Name='Type';Expression={'User'}}, @{Name='Action';Expression={"Drop"}}
}
If ($output -eq $Null) {
    $output = ''
}
$output | export-csv $outputfile -Force -notype
 
#uncomment these lines when you are ready to perform the modifications
#$schemas | %{$_.owner = $targetowner; $_.Alter()}
#$users | %{$_.Drop()}

Initially, this script should be run with the last two lines commented out to generate a csv file of all the schema the script will change and all of the users it will drop. Once you feel confident that this script will do exactly what you want it to, these lines can be uncommented to execute the changes.

Script Disclaimer, for people who need to be told this sort of thing:

Never trust any script, including those that you find here, until you understand exactly what it does and how it will act on your systems. Always check the script on a test system or Virtual Machine, not a production system. Yes, there are always multiple ways to do things, and this script may not work in every situation, for everything. It’s just a script, people. All scripts on this site are performed by a professional stunt driver on a closed course. Your mileage may vary. Void where prohibited. Offer good for a limited time only. Keep out of reach of small children. Do not operate heavy machinery while using this script. If you experience blurry vision, indigestion or diarrhea during the operation of this script, see a physician immediately.

Credit for this fantastic disclaimer goes to Buck Woody (Blog | Twitter).

Day 1 of 24 hours of PASS is off to a fantastic start and I have already learned about a number of things that I didn't realize powershell could do, but really am not surprised by.

Much like day 1, day 2 looks to have a fantastic set of sessions that makes me wish I had time for them all.  Thus, I have to be judicious about what will be most applicable to me at this time.

Statistics: How to Prove Everything but the Truth: Without good statistics, it is unlikely that the query optimizer will pick a good execution plan.  Knowing how statistics work will be invaluable in making sure that the query optimizer has up-to-date and useful information when it is doing its job.

Preparing for Your Next Job: I may have just started a new job recently and have no intention going anywhere anytime soon, but professional development is always useful.

Storage for the DBA: I want to make sure I can give good advice when configuring storage for the database servers that I am going to support and this seems like the right place to get that information.  Plus I have seen Denny Cherry (blog | @mrdenny) present once before and he was quite entertaining, so I expect a good show this time too.

T-SQL Brush-Up: The Best Things You Forgot You Knew: It is so easy to forget about tools that I don’t use often.  Refreshers are always useful and I may even encounter tools I have never used before.

Relational Database Design for Utter Newbies: By getting involved in the design process for new databases, DBAs can make their job down the line much easier.  Knowing how to design databases well makes it much easier for me to be successful in such endeavors.

Top 10 Design Mistakes: Everyone makes mistakes more often than they want to.  Reminders of the mistakes I make when I should know better decrease the frequency at which I make them.

Understanding SARGability: I just finished tackling an ugly problem where SARGability was broken but it wasn’t clear why.  Knowing more about how to write where and join clauses that allow the query optimizer to do its job well will reduce these problems in the future.

Unfortunately I was unable to attend most of the original 24 hours of PASS.  Setting time aside to watch the sessions after the fact proved to be extremely challenging, so this time I am making a distinct point of attending the sessions when they first occur.

Without further ado, here are the sessions I plan to attend tomorrow.

Gather SQL Server Performance Data with PowerShell: Since I first tried out powershell, I have been hooked.  It has a pretty steep learning curve, however, so figuring out how to script new things tends to take longer than I would like. As such, collecting as many scripts as possible to adapt to my environment is a fantastic method of smoothing out the learning curve.

Hardware 201: Selecting and Sizing Database Hardware for OLTP Performance: I could read blogs, white papers and more for days on hardware and not get bored, but information always sticks better for me when I am not just reading.  Having more information about current and upcoming hardware options can help me give better advice when planning new servers or upgrades.

Isolation vs Concurrency: What Are the Choices?: Isolation and concurrency are a big issue at my company, and understanding the implications of each isolation level decision and when to suggest deviations from the standard practices would be extremely valuable for me.

Identifying Costly Queries: I don’t believe it is possible to know enough about how to find and fix poor performance.  The more tools available, the better, and this session looks to be full of excellent tools.

How to Rock Your Presentations: I may not be presenting yet, but I have every intention of moving in that direction.  Having a good basis of knowledge about how to present is important as I accumulate knowledge that is worth presenting.

TempDB Configuration and Management: The session description on its own sums up why I want to attend better than anything I could write.