The College of Public Health and Health Professions uses SAS/Access to move data to and from the PHHP MySQL database. SAS/Access provides fast uploads of large database tables as well as the ability to pass SQL queries to the MySQL database where queries on indexed fields provide for fast subsetting and sorting of large datasets.
Methods and Requirements | Libnam | Proc SQL | SQL Pass Through | Performance Tests
Methods and Requirements
Communication with the MySQL server can be done via the MySQL support in SAS or via ODBC. To use the MySQL support you will need SAS 9.1.2 or higher. ODBC support has been in SAS for quite some time. Current documentation for all of SAS can be found online at SAS 9.1.3 Documentation.
SAS Institute recommends the use of the SAS libname statement along with Proc SQL to make connections and interact with databases. A libname statement defines a name called a “libref” that will point to some data source. That data source can be a file, directory, or database on some system. The libref can then be used as shorthand for that data source in other SAS statements. SAS provides lots of engines so it can read (and usually write to) SAS, SPSS, MySQL, ODBC, Oracle, MS SQL and many other forms of content.
Proc SQL is a command in SAS which can be used to execute SQL against data sources, i.e. librefs. For remote systems it can be instructed to do an SQL pass-through which sends the query to the database server so it can be executed more efficiently. This is more efficient when doing queries that reduce the dataset with a “where” clause or aggregate it with a “group by” clause.
Libname
The libname statement points at container which might be a directory, a database, or–in special cases–a single file.
ODBC & MySQL Connectors
For the ODBC and the MySQL connectors, the container is a database. The objects inside are tables. A libname called “fin” connecting to an ODBC data source name (DSN) of “MySQL-Database” as user “alice” to access the “finance” database would look like this:
libname fin ODBC datasrc="MySQL-Database" user=alice password=our_little_secret schema=finance;
Note that we have not yet referred to a table. To access a table, use refer to it as “libref.table_name”. E.g. To refer to the table ‘payables’ in the fin libref, write
fin.payables
The libname statement also accepts database options. Should you be creating MySQL tables through the ODBC interface, this libname statement would guarantee that they are InnoDB tables.
libname fin ODBC datasrc="MySQL-Database" user=alice password=our_little_secret schema=finance DBCREATE_TABLE_OPTS='TYPE=InnoDB';
SAS Connector
The SAS connector use directories as containers. The objects inside the containers are just files. Revising the above example, here’s a libname called “fin” pointing a directory c:\finance that we will access via the SAS connector:
libname fin 'c:\finance';
To access a sas dataset in the “c:\finance” directory, refer to it as “libref.base-filename”. To refer to the SAS dataset “payables” in the file “c:\finance\payables.sas7bdat”, write
fin.payables
SPSS Connector
The SPSS connector uses a full path to a filename as the container. As it is a file is has only one object inside. Revising the above example, here’s a libname called “fin” pointing a file c:\finance\payables.por that we will access via the SPSS connector:
libname fin SPSS 'c:\finance\payables.por';
To access an SPSS dataset with a libref, refer to it as “libref._first_”. To refer to the SPSS dataset at “c:\finance\payables.por”, with the libref defined above write
fin._first_
Proc SQL
Proc SQL can be used to access any datasets SAS can access using SQL. An example that would work the ODBC or SAS data sources defined above would look this:
proc sql; select account, balance, due_date from fin.payables where balance > 100 order by due_date asc; quit;
Proc SQL can also create datasets via Proc SQL. You could create a table in a database via the ODBC connector and copy a local dataset into it like this:
libname remote ODBC datasrc="MySQL-Database" user=alice password=our_little_secret schema=finance DBCREATE_TABLE_OPTS='TYPE=InnoDB'; libname local 'c:\finance'; proc sql; create table remote.payables as select * from local.payables ; quit;
To insert the first 3000 records of a local dataset into an existing table you could do this:
libname remote ODBC datasrc="MySQL-Database" user=alice password=our_little_secret schema=finance; libname local 'c:\finance'; proc sql inobs=3000 outobs=3000; insert into remote.payables select * from local.payables; quit;
SQL Pass Through
To boost speed of select statements against remote databases, use Remote SQL Pass Through (RSPT):
proc sql; connect to odbc as remote (datasrc="MySQL-Database" user=alice password=our_little_secret schema=finance); select * from connection to remote ( select account, balance, due_date from fin.payables where balance > 100 order by due_date asc ); quit;
Performance tests were run using the ODBC interface to communicate with a MySQL 4.0 server to assess the merits of remote SQL pass through. These results were found:
Test | Rows | Columns | Size in MySQL (mb) | Size in local format (mb) | Time(s) | mb/s | row/s |
---|---|---|---|---|---|---|---|
select returning 699717 rows, with index and RSPT | 3,078,064 | 25 | 445 | N/A | 1.6 | N/A | 1,923,790 |
select returning 699717 rows, with index but not RSPT | 3,078,064 | 25 | 445 | N/A | 115.6 | N/A | 26,627 |
More disturbing than the 70-fold loss of performance, opening the result sets within SAS causes SAS to re-execute the query. Thus casual manipulation of the results sets is quite painful if the query is not created carefully and executed with RSPT.
Performance Tests
Performance tests were run using the ODBC interface to communicate with a MySQL 4.0 server to assess the data load speed from SAS. These results were found:
Test | Rows | Columns | Size in MySQL (mb) | Size in local format (mb) | Time(s) | mb/s | row/s |
---|---|---|---|---|---|---|---|
create and insert on server | 100,000 | 40 | 19.6 | 25 (SAS) | 53 | 0.37 | 1887 |
create and insert on server | 3,078,064 | 25 | 445 | 451 | 1315 | 0.34 | 2341 |
Performance test were run with mysqlcc and mysql against a MySQL 4.0 server to assess the value of indicies. These results were found:
Test | Rows | Columns | Size in MySQL (mb) | Size in local format (mb) | Time(s) | mb/s | row/s |
---|---|---|---|---|---|---|---|
table scan returning 1 row, without index | 3,078,064 | 25 | 445 | N/A | 11.6 | N/A | N/A |
table scan returning 1 row, with index | 3,078,064 | 25 | 445 | N/A | 7.27 | N/A | N/A |
select returning 699717 rows, with index | 3,078,064 | 25 | 445 | N/A | 1.28 | N/A | 2,404,737 |
select returning 699717 rows, with index on 5 of 9 chars in varchar(9) | 3,078,064 | 25 | 445 | N/A | 5.67 | N/A | 542,868 |
select returning 699717 rows, without index | 3,078,064 | 25 | 445 | N/A | 8.62 | N/A | 357,083 |
create index on varchar(9) | 100,000 | N/A | N/A | N/A | 1.90 | N/A | 52,632 |
create index on varchar(9) | 3,078,064 | N/A | N/A | N/A | 59.92 | N/A | 51,370 |
create index on varchar(9)(5) | 3,078,064 | N/A | N/A | N/A | 70.57 | N/A | 43,618 |
create index on varchar(17) | 3,078,064 | N/A | N/A | N/A | 64.69 | N/A | 47,582 |
create index on decimal(11,0) | 3,078,064 | N/A | N/A | N/A | 78.88 | N/A | 39,022 |