Skip to content

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

License

rlugojr/turbodbc

 
 

Repository files navigation

turbodbc logo

Turbodbc - Turbocharged database access for data scientists.

Build Status

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. In addition to complying with the Python Database API Specification 2.0, turbodbc offers built-in NumPy support. Don't wait minutes for your results, just blink.

Features

  • Bulk retrieval of result sets
  • Built-in NumPy conversion
  • Bulk transfer of query parameters
  • Asynchronous I/O for result sets
  • Automatic conversion of decimal type to integer, float, and string as appropriate
  • Supported data types for both result sets and parameters: int, float, str, bool, datetime.date, datetime.datetime
  • Also provides a high-level C++11 database driver under the hood
  • Tested with Python 2.7, 3.4, 3.5, and 3.6
  • Tested on Linux and OSX

Installation

To install turbodbc, please use the following command:

pip install turbodbc

If you want to leverage turbodbc's NumPy support, please make sure to pip install numpy before installing turbodbc, since turbodbc searches for NumPy headers at installation time to determine whether NumPy support can be provided.

Since turbodbc includes C-extensions, make sure the following prerequisites are given:

Requirement Linux (apt-get install) OSX (brew install)
C++-compiler with C++11 support G++-4.8 or higher clang with OSX 10.9+
Boost library + headers* libboost-all-dev boost
Unixodbc library + headers unixodbc-dev unixodbc
Python headers python-dev use pyenv to install

*) The minimum viable boost setup requires the libraries variant, optional, and datetime.

Why should I use turbodbc instead of other ODBC modules?

Short answer: turbodbc is faster.

Slightly longer answer: turbodbc is faster, much faster if you want to work with NumPy.

Medium-length answer: I have tested turbodbc and pyodbc (probably the most popular Python ODBC module) with various databases (Exasol, PostgreSQL, MySQL) and corresponding ODBC drivers. I found turbodbc to be consistently faster.

For retrieving result sets, I found speedups between 1.5 and 7 retrieving plain Python objects. For inserting data, I found speedups of up to 100.

Is this completely scientific? Not at all. I have not told you about which hardware I used, which operating systems, drivers, database versions, network bandwidth, database layout, SQL queries, what is measured, and how I performed was measured.

All I can tell you is that if I exchange pyodbc with turbodbc, my benchmarks took less time, often approaching one (reading) or two (writing) orders of magnitude. Give it a spin for yourself, and tell me if you liked it.

Smooth. What is the trick?

Turbodbc exploits buffering.

  • Turbodbc implements both sending parameters and retrieving result sets using buffers of multiple rows/parameter sets. This avoids round trips to the ODBC driver and (depending how well the ODBC driver is written) to the database.
  • Multiple buffers are used for asynchronous I/O. This allows to interleave Python object conversion and direct database interaction (see performance options below).
  • Buffers contain binary representations of data. NumPy arrays contain binary representations of data. Good thing they are often the same, so instead of converting we can just copy data.

Supported data types

The following data types are supported:

Database type(s) Python type NumPy type Notes
integers, Decimal(<19,0) int int64
floating point, Decimal(x, >0) float float64
bit, boolean-like bool bool_
timestamp, time datetime.datetime datetime64[us]
date datetime.date datetime64[D]
strings, VARCHAR, Decimal(>18, 0) unicode object_ slow, WIP

NumPy types are not yet supported for parameters.

Basic usage

Turbodbc follows the specification of the Python database API v2, which you can find at https://www.python.org/dev/peps/pep-0249/. Here is a short summary, including the parts not specified.

To establish a connection, use any of the following commands:

>>> from turbodbc import connect
>>> connection = connect(dsn='My data source name as given by odbc.ini')
>>> connection = connect(dsn='my dsn', user='my user has precedence')
>>> connection = connect(dsn='my dsn', username='field names may depend on the driver')

To execute a query, you need a cursor object:

>>> cursor = connection.cursor()

Here is how to execute a SELECT query:

>>> cursor.execute('SELECT 42')
>>> for row in cursor:
>>>     print list(row)

Here is how to execute an INSERT query with many parameters:

>>> parameter_sets = [['hi', 42],
                      ['there', 23]]
>>> cursor.executemany('INSERT INTO my_table VALUES (?, ?)',
                       parameter_sets)

NumPy support

Here is how to retrieve a result set in the form of NumPy arrays:

>>> cursor.execute("SELECT A, B FROM my_table")
>>> cursor.fetchallnumpy()
OrderedDict([('A', masked_array(data = [42 --],
                                mask = [False True],
                                fill_value = 999999)),
             ('B', masked_array(data = [3.14 2.71],
                                mask = [False False],
                                fill_value = 1e+20))])

Please note a few things:

  • The return value is an OrderedDict of column name/value pairs. The column order is the same as in your query.
  • The column values are MaskedArrays. Any NULL values you have in your database will show up as masked entries (NULL values in string-like columns will shop up as None objects).
  • NumPy support is limited to result sets, experimental, and will probably change with the next iterations.

Performance options

Turbodbc offers some options to tune the performance for your database:

>>> from turbodbc import Megabytes
>>> connect(dsn="my dsn",
            read_buffer_size=Megabytes(100),
            parameter_sets_to_buffer=5000,
            use_async_io=True)

read_buffer_size affects how many result set rows are retrieved per batch of results. Set the attribute to turbodbc.Megabytes(42) to have turbodbc determine the optimal number of rows per batch so that the total buffer amounts to 42 MB. This is recommended for most users and databases. You can also set the attribute to turbodbc.Rows(13) if you would like to fetch results in batches of 13 rows. By default, turbodbc fetches results in batches of 20 MB.

Similarly, parameter_sets_to_buffer changes the number of parameter sets which are transferred per batch of parameters (e.g., as sent with executemany()). Please note that it is not (yet) possible to use the Megabytes and Rows classes here.

Finally, set use_async_io to True if you would like to benefit from asynchronous I/O operations (limited to result sets for the time being). Asynchronous I/O means that while the main thread converts result set rows retrieved from the database to Python objects, another thread fetches a new batch of results from the database in the background. This may yield significant speedups when retrieving and converting are similarly fast operations.

Asynchronous I/O is experimental and has to fully prove itself yet.
Don't be afraid to give it a try, though.

Development version

To use the latest version of turbodbc, you need to follow these steps:

  • Get the source code from github

  • Check the source build requirements (see below) are installed on your computer

  • Create a build directory. Make this your working directory.

  • Execute the following command:

    cmake -DCMAKE_INSTALL_PREFIX=./dist /path/to/source/directory
    

    This will prepare the build directory for the actual build step.

  • Execute the make command to build the code.

  • You can execute the tests with ctest.

  • To create a Python source distribution for simple installation, use the following commands:

    make install
    cd dist
    python setup.py sdist
    

    This will create a .tar.gz file in the folder dist/dist in your build directory. This file is self-contained and can be installed by other users using pip install.

Source build requirements

For the development build, you also require the following additional dependencies:

  • CMake

Supported environments

  • Linux (successfully built on Ubuntu 12, Ubuntu 14, Debian 7, Debian 8)
  • OSX (successfully built on Sierra a.k.a. 10.12 and El Capitan a.k.a. 10.11)
  • Python 2.7, 3.4, 3.5, 3.6
  • More environments probably work as well, but these are the versions that are regularly tested on Travis or local development machines.

Supported databases

Turbodbc uses suites of unit and integration tests to ensure quality. Every time turbodbc's code is updated on GitHub, turbodbc is automatically built from scratch and tested with the following databases:

  • PostgreSQL
  • MySQL

During development, turbodbc is tested with the following database:

  • Exasol

Releases will not be made if any (implemented) test fails for any of the databases listed above. In addition to these well-supported databases, the following databases are tested on an irregular basis:

  • MSSQL with FreeTDS
  • MSSQL with Microsoft's official ODBC driver

These database/driver combinations do not yet pass all tests.

There is a good chance that turbodbc will work with other, totally untested databases as well. There is, however, an equally good chance that you will encounter compatibility issues. If you encounter one, please take the time to report it so turbodbc can be improved to work with more real-world databases. Thanks!

I got questions and issues to report!

In this case, please use turbodbc's issue tracker on GitHub.

Is there a guided tour through turbodbc's entrails?

Yes, there is! Check out this blog post on the making of turbodbc.

Is turbodbc on Twitter?

Yes, it is! Just follow @turbodbc for the latest turbodbc talk and news about related technologies.

About

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. The module complies with the Python Database API Specification 2.0.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 83.0%
  • Python 13.2%
  • CMake 2.0%
  • Makefile 1.5%
  • Other 0.3%