Turbodbc - Turbocharged database access for data scientists.

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. In addition to complying with the Python Database API Specification 2.0, turbodbc offers built-in NumPy support. Don't wait minutes for your results, just blink.

Features

Bulk retrieval of result sets
Built-in NumPy conversion
Bulk transfer of query parameters
Asynchronous I/O for result sets
Automatic conversion of decimal type to integer, float, and string as appropriate
Supported data types for both result sets and parameters: int, float, str, bool, datetime.date, datetime.datetime
Also provides a high-level C++11 database driver under the hood
Tested with Python 2.7, 3.4, 3.5, and 3.6
Tested on Linux and OSX

Installation

To install turbodbc, please use the following command:

pip install turbodbc

If you want to leverage turbodbc's NumPy support, please make sure to pip install numpy before installing turbodbc, since turbodbc searches for NumPy headers at installation time to determine whether NumPy support can be provided.

Since turbodbc includes C-extensions, make sure the following prerequisites are given:

Requirement	Linux (`apt-get install`)	OSX (`brew install`)
C++-compiler with C++11 support	G++-4.8 or higher	clang with OSX 10.9+
Boost library + headers*	`libboost-all-dev`	`boost`
Unixodbc library + headers	`unixodbc-dev`	`unixodbc`
Python headers	`python-dev`	use `pyenv` to install

*) The minimum viable boost setup requires the libraries variant, optional, and datetime.

Why should I use turbodbc instead of other ODBC modules?

Short answer: turbodbc is faster.

Slightly longer answer: turbodbc is faster, much faster if you want to work with NumPy.

Medium-length answer: I have tested turbodbc and pyodbc (probably the most popular Python ODBC module) with various databases (Exasol, PostgreSQL, MySQL) and corresponding ODBC drivers. I found turbodbc to be consistently faster.

For retrieving result sets, I found speedups between 1.5 and 7 retrieving plain Python objects. For inserting data, I found speedups of up to 100.

Is this completely scientific? Not at all. I have not told you about which hardware I used, which operating systems, drivers, database versions, network bandwidth, database layout, SQL queries, what is measured, and how I performed was measured.

All I can tell you is that if I exchange pyodbc with turbodbc, my benchmarks took less time, often approaching one (reading) or two (writing) orders of magnitude. Give it a spin for yourself, and tell me if you liked it.

Smooth. What is the trick?

Turbodbc exploits buffering.

Turbodbc implements both sending parameters and retrieving result sets using buffers of multiple rows/parameter sets. This avoids round trips to the ODBC driver and (depending how well the ODBC driver is written) to the database.
Multiple buffers are used for asynchronous I/O. This allows to interleave Python object conversion and direct database interaction (see performance options below).
Buffers contain binary representations of data. NumPy arrays contain binary representations of data. Good thing they are often the same, so instead of converting we can just copy data.

Supported data types

The following data types are supported:

Database type(s)	Python type	NumPy type	Notes
integers, Decimal(<19,0)	`int`	`int64`
floating point, Decimal(x, >0)	`float`	`float64`
bit, boolean-like	`bool`	`bool_`
timestamp, time	`datetime.datetime`	`datetime64[us]`
date	`datetime.date`	`datetime64[D]`
strings, VARCHAR, Decimal(>18, 0)	`unicode`	`object_`	slow, WIP

NumPy types are not yet supported for parameters.

Basic usage

Turbodbc follows the specification of the Python database API v2, which you can find at https://www.python.org/dev/peps/pep-0249/. Here is a short summary, including the parts not specified.

To establish a connection, use any of the following commands:

>>> from turbodbc import connect
>>> connection = connect(dsn='My data source name as given by odbc.ini')
>>> connection = connect(dsn='my dsn', user='my user has precedence')
>>> connection = connect(dsn='my dsn', username='field names may depend on the driver')

To execute a query, you need a cursor object:

>>> cursor = connection.cursor()

Here is how to execute a SELECT query:

>>> cursor.execute('SELECT 42')
>>> for row in cursor:
>>>     print list(row)

Here is how to execute an INSERT query with many parameters:

>>> parameter_sets = [['hi', 42],
                      ['there', 23]]
>>> cursor.executemany('INSERT INTO my_table VALUES (?, ?)',
                       parameter_sets)

NumPy support

Here is how to retrieve a result set in the form of NumPy arrays:

>>> cursor.execute("SELECT A, B FROM my_table")
>>> cursor.fetchallnumpy()
OrderedDict([('A', masked_array(data = [42 --],
                                mask = [False True],
                                fill_value = 999999)),
             ('B', masked_array(data = [3.14 2.71],
                                mask = [False False],
                                fill_value = 1e+20))])

Please note a few things:

The return value is an OrderedDict of column name/value pairs. The column order is the same as in your query.
The column values are MaskedArrays. Any NULL values you have in your database will show up as masked entries (NULL values in string-like columns will shop up as None objects).
NumPy support is limited to result sets, experimental, and will probably change with the next iterations.

Performance options

Turbodbc offers some options to tune the performance for your database:

>>> from turbodbc import Megabytes
>>> connect(dsn="my dsn",
            read_buffer_size=Megabytes(100),
            parameter_sets_to_buffer=5000,
            use_async_io=True)

read_buffer_size affects how many result set rows are retrieved per batch of results. Set the attribute to turbodbc.Megabytes(42) to have turbodbc determine the optimal number of rows per batch so that the total buffer amounts to 42 MB. This is recommended for most users and databases. You can also set the attribute to turbodbc.Rows(13) if you would like to fetch results in batches of 13 rows. By default, turbodbc fetches results in batches of 20 MB.

Similarly, parameter_sets_to_buffer changes the number of parameter sets which are transferred per batch of parameters (e.g., as sent with executemany()). Please note that it is not (yet) possible to use the Megabytes and Rows classes here.

Finally, set use_async_io to True if you would like to benefit from asynchronous I/O operations (limited to result sets for the time being). Asynchronous I/O means that while the main thread converts result set rows retrieved from the database to Python objects, another thread fetches a new batch of results from the database in the background. This may yield significant speedups when retrieving and converting are similarly fast operations.

Asynchronous I/O is experimental and has to fully prove itself yet.
Don't be afraid to give it a try, though.

Development version

To use the latest version of turbodbc, you need to follow these steps:

Get the source code from github
Check the source build requirements (see below) are installed on your computer
Create a build directory. Make this your working directory.
Execute the following command:
```
cmake -DCMAKE_INSTALL_PREFIX=./dist /path/to/source/directory
```
This will prepare the build directory for the actual build step.
Execute the make command to build the code.
You can execute the tests with ctest.
To create a Python source distribution for simple installation, use the following commands:
```
make install
cd dist
python setup.py sdist
```
This will create a .tar.gz file in the folder dist/dist in your build directory. This file is self-contained and can be installed by other users using pip install.

Source build requirements

For the development build, you also require the following additional dependencies:

CMake

Supported environments

Linux (successfully built on Ubuntu 12, Ubuntu 14, Debian 7, Debian 8)
OSX (successfully built on Sierra a.k.a. 10.12 and El Capitan a.k.a. 10.11)
Python 2.7, 3.4, 3.5, 3.6
More environments probably work as well, but these are the versions that are regularly tested on Travis or local development machines.

Supported databases

Turbodbc uses suites of unit and integration tests to ensure quality. Every time turbodbc's code is updated on GitHub, turbodbc is automatically built from scratch and tested with the following databases:

PostgreSQL
MySQL

During development, turbodbc is tested with the following database:

Exasol

Releases will not be made if any (implemented) test fails for any of the databases listed above. In addition to these well-supported databases, the following databases are tested on an irregular basis:

MSSQL with FreeTDS
MSSQL with Microsoft's official ODBC driver

These database/driver combinations do not yet pass all tests.

There is a good chance that turbodbc will work with other, totally untested databases as well. There is, however, an equally good chance that you will encounter compatibility issues. If you encounter one, please take the time to report it so turbodbc can be improved to work with more real-world databases. Thanks!

I got questions and issues to report!

In this case, please use turbodbc's issue tracker on GitHub.

Is there a guided tour through turbodbc's entrails?

Yes, there is! Check out this blog post on the making of turbodbc.

Is turbodbc on Twitter?

Yes, it is! Just follow @turbodbc for the latest turbodbc talk and news about related technologies.

Name		Name	Last commit message	Last commit date
Latest commit History 691 Commits
cmake_scripts		cmake_scripts
cpp		cpp
docs		docs
google_test		google_test
page		page
performance_scripts		performance_scripts
pybind11 @ f3de2d5		pybind11 @ f3de2d5
python		python
travis		travis
.gitignore		.gitignore
.gitmodules		.gitmodules
.travis.yml		.travis.yml
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py
test_requirements.txt		test_requirements.txt

License

rlugojr/turbodbc

Folders and files

Latest commit

History

Repository files navigation