We use DB Browser for SQLite and the Alzheimer’s Disease Neuroimaging Initiative Teaching dataset throughout this lesson. See Setup for instructions on how to download the data, and also how to install DB Browser for SQLite.
To start, let’s orient ourselves in our project workflow. Previously, we used Excel and OpenRefine to go from messy, human created data to cleaned, computer-readable data. Now we’re going to move to the next piece of the data workflow, using the computer to read in our data, and then use it for analysis and visualization.
SQL stands for Structured Query Language. SQL allows us to interact with relational databases through queries. These queries can allow you to perform a number of actions such as: insert, update and delete information in a database.
The data we will be using come from a longitudinal study of Alzheimer’s disease called the Alzheimer’s Disease Neuroimaging Initiative (ADNI). ADNI began in 2004 and includes data gathered by investigators at 59 research centers across North America. Participants in the study are older adults along the full spectrum of cognitive health, from Normal Control (NC) through Early and Late Mild Cognitive Impairment (EMCI/LMCI) and Alzheimer’s dementia (AD). The dataset includes individual-level demographics, such as age, sex, and education; measures of cognitive performance, such as scores on tests of memory and attention; and results of biomarker assessment, including genetic markers of risk, brain volumes, and amounts of AD proteins in brain and cerebrospinal fluid (CSF).
This is a real dataset that has been used in over 1700 publications. We’ve simplified it for the workshop, removing or altering some information about individual patients in the process. ADNI is a fairly open dataset, meaning that many researchers who are not part of the original team have access to the data. However, because these data are about real patients, some extra permissions are required for each new use of the data. If you’re interested in taking the work we do today further with the full dataset, you should submit a separate data request to ADNI.
First, let’s download and look at some of the cleaned spreadsheets from the ADNI Teaching dataset.
We’ll need the following files:
In order to answer the questions described above, we’ll need to do the following basic data operations:
In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Putting our data into a relational database and using SQL will help us achieve these goals.
A relational database stores data in relations made up of records with fields. The relations are usually represented as tables; each record is usually shown as a row, and the fields as columns. In most cases, each record will have a unique identifier, called a key, which is stored as one of its fields. Records may also contain keys that refer to records in other tables, which enables us to combine information from two or more sources.
Using a relational database serves several purposes.
There are a number of different database management systems for working with relational data. We’re going to use SQLite today, but basically everything we teach you will apply to the other database systems as well (e.g. MySQL, PostgreSQL, MS Access, MS SQL Server, Oracle Database and Filemaker Pro). The only things that will differ are the details of exactly how to import and export data and the details of data types.
Let’s look at a pre-existing database, the adni.sqlite file from the ADNI Teaching dataset that we downloaded during Setup. Click on the “Open Database” button, select the adni.sqlite file, and click “Open” to open the database.
You can see the tables in the database by looking at the left hand side of the screen under Database Structure tab. Here you will see a list under “Tables.” Each item listed here corresponds to one of the csv files we were exploring earlier. To see the contents of any table, click on it, and then click the “Browse Data” tab next to the “Database Structure” tab. This will give us a view that we’re used to - just a copy of the table. Hopefully this helps to show that a database is, in some sense, just a collection of tables, where there’s some value in the tables that allows them to be connected to each other (the “related” part of “relational database”).
The “Database Structure” tab also provides some metadata about each table. If you click on the down arrow next to a table name, you will see information about the columns, which in databases are referred to as “fields,” and their assigned data types.
(The rows of a database table are called records.) Each field contains one variety or type of data, often numbers or text. You can see in the biomark table that most fields contain numbers (BIGINT, or big integer, and FLOAT, or floating point numbers/decimals) while the demog table is mostly made up of text fields.
The “Execute SQL” tab is blank now - this is where we’ll be typing our queries to retrieve information from the database tables.
Before we get started with writing our own queries, we’ll create our own database. We’ll be creating this database from the three csv files we downloaded earlier. Close the currently open database (File > Close Database) and then follow these instructions:
Field | Data Type | Motivation | Table(s) |
---|---|---|---|
RID | INTEGER | Field contains an ID coded as an integer | biomark |
PTID | TEXT | Field contains an ID coded as text | biomark |
VISCODE | TEXT | Field contains text data | biomark |
SITE | INTEGER | Field contains an ID coded as an integer | biomark |
COLPROT | TEXT | Field contains text data | biomark |
ORIGPROT | TEXT | Field contains text data | biomark |
EXAMDATE_mod | TEXT | Field contains a date coded as text | biomark |
FDG | REAL | Field contains measured numerical data | biomark |
PIB | REAL | Field contains measured numerical data | biomark |
AV45 | REAL | Field contains measured numerical data | biomark |
ABETA_mod | REAL | Field contains measured numerical data | biomark |
TAU_mod | REAL | Field contains measured numerical data | biomark |
PTAU_mod | REAL | Field contains measured numerical data | biomark |
FDG_bl | REAL | Field contains measured numerical data | biomark |
PIB_bl | REAL | Field contains measured numerical data | biomark |
AV45_bl | REAL | Field contains measured numerical data | biomark |
ABETA_bl_mod | REAL | Field contains measured numerical data | biomark |
TAU_bl_mod | REAL | Field contains measured numerical data | biomark |
PTAU_bl_mod | REAL | Field contains measured numerical data | biomark |
Finally, click OK one more time to confirm the operation.
You can also use this same approach to append new fields to an existing table.
Data type | Description |
---|---|
CHARACTER(n) | Character string. Fixed-length n |
VARCHAR(n) or CHARACTER VARYING(n) | Character string. Variable length. Maximum length n |
BINARY(n) | Binary string. Fixed-length n |
BOOLEAN | Stores TRUE or FALSE values |
VARBINARY(n) or BINARY VARYING(n) | Binary string. Variable length. Maximum length n |
INTEGER(p) | Integer numerical (no decimal). |
SMALLINT | Integer numerical (no decimal). |
INTEGER | Integer numerical (no decimal). |
BIGINT | Integer numerical (no decimal). |
DECIMAL(p,s) | Exact numerical, precision p, scale s. |
NUMERIC(p,s) | Exact numerical, precision p, scale s. (Same as DECIMAL) |
FLOAT(p) | Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation. |
REAL | Approximate numerical |
FLOAT | Approximate numerical |
DOUBLE PRECISION | Approximate numerical |
DATE | Stores year, month, and day values |
TIME | Stores hour, minute, and second values |
TIMESTAMP | Stores year, month, day, hour, minute, and second values |
INTERVAL | Composed of a number of integer fields, representing a period of time, depending on the type of interval |
ARRAY | A set-length and ordered collection of elements |
MULTISET | A variable-length and unordered collection of elements |
XML | Stores XML data |
Different databases offer different choices for the data type definition.
The following table shows some of the common names of data types between the various database platforms:
Data type | Access | SQLServer | Oracle | MySQL | PostgreSQL |
---|---|---|---|---|---|
boolean | Yes/No | Bit | Byte | N/A | Boolean |
integer | Number (integer) | Int | Number | Int / Integer | Int / Integer |
float | Number (single) | Float / Real | Number | Float | Numeric |
currency | Currency | Money | N/A | N/A | Money |
string (fixed) | N/A | Char | Char | Char | Char |
string (variable) | Text ( <256) / Memo (65k+) | Varchar | Varchar2 | Varchar | Varchar |
binary object OLE Object Memo Binary (fixed up to 8K) | Varbinary ( <8K) | Image ( <2GB) Long | Raw Blob | Text Binary | Varbinary |