SQLQuery Class¶
ibmcloudsql.SQLQuery provides a single class: SQLQuery. An SQLQuery object is also a COSClient and an SQLBuilder, so the SQLQuery class acts as the central, consolidated API interface for the entire ibmcloudsql module.
sqlClient = SQLQuery(....)
SQLQuery(…): instantiate with the initial settings.
SQLQuery.configure(): update the settings.
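For example, a minimal sketch of creating a client (the API key, instance CRN, and Cloud Object Storage URL below are placeholders):
import ibmcloudsql

sqlClient = ibmcloudsql.SQLQuery(
    api_key="<your_ibm_cloud_api_key>",
    instance_crn="<your_sql_query_instance_crn>",
    target_cos_url="cos://us-south/<your-bucket>/results/",
)
sqlClient.logon()  # authenticate before submitting queries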
Help¶
analyze(): If you are stuck with a query that takes too long, try this one. It may provide suggestions on how to improve your query or revise your data source.
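For example (a sketch; it assumes analyze() accepts the ID of a previously submitted job):
sqlClient.analyze(job_id)  # may print hints on how to speed up this query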
Submit SQL jobs¶
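For example, a minimal sketch of the two submission styles (the Cloud Object Storage URL is a placeholder):
sql = "SELECT * FROM cos://us-south/<bucket>/data.parquet STORED AS PARQUET LIMIT 10"

df = sqlClient.run_sql(sql)  # blocking: submit, wait, and fetch the result

job_id = sqlClient.submit_sql(sql)  # asynchronous: submit, then poll
sqlClient.wait_for_job(job_id)
df = sqlClient.get_result(job_id)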
Work with query results¶
rename_exact_result(): Modify (rename) the result objects created on Cloud Object Storage.
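A sketch (rename_exact_result() is assumed here to take the ID of a completed job):
sqlClient.rename_exact_result(job_id)  # e.g., to give the result object a cleaner name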
Manage jobs¶
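For example:
sqlClient.get_job(job_id)  # status and details of a single job
sqlClient.get_jobs()       # list the most recent jobs of this instance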
Manage Cloud Object Storage URL¶
See COSClient
Manage table catalog¶
See HiveMetastore
Data skipping¶
[Not available yet]
Limitations¶
The SQL statement string size limit is 200KB.
Maximum five concurrent SQL queries for a standard SQL Query instance.
Maximum duration of one hour for a query job. However, a job can be stopped much earlier by the current IAM token timeout mechanism, because the token is shared across all currently running SQL queries.
Tips¶
Combine SQL queries when you can, as each REST API request incurs overhead (and possibly a monetary cost). However, also consider the current 7.5 GB limit for a YARN executor, and design the SQL query accordingly. It is best if the data being accessed is organized as multiple objects of the ideal size (see below), since this enables more parallelism in the Object Storage.
Complex data can be stored only as JSON or Parquet; Parquet is faster.
Avoid storing data where a single object is larger than 200 MB. To check, consider using get_cos_summary() or list_results(). To resolve the issue, consider one of the following (see the sketch after this list):
Partition the table into multiple buckets/objects, type 1: PARTITION INTO <x> BUCKETS/OBJECTS, where the maximum allowed value for 'x' is 50.
Partition the table into multiple buckets/objects, type 2: PARTITIONED EVERY <x> ROWS.
Hive-style partitioning: PARTITION BY (col1, col2, …).
When partitioning on a column that has NULL values, Spark uses "__HIVE_DEFAULT_PARTITION__" in the object name, for example, <bucket>/Location=__HIVE_DEFAULT_PARTITION__/<data-partition>.
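For example, a sketch of the partitioning clauses above in the INTO target of a query (bucket names and columns are placeholders):
sql = (
    "SELECT col1, col2 FROM cos://us-south/<bucket>/input/ STORED AS PARQUET "
    "INTO cos://us-south/<bucket>/output/ STORED AS PARQUET "
    "PARTITIONED INTO 20 OBJECTS"   # or: PARTITIONED EVERY 100000 ROWS
)                                   # or: PARTITIONED BY (col1)
job_id = sqlClient.submit_sql(sql)

To inspect the objects a finished job produced: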
sqlClient.list_results(job_id)
ObjectURL Size Bucket Object
0 cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa 0 sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa
1 cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS 0 sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS
2 cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet 7060033106 sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet
References¶
Sparksql-parser: This module contains the know-how to parse a CloudSQL-specific statement and transform it into a valid SQL statement.
SQLClientTimeSeries Class¶
ibmcloudsql.sql_query_ts
provides the SQLClientTimeSeries class, which is derived from the SQLQuery class. It provides APIs that make it easier to issue SQL statements that contain time-series functions.
Prepare data for time series¶
Let’s assume that you have created a HIVE catalog table to store the data for fast access; it is used as the data source via the table_name argument.
A time series comprises the following:
The time_stamp information.
The observation information.
Category, for example, the key.
Very often, the raw data is too dense to be ingested into a time series as-is. Such data has to be transformed to a coarser timescale, using a proper aggregation function, for example, avg() or max(). The time window during which each summarized data point is collected is given by passing a value to the granularity argument, as in the following examples:
raw: No change, just extract to a new location.
per_sec, or PT1S: Per every second.
per_2sec, or PT2S: Per every two seconds.
per_min, or PT1M: Per every minute.
per_5min, or PT5M: Per every five minutes.
In general, a valid value for granularity follows one of these conventions:
‘per_[x]sec’ and ‘per_[x]min’, where x evenly divides 60
the ISO 8601 duration standard
The transformed data is then copied and saved to a new location (the time-series data source), which is specified by the following (see the sketch after this list):
cos_out: the Cloud Object Storage URL (stored as PARQUET).
num_objects: split the output into the given number of objects.
num_rows: split the output into multiple objects based on the number of rows per object.
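A sketch of the transformation step, assuming the data-preparation method is get_ts_datasource() with the parameter names described above (the exact method name and signature may differ between versions; column and bucket names are placeholders):
from ibmcloudsql import sql_query_ts

ts_client = sql_query_ts.SQLClientTimeSeries(
    api_key="<api_key>",
    instance_crn="<instance_crn>",
    target_cos_url="cos://us-south/<bucket>/results/",
)
ts_client.logon()
ts_client.get_ts_datasource(
    table_name="my_hive_table",   # the HIVE catalog table used as input
    key="device_id",              # becomes field_name
    time_stamp="ts",              # becomes time_stamp
    observation="reading",        # becomes observation
    granularity="per_min",        # resample to one data point per minute
    cos_out="cos://us-south/<bucket>/ts_data/",
    num_objects=20,               # split the output into 20 objects
)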
At the end of the transformation, the data source to be used for time series creation comprises the following three columns:
field_name: Representing category.
time_stamp: Representing the point of time at the given granularity.
observation: Representing the recorded information.
Because these column names are generic, you can quickly apply the same time-series queries to any data source.
User-friendly SQL string¶
Several of the time-series functions used in CloudSQL accept parameters in units of milliseconds, which is not user-friendly. The class provides built-in functionality to map user-friendly names such as hour, day, 2hour, and week to the right unit before sending the query to the CloudSQL service.
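As an illustration only (not the library’s actual code), such a mapping might look like the following:
_MS = {"sec": 1000, "min": 60000, "hour": 3600000, "day": 86400000, "week": 604800000}

def to_milliseconds(name):
    """Map names such as 'hour', '2hour', or 'week' to milliseconds."""
    import re
    m = re.fullmatch(r"(\d*)(sec|min|hour|day|week)", name)
    if m is None:
        raise ValueError("unrecognized duration: " + name)
    count = int(m.group(1) or 1)
    return count * _MS[m.group(2)]

to_milliseconds("2hour")  # 7200000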