SQLQuery Class¶

ibmcloudsql.SQLQuery provides the following single class:

SQLQuery class

An SQLQuery class is also a COSClient class, and an SQLBuilder class. This way the SQLQuery class acts as the central consolidated API interface for the entire ibmcloudsql module.

sqlClient = SQLQuery(....)

SQLQuery (…): instantiate with initial settings
SQLQuery.configure(): update the setting

Help¶

analyze(): If you are stuck with a query that takes too long, try this one. It may provide you with suggestions on how to improve your query or revise your data source.

Submit SQL jobs¶

SQLQuery.submit_sql()
SQLQuery.run_sql(), SQLQuery.execute_sql()
submit_and_track_sql()
get_schema_data()
submit() and run(): The two new methods run an SQL statement generated by the approach provided by SQLBuilder class.

Work with query results¶

get_result()
delete_result()
rename_exact_result(): Modify the created objects on Cloud Object Storage.
rename_exact_result_joblist()
delete_empty_objects()
list_results()

Manage jobs¶

Manage Cloud Object Storage URL¶

See COSClient

Manage table catalog¶

From HiveMetastore

Data skipping¶

[Not available yet]

Limitations¶

The SQL statement string size limit is 200KB.
Maximum five concurrent SQL queries for a standard SQL Query instance.
Maximum duration of one hour for a query job. However, many jobs can be stopped much earlier due to the current mechanism of AIM token timeout, and this token is shared across all current SQL queries.

Tips¶

Combine the SQL query if you can, as there is an overhead (and possibly $ cost) for a REST API request. However, also consider the current limit for a YARN executor of 7.5GB, so design the SQL query accordingly. It is best if the data being accessed is organized with multiple objects of ideal sizes (see below), since this enables more parallelism in the Object Storage.
Complex data can only be stored using Json or Parquet, it is faster with Parquet.
Avoid storing the data with a single object’s size larger than 200MB. To check, consider using get_cos_summary() or list_results(). To resolve the issue, consider using the following:
- Partition table into multiple buckets/objects type-1: PARTITION INTO <x> BUCKETS/OBJECTS, with maximum allowed for ‘x’ is 50.
- Partition table into multiple buckets/objects type-2: PARTITIONED EVERY <x> ROWS.
- Hive-style partitioning: PARTITION BY (col1, col2, …).
When partitioning according to a column that has NULL values, Spark will use “__HIVE_DEFAULT_PARTITION__” in the object name, for example, <bucket>/Location=__HIVE_DEFAULT_PARTITION__/<data-partition>.

sqlClient.list_results(job_id)

ObjectURL       Size    Bucket  Object
     cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa       0       sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa
     cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS      0       sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS
     cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet    7060033106      sql-query-cos-access-ts jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet

References¶

Sparksql-parser: The module contains code with the know how to parse an SQLCloud-specific statement and transform it into a valid SQL statement.
Grammar
Tips for data layout
Data skipping

SQLClientTimeSeries Class¶

ibmcloudsql.sql_query_ts provides the SQLClientTimeSeries class which is derived from SQLQuery class. The class provides APIs to make it easier to issue SQL statement that contains time-series functions.

Prepare data for time series¶

get_ts_datasource()

Let’s assume that you create a HIVE catalog table to store the data for fast access that is used as the data source via table_name argument.

A time series is comprised of the following:

The time_stamp information.
The observation information.
Category, for example, the key.

Very often, the raw data is too dense to be digested into a time series. Such data has to be transformed into a coarser timescale, using a proper aggregated function, for example, avg() or max(). The time window during which the summarized data point is to be collected is given by passing a value to granularity argument, as in the following example:

raw: No change, just extract to a new location.
per_sec, or PT1S: Per every second.
per_2sec, or PT2S: Per every two seconds.
per_min, or PT1M: Per every minute.
per_5min, or PT5M: Per every five minutes.

In general, a valid value to granularity follows the following conventions:

‘per_[x]sec’ and ‘per_[x]min’ with x is divisible by 60
ISO 8601 duration standard

The transformed data is then copied and saved into a new location (the time-series data source), which is specified by the following:

cos_out: Cloud Object Storage URL (stored as PARQUET).
num_objects: Split into multiple objects.
num_rows: Split into multiple objects based on number of rows per object.

At the end of the transformation, the data source to be used for time series creation comprises the following three columns:

field_name: Representing category.
time_stamp: Representing the point of time at the given granularity.
observation: Representing the recorded information.

If you use a generic name, you can quickly apply it to any data source.

User-friendly SQL string¶

Several time-series functions being used in CloudSQL accept parameters in the unit of mili-seconds, which is not user-friendly. The class provides the built-in functionality to map user-friendly name such as hour, day, 2hour, week into the right unit, before sending the query to CloudSQL service.

ibmcloudsql

Navigation

Related Topics