.. _sql_query-label:

SQLQuery Class
================================================

:mod:`ibmcloudsql.SQLQuery` provides the following single class:

* :py:class:`.SQLQuery` class

An :class:`.SQLQuery` object is also a :class:`.COSClient` and an :class:`.SQLBuilder`. This way, the :class:`.SQLQuery` class acts as the central, consolidated API interface for the entire ibmcloudsql module.

.. code-block:: python

    sqlClient = SQLQuery(....)

1. :py:class:`.SQLQuery` (...): instantiate with initial settings
2. :meth:`.SQLQuery.configure`: update the settings

Help
------------

1. :meth:`.help`
2. :meth:`.sql_info`
3. :meth:`.get_job_demo`
4. :meth:`.get_cos_summary_demo`
5. :meth:`.list_results_demo`
6. :meth:`.analyze`: If you are stuck with a query that takes too long, try this method. It may provide suggestions on how to improve the query or revise the data source.

Submit SQL jobs
-------------------

1. :meth:`.SQLQuery.submit_sql` (see the sketch after this list)
2. :meth:`.SQLQuery.run_sql`, :meth:`.SQLQuery.execute_sql`
3. :meth:`.submit_and_track_sql`
4. :meth:`.get_schema_data`
5. :meth:`.submit` and :meth:`.run`: These two newer methods run an SQL statement constructed with the :class:`.SQLBuilder` API.
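
The following is a minimal sketch that combines instantiation, submission, and result retrieval. The API key, instance CRN, and bucket URL are placeholders that you must replace with values from your own account, and the sample table path ``cos://us-geo/sql/employees.parquet`` is purely illustrative.

.. code-block:: python

    from ibmcloudsql import SQLQuery

    # Placeholder credentials and locations -- replace with your own values.
    sqlClient = SQLQuery(
        api_key="<your-IBM-Cloud-API-key>",
        instance_crn="<your-SQL-Query-instance-CRN>",
        target_cos_url="cos://us-south/<your-bucket>/results/",
    )

    # Submit the job asynchronously, wait for it to finish, then fetch the
    # result as a pandas DataFrame.
    job_id = sqlClient.submit_sql(
        "SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10"
    )
    sqlClient.wait_for_job(job_id)
    df = sqlClient.get_result(job_id)

    # run_sql() wraps the three steps above into a single call.
    df = sqlClient.run_sql(
        "SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10"
    )
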
Work with query results
--------------------------

1. :meth:`.get_result`
2. :meth:`.delete_result`
3. :meth:`.rename_exact_result`: Rename the result objects created on Cloud Object Storage.
4. :meth:`.rename_exact_result_joblist`
5. :meth:`.delete_empty_objects`
6. :meth:`.list_results`

Manage jobs
--------------

1. :meth:`.my_jobs`
2. :meth:`.wait_for_job`
3. :meth:`.process_failed_jobs_until_all_completed`
4. :meth:`.get_job`
5. :meth:`.get_jobs`
6. :meth:`.get_number_running_jobs`
7. :meth:`.get_jobs_with_status`
8. :meth:`.get_jobs_count_with_status`
9. :meth:`.export_job_history`

Manage Cloud Object Storage URL
-------------------------------

See :class:`.COSClient`

Manage table catalog
------------------------

From :class:`.HiveMetastore`

1. :meth:`.show_tables`
2. :meth:`.drop_all_tables`
3. :meth:`.drop_tables`
4. :meth:`.drop_table`
5. :meth:`.create_table`
6. :meth:`.create_partitioned_table`: For partitioned tables.
7. :meth:`.recover_table_partitions`
8. :meth:`.describe_table`

Data skipping
----------------------

[Not available yet]

Limitations
------------------------------------------------

* The SQL statement string size limit is 200 KB.
* A standard SQL Query instance runs a maximum of five concurrent SQL queries.
* A query job can run for at most one hour. However, many jobs are stopped much earlier due to the current IAM token timeout mechanism; this token is shared across all currently running SQL queries.

Tips
-----

* Combine SQL queries where you can, as each REST API request adds overhead (and possibly cost). However, also keep in mind the current 7.5 GB limit of a YARN executor and design the SQL query accordingly. It is best if the data being accessed is organized as multiple objects of suitable size (see below), since this enables more parallelism in Object Storage.
* Complex data can only be stored as JSON or Parquet; Parquet is faster.
* Avoid storing the data in a single object larger than 200 MB. To check, consider using :meth:`.get_cos_summary` or :meth:`.list_results`. To resolve the issue, consider one of the following:

  + Partition the result into a fixed number of buckets/objects (type 1): ``PARTITIONED INTO x BUCKETS/OBJECTS``, where the maximum allowed value for ``x`` is 50.
  + Partition the result into multiple objects by row count (type 2): ``PARTITIONED EVERY x ROWS``.
  + Hive-style partitioning: ``PARTITIONED BY (col1, col2, ...)``.

* When partitioning by a column that contains NULL values, Spark uses ``__HIVE_DEFAULT_PARTITION__`` in the object name, for example, ``/Location=__HIVE_DEFAULT_PARTITION__/``.

.. code-block:: python

    sqlClient.list_results(job_id)

.. code-block:: console

    ObjectURL    Size    Bucket    Object
    0    cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa    0    sql-query-cos-access-ts    jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa
    1    cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS    0    sql-query-cos-access-ts    jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/_SUCCESS
    2    cos://s3.us-south.cloud-object-storage.appdomain.cloud/sql-query-cos-access-ts/jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet    7060033106    sql-query-cos-access-ts    jobid=a3475263-469a-4e22-b382-1d0ae8f1d1fa/part-00000-e299e734-43e3-4032-b27d-b0d7e93d51c2-c000-attempt_20200318152159_0040_m_000000_0.snappy.parquet

References
--------------

* `Sparksql-parser `_: The module contains the know-how to parse a CloudSQL-specific statement and transform it into a valid SQL statement.
* `Grammar `_
* `Tips for data layout `_
* `Data skipping `_

SQLClientTimeSeries Class
================================================

:mod:`ibmcloudsql.sql_query_ts` provides the :py:class:`.SQLClientTimeSeries` class, which is derived from the :py:class:`.SQLQuery` class. The class provides APIs that make it easier to issue SQL statements containing time-series functions.

Prepare data for time series
-------------------------------------

* :meth:`.get_ts_datasource`

Assume that you have created a HIVE catalog table to store the data for fast access; it is used as the data source via the `table_name` argument. A time series comprises the following:

1. The `time_stamp` information.
2. The `observation` information.
3. The category, for example, the `key`.

Very often, the raw data is too dense to be digested into a time series. Such data has to be transformed onto a coarser timescale, using a proper aggregate function, for example, avg() or max(). The time window over which each summarized data point is collected is given by the `granularity` argument, for example:

* `raw`: No change, just extract to a new location.
* `per_sec`, or `PT1S`: Every second.
* `per_2sec`, or `PT2S`: Every two seconds.
* `per_min`, or `PT1M`: Every minute.
* `per_5min`, or `PT5M`: Every five minutes.

In general, a valid `granularity` value follows one of these conventions:

* `per_[x]sec` and `per_[x]min`, where `x` is a divisor of 60
* the ISO 8601 duration standard

The transformed data is then copied and saved into a new location (the time-series data source), which is specified by the following:

* `cos_out`: The Cloud Object Storage URL (stored as PARQUET).
* `num_objects`: Split into a given number of objects.
* `num_rows`: Split into multiple objects based on the number of rows per object.

**At the end of the transformation**, the data source to be used for time-series creation comprises the following three columns:

* `field_name`: Represents the category.
* `time_stamp`: Represents the point in time at the given granularity.
* `observation`: Represents the recorded information.

Because these column names are generic, you can quickly apply the same workflow to any data source; a sketch of this preparation step is shown below.
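
The following is a minimal sketch of building a time-series data source. The table and column names (``customer_events``, ``customer_id``, ``event_time``, ``reading``) are examples, and the exact argument names and defaults of :meth:`.get_ts_datasource` may differ between library versions, so treat this as illustrative rather than definitive.

.. code-block:: python

    from ibmcloudsql.sql_query_ts import SQLClientTimeSeries

    # Same constructor arguments as SQLQuery (placeholders below).
    ts_client = SQLClientTimeSeries(
        api_key="<your-IBM-Cloud-API-key>",
        instance_crn="<your-SQL-Query-instance-CRN>",
        target_cos_url="cos://us-south/<your-bucket>/ts/",
    )

    # Illustrative call: aggregate the raw readings per minute and save the
    # result as the time-series data source. The argument list sketches the
    # options described above rather than a guaranteed signature.
    ts_client.get_ts_datasource(
        table_name="customer_events",   # HIVE catalog table used as input
        key="customer_id",              # becomes the generic `field_name` column
        time_stamp="event_time",        # becomes the generic `time_stamp` column
        observation="reading",          # becomes the generic `observation` column
        granularity="per_min",          # or an ISO 8601 duration such as "PT1M"
        cos_out="cos://us-south/<your-bucket>/ts_datasource/",
        num_objects=20,                 # split the output into multiple objects
    )
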
User-friendly SQL string
-------------------------------------

Several time-series functions used in CloudSQL accept parameters in milliseconds, which is not user-friendly. The class provides built-in functionality to map user-friendly names such as `hour`, `day`, `2hour`, and `week` to the right unit before sending the query to the CloudSQL service.
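
The sketch below illustrates the idea of that mapping only; it is not the library's internal implementation, and the helper ``to_milliseconds`` is a hypothetical name used for demonstration.

.. code-block:: python

    import re

    # Illustrative mapping only: the actual conversion is performed internally
    # by SQLClientTimeSeries before the query is sent to the CloudSQL service.
    MS_PER_UNIT = {
        "sec": 1000,
        "min": 60 * 1000,
        "hour": 60 * 60 * 1000,
        "day": 24 * 60 * 60 * 1000,
        "week": 7 * 24 * 60 * 60 * 1000,
    }

    def to_milliseconds(name: str) -> int:
        """Translate a user-friendly window name such as 'hour' or '2hour'
        into milliseconds (hypothetical helper, for illustration)."""
        match = re.fullmatch(r"(\d*)(sec|min|hour|day|week)", name)
        if not match:
            raise ValueError(f"Unsupported window name: {name}")
        count = int(match.group(1) or 1)
        return count * MS_PER_UNIT[match.group(2)]

    print(to_milliseconds("hour"))   # 3600000
    print(to_milliseconds("2hour"))  # 7200000
    print(to_milliseconds("week"))   # 604800000
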