Introduction¶
What is tspy?¶
tspy is the Python wrapper to the Apache Spark-powered time-series library.
The library is provided as part of IBM Analytics Engine and can be used via IBM Watson Studio.
To load the library, we run:
import tspy
However, we may also need to load other modules.
What is a time series: STS and MTS?¶
A time series is a sequence of data values measured at successive, though not necessarily regular, points in time. A time point is called a timestamp, and its combination with the associated sequence of data values is called an observation.
We can have one or many data columns at each time point.
In general, your time-series data can be organized as a single time-series (STS) object or a multiple time-series (MTS) object. In an MTS, each time series belongs to a particular grouping value.
What is a timestamp in tspy?¶
A timestamp (or time tick) can be, but need not be, associated with a time reference system (TRS), which defines the granularity of each time tick and the start time.
The time is stored as a long type. In the simplest scenario, a timestamp is just an integer value that is inferred from the index of the data:
import tspy
values = [1.0, 2.0, 4.0]
x = tspy.time_series(values)
x
TimeStamp: 0 Value: 1.0
TimeStamp: 1 Value: 2.0
TimeStamp: 2 Value: 4.0
To make the long value human-readable, it needs to be mapped to a time-reference-system (TRS).
TRS is a local, regional or global system used to identify time. A time reference system defines a specific projection for forward and reverse mapping between timestamp and its numeric representation. A common example that most of us are familiar with is UTC time, which maps a timestamp (Jan 1, 2019 12am midnight GMT) into a 64-bit integer value (1546300800000) that captures the number of milliseconds that have elapsed since Jan 1, 1970 12am (midnight) GMT. Generally speaking, the timestamp value is better suited for human readability, while the numeric representation is better suited for machine processing.
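The forward and reverse mapping described above can be checked with Python's standard library alone. This is a minimal sketch and not part of tspy; the helper names `to_epoch_ms` and `from_epoch_ms` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Reference point of the numeric representation: Jan 1, 1970 midnight UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(ts: datetime) -> int:
    """Forward mapping: timezone-aware datetime -> milliseconds since the epoch."""
    return (ts - EPOCH) // timedelta(milliseconds=1)


def from_epoch_ms(ms: int) -> datetime:
    """Reverse mapping: milliseconds since the epoch -> UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


midnight_2019 = datetime(2019, 1, 1, tzinfo=timezone.utc)
ms = to_epoch_ms(midnight_2019)
print(ms)                 # 1546300800000
print(from_epoch_ms(ms))  # 2019-01-01 00:00:00+00:00
```

The integer form is what gets stored and compared; the datetime form is only needed when a value is shown to a person.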
What is an observation in tspy?¶
An observation is a combination of a timestamp and a value. The value can be of any type, e.g. a numeric value, a categorical value, or an array of numeric or categorical values.
In Python, a numeric value is represented in the computer using one of the following types:
built-in: int, float
np.ndarray, pd.DataFrame: int32, int64, float64
In tspy, an observation is of type Observation. However, you don’t create it directly. Instead, an observation is created using the observation() API.
import tspy
# simple timestamp ~ an int
x = tspy.observation(1, 1.0)
What is an observation collection in tspy?¶
An observation collection is a sequence of observations, with certain properties.
It is described in class ObservationCollection. However, we don’t create it directly from the class. Instead, an observation collection is created using tspy.observations().
import tspy
observations = tspy.observations(
tspy.observation(1, 1.0),
tspy.observation(2, 2.0),
tspy.observation(3, 3.0),
tspy.observation(4, 4.0)
)
Another option is to use the single-time-series builder tspy.builder() to create a single-time-series object, from which we can extract the observation collection using the result() API.
import tspy
ts_builder = tspy.ts_builder()
ts_builder.add(tspy.observation(1,1))
ts_builder.add(tspy.observation(2,2))
ts_builder.add(tspy.observation(1,3))
observations = ts_builder.result()
What is a segment in tspy?¶
A segment is an observation collection with:
extra information: start time and end time [the start/end time need not equal the first/last timestamp]
observations that are sorted in order.
It is represented by the Segment class. Generally, we don’t create a segment directly and we don’t store an individual segment separately. Instead, a segment is created by segmenting or windowing a TimeSeries or MultiTimeSeries object, which returns a new type such as SegmentTimeSeries or SegmentMultiTimeSeries.
window-based segmentation: segment by sliding a window (of given size) with an offset, which can be index-based (TimeSeries.segment(), MultiTimeSeries.segment()) or time-based (TimeSeries.segment_by_time(), MultiTimeSeries.segment_by_time()).
# .segment(window_size, offset)
seg_ts = ts.segment(3,2)
segment by silence:
# N/A
anchor-based segmentation: segment by filtering each value to the right segment (TimeSeries.segment_by(), TimeSeries.segment_by_anchor(), TimeSeries.segment_by_changepoint(), TimeSeries.segment_by_marker(), MultiTimeSeries.segment_by(), MultiTimeSeries.segment_by_anchor(), MultiTimeSeries.segment_by_changepoint(), MultiTimeSeries.segment_by_marker()). Example: put the values into 2 segments (one holds odd values, and one holds even values):
seg_ts = ts.segment_by(lambda x: x % 2 == 0)
seg_ts = ts.segment_by_time(3, 3)
seg_ts = ts.segment_by_anchor(lambda d: d%2 == 0, 1, 1)
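To make the two segmentation ideas above concrete without a tspy installation, here is a plain-Python sketch over a list of values. It is an illustration, not the tspy implementation: `sliding_windows` and `split_by` are hypothetical helpers, the second argument is assumed to be the step between window starts, and incomplete trailing windows are assumed to be dropped. tspy's actual behavior on these points is not stated in this section.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def sliding_windows(values: Sequence[int], window_size: int, step: int) -> List[List[int]]:
    """Index-based windows of exactly window_size items, advancing by step."""
    last_start = len(values) - window_size
    return [list(values[i:i + window_size]) for i in range(0, last_start + 1, step)]


def split_by(values: Sequence[int], key: Callable[[int], Hashable]) -> Dict[Hashable, List[int]]:
    """Group values by the result of key(value), preserving input order."""
    groups: Dict[Hashable, List[int]] = {}
    for v in values:
        groups.setdefault(key(v), []).append(v)
    return groups


values = [1, 2, 3, 4, 5, 6, 7]
print(sliding_windows(values, 3, 2))
# [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
print(split_by(values, lambda x: x % 2 == 0))
# {False: [1, 3, 5, 7], True: [2, 4, 6]}
```

With a step smaller than the window size, consecutive windows overlap, as the shared values 3 and 5 show.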
What is a (single) time-series (STS) in tspy?¶
It is represented by the TimeSeries class. To create a time series, however, we go through the builder, which accepts data in different forms. Eventually, the data is converted to an internal representation for STS and MTS.
In-memory list
Pandas dataframe
In-memory collection of observations (ObservationCollection)
User-defined reader (TimeSeriesReader)
import tspy
values = [1.0, 2.0, 4.0]
x = tspy.builder.time_series(values)
x
The example below shows how to create a simple STS where each index denotes a day after the start time of 1990-01-01 (TRS).
import tspy
import datetime
granularity = datetime.timedelta(days=1)
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
x = tspy.time_series([1, 2, 3], granularity=granularity, start_time=start_time)
REF: builders.time_series.time_series()
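The role of granularity and start time in the example above can be sketched with the standard library. The sketch assumes that time tick i corresponds to start_time + i * granularity, which matches the description "each index denotes a day after the start time". The helper names are illustrative and not tspy functions.

```python
from datetime import datetime, timedelta, timezone

granularity = timedelta(days=1)
start_time = datetime(1990, 1, 1, tzinfo=timezone.utc)


def index_to_datetime(index: int) -> datetime:
    """Map an integer time tick to wall-clock time under the assumed TRS."""
    return start_time + index * granularity


def datetime_to_index(moment: datetime) -> int:
    """Map wall-clock time back to an integer time tick (rounding down)."""
    return (moment - start_time) // granularity


for i, value in enumerate([1, 2, 3]):
    print(index_to_datetime(i).date(), value)
# 1990-01-01 1
# 1990-01-02 2
# 1990-01-03 3
```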
What is a multi-time-series (MTS) in tspy?¶
It is represented by the MultiTimeSeries class. To create a multi-time-series, however, we go through the builder tspy.builders.multi_time_series.
tspy accepts data in different forms, which can be converted to an internal representation for STS and MTS.
In-memory list
Pandas dataframe
In-memory collection of observations (ObservationCollection)
User-defined reader (TimeSeriesReader)
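import numpy as np
import pandas as pd
import tspy

# First row is the header; the leading empty column is dropped below.
data = np.array([['', 'letters', 'timestamp', "numbers"],
                 ['', "a", 1, 27],
                 ['', "b", 3, 4],
                 ['', "a", 5, 17],
                 ['', "a", 3, 7],
                 ['', "b", 2, 45]
                 ])
# np.array stores every cell as a string, so cast each column to its intended dtype.
df = pd.DataFrame(data=data[1:, 1:],
                  columns=data[0, 1:]).astype(dtype={'letters': 'object', 'timestamp': 'int64', 'numbers': 'float64'})
x = tspy.multi_time_series(df, ts_column='timestamp')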
REF: builders.multi_time_series.multi_time_series()
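As a rough picture of what "each time series belongs to a particular grouping value" means for the data above, the sketch below groups the same rows by the letters column in plain Python. This is an assumption for illustration only: the call above passes only ts_column, and this section does not say which column tspy uses as the key. `group_into_series` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Same records as the DataFrame above: (letters, timestamp, numbers).
rows = [("a", 1, 27.0), ("b", 3, 4.0), ("a", 5, 17.0), ("a", 3, 7.0), ("b", 2, 45.0)]


def group_into_series(records: List[Tuple[str, int, float]]) -> Dict[str, List[Tuple[int, float]]]:
    """Build one timestamp-sorted list of (timestamp, value) pairs per key."""
    series: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for key, timestamp, value in records:
        series[key].append((timestamp, value))
    return {key: sorted(obs) for key, obs in series.items()}


mts = group_into_series(rows)
for key, obs in mts.items():
    print(key, obs)
# a [(1, 27.0), (3, 7.0), (5, 17.0)]
# b [(2, 45.0), (3, 4.0)]
```

Each key owns an independent series, so timestamp 3 can appear under both "a" and "b" without conflict.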
What is a segment time-series (SegTS) in tspy?¶
A segment time-series, represented by either the SegmentTimeSeries or the SegmentMultiTimeSeries class, is a special form of time series that results from segmenting an STS/MTS object.