Including multiple seasonal terms in Python statsmodels.tsa ARIMA

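I am trying to model a time series in Python, using Python 2.7.11 and the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks. As a result, the data has multiple seasonal components: days form a 24-hour period and weeks form a 168-hour period.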

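At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonalities, as they allow only a single seasonal factor to be specified. However, I came across Rob Hyndman's work on multiple seasonalities in R. He models the seasonal components of a time series with Fourier series, including in the model one Fourier series at the frequency corresponding to each seasonal period.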
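As a rough illustration of the Fourier-term idea, the sketch below builds sine/cosine pairs for a 24-hour and a 168-hour period and stacks them into one regressor matrix. The helper name `fourier_terms` and the harmonic counts `k=2` and `k=1` are hypothetical choices made only for this example; they are not taken from Hyndman's R code or from statsmodels.

```python
import numpy as np


def fourier_terms(n_obs, period, k):
    """Return an (n_obs, 2*k) array of sin/cos pairs for one seasonal period.

    Column order is sin_1, cos_1, sin_2, cos_2, ... where harmonic j has
    frequency j / period cycles per observation.
    """
    t = np.arange(n_obs)
    columns = []
    for j in range(1, k + 1):
        angle = 2.0 * np.pi * j * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


# Hourly data: one set of terms for the daily cycle, one for the weekly cycle.
n_obs = 1008  # six weeks of hourly observations (hypothetical length)
daily = fourier_terms(n_obs, period=24, k=2)    # k=2 is an arbitrary choice
weekly = fourier_terms(n_obs, period=168, k=1)  # k=1 is an arbitrary choice
exog = np.hstack([daily, weekly])
print(exog.shape)  # (1008, 6)
```

Using a sine and a cosine for each harmonic means the regression can estimate the phase of each cycle itself, so the regressors do not need to be shifted by hand.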

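I used Welch's method to obtain the power spectral density of the signal in my observed time series, extracted the peaks at the frequencies corresponding to the seasonal effects I expect, and used those frequencies and amplitudes to generate sine waves matching the seasonal trends I expect in the data. As an aside, I believe this lets me bypass Hyndman's step of selecting the value of k based on AIC, because I am using the signal inherent in the observed data.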
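The peak-extraction step can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the series is simulated, the sampling rate is expressed as one sample per hour (so frequencies are in cycles per hour rather than Hz), and `nperseg=504` is chosen so that both expected periods fall exactly on frequency bins. `scipy.signal.find_peaks` is used so that two neighbouring bins of the same spectral peak are not picked twice.

```python
import numpy as np
from scipy.signal import welch, find_peaks

# Simulated hourly series with a daily and a weekly cycle plus noise.
rng = np.random.RandomState(0)
t = np.arange(1008)  # hours
series = (3.0 * np.sin(2 * np.pi * t / 24)
          + 2.0 * np.sin(2 * np.pi * t / 168)
          + rng.normal(scale=0.5, size=t.size))

# fs=1.0 means one sample per hour, so frequencies come back in cycles/hour.
# nperseg=504 (three weeks) puts 1/24 and 1/168 exactly on frequency bins.
freqs, power = welch(series, fs=1.0, nperseg=504, scaling="spectrum")

# Keep the two highest local maxima of the spectrum.
peak_idx, _ = find_peaks(power)
top_two = peak_idx[np.argsort(power[peak_idx])[-2:]]

periods = np.sort(1.0 / freqs[top_two])               # in hours
amplitudes = np.sort(np.sqrt(2.0 * power[top_two]))   # peak amplitudes
print(periods)
print(amplitudes)
```

With `scaling="spectrum"`, the one-sided value at an isolated sinusoidal peak is roughly the squared amplitude divided by two, which is why the sketch takes `sqrt(2 * power)`. Taking `sqrt(power)` alone gives the RMS value rather than the peak amplitude.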

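To make sure the sine waves line up with the occurrence of the seasonal pattern in the data, I match the peaks of both sine waves to the peaks in the observed data: I pick a peak within a 24-hour period and match the hour at which it occurs to the highest value of the variable representing the sine wave. Before doing so, I checked that the daily peak consistently occurs at the same hour.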
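An alternative to matching peaks by hand, shown here only as a sketch on simulated data, is to regress the series on a sine and a cosine of the same period and read the amplitude and peak hour off the two coefficients. The values `true_amp` and `true_peak_hour` are hypothetical and exist only to check that the recovery works.

```python
import numpy as np

t = np.arange(1008)
true_amp, true_peak_hour = 3.0, 17  # hypothetical values for the simulation

# A daily cycle whose maximum falls at hour 17 of each day.
y = true_amp * np.cos(2 * np.pi * (t - true_peak_hour) / 24)

# Regress on a sine/cosine pair; the coefficients encode amplitude and phase.
X = np.column_stack([np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)])
(a, b), _, _, _ = np.linalg.lstsq(X, y, rcond=None)

amp = np.hypot(a, b)
# a = A*sin(phi) and b = A*cos(phi), so phi = atan2(a, b); convert to hours.
peak_hour = (np.arctan2(a, b) * 24 / (2 * np.pi)) % 24
print(amp, peak_hour)
```

The same reasoning applies when the sine/cosine pair is passed to a model as exogenous regressors: the fitted coefficients absorb the phase, so a separate alignment step is not strictly needed.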

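So far, so good: plots of the sine waves constructed from the obtained frequencies and amplitudes roughly correspond to the observed data. I then fit an ARIMA(2,0,0) model, including the two decomposition-based variables as exogenous variables. At this point, I want to test the predictive utility of the model. However, this is where things get complicated.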

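When I use ARIMA from the statsmodels package, the estimates I get from the fitted model form a pattern that replicates the sine waves, but with a range of values that matches my observations. There is still a lot of variance in the observations that the seasonal trends do not account for, which leads me to believe that something in the model fitting procedure is not working the way it is intended.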

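Unfortunately, I am not well-versed enough in the art of time series modeling to know whether my unexpected results are due to the nature of the exogenous variables I include, to statsmodels functionality that I should be using but am omitting, or to wrong assumptions about the concept of seasonal trends.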

Some specific questions I have are:

Is it possible to include multiple seasonal trends (i.e. Fourier-based or decomposition-based) in an ARIMA model using statsmodels in Python?

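Does reconstructing the seasonal trends with sine waves cause difficulties when the sine waves are included in the model as exogenous variables, as described above and in the code below?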

Why does the model specified in the code below not produce predictions that match the observed data more closely?
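One thing that may be relevant to the last question, offered as a guess rather than a diagnosis: the `predict` calls in the code pass `dynamic=True`. In statsmodels this means that, from `start` onward, the autoregressive part is fed earlier predictions instead of observed values, so its contribution fades and mostly the exogenous (sine) component remains. The numpy-only sketch below imitates that effect for a simulated AR(1) process with a known coefficient `phi`; it does not call statsmodels, and all values are hypothetical.

```python
import numpy as np

# Simulate a zero-mean AR(1) process with a known coefficient.
rng = np.random.RandomState(1)
n, phi = 200, 0.8
y = np.zeros(n)
for i in range(1, n):
    y[i] = phi * y[i - 1] + rng.normal()

start = 100

# One-step-ahead: every prediction uses the observed previous value.
one_step = phi * y[start - 1:n - 1]

# Dynamic: after `start`, each prediction feeds on the previous prediction.
dynamic = np.empty(n - start)
prev = y[start - 1]
for i in range(n - start):
    prev = phi * prev
    dynamic[i] = prev

print(one_step[:3], dynamic[:3])
print(abs(dynamic[-1]))  # decays toward the process mean of zero
```

The two prediction paths agree at the first step and then diverge: the one-step path keeps tracking the data, while the dynamic path decays geometrically toward the mean.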

Any help is greatly appreciated!

Best wishes, and thanks in advance,

Flip

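P.S.: Sorry if my code sample and data file are overly long. Because I am not sure what is causing the unexpected results, I figured I would post the whole thing. Also, apologies for occasionally not following PEP8; I am still learning :)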

Code example:

import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator


# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#


def plotmean(timeseries, show=0, path=''):
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#


def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('Power spectrum [V**2]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den


#
# Function which gets amplitude and frequency of n most important periodical cycles, and provides plot
# to visually inspect if they correspond to expected seasonal components.
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of corresponding sine functions


def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = density
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
            plt.show()
            plt.clf()
    return fstore, astore


def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    t = np.linspace(0, time_interval, time_interval * rate)
    y = amp * np.sin(w * t)
    return y


#
# Function which adapts the calculated sine waves for the returned sines for k = 1 through k = kmax
# 'dta' = Data set


def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect mean plot, select interval which has clear and representative peak, use to determine index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # peak in sound level at index 1 is matched by peak in sine wave at index 7. Thus, sound level[0] corresponds to\
    # sine waves[6]
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog


#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#


def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = dflength * .80
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#

def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput


#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y


#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the PACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the ACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results


#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df


#
# Function to use the fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'k' = exogenous variables (sine waves) covering the prediction period
# 'df' = data frame containing observed and predicted values for the data on which the ARIMA model was fitted
# 'arima' = output of the modeling procedure (fitted ARIMA results)
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = '', don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#


def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]


dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)


Data file
