对齐多年的每日数据-Java 学习之路

从当天不同时间测量的多年温度记录开始，我想得到一个矩形的日平均数，每行代表一年的数据 .

数据看起来像这样

temperature.head()

date
1996-01-01 00:00:00     7.39
1996-01-01 03:00:00     6.60
1996-01-01 06:00:00     7.39
1996-01-01 09:00:00     9.50
1996-01-01 12:00:00    11.00
Name: temperature, dtype: float64

我计算了每日平均值

import pandas as pd
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()

哪个收益率

daily.head()

date
1996-01-01     9.89625
1996-01-02    10.73625
1996-01-03     6.98500
1996-01-04     5.62250
1996-01-05     8.84625
Freq: D, Name: temperature, dtype: float64

现在最后一部分我想到了类似的东西

yearly_daily_mean = daily.groupby(pd.TimeGrouper(freq='12M', closed="left"))

但这里有一些问题 .

我需要放弃数据的尾部而不是一整年 .
如果数据丢失会怎样？
如何应对闰年？
下一步是什么？也就是说，如何“堆叠”（在numpy 's, not pandas'意义上）多年的数据？

我在用

array_temperature = np.column_stack([group[1] for group in yearly_daily_mean if len(group[1]) == 365])

但应该有更好的方法 .

作为附属问题，我如何选择数据年份的开始日期？

3 回答

如果我理解正确，您希望将日常工作的时间序列（已经计算过）重新整形为矩形数据框，其中不同的日期为列，不同的年份为行 .
这可以通过pandas整形功能轻松实现，例如使用pivot：

一些虚拟数据：

In [45]: index = pd.date_range(start=date(1996, 1,1), end=date(2010, 6, 30), freq='D')

In [46]: daily = pd.DataFrame(index=index, data=np.random.random(size=len(index)), columns=['temperature'])

首先，我添加一年中和每年的列：

In [47]: daily['year'] = daily.index.year

In [48]: daily['day'] = daily.index.dayofyear

In [49]: daily.head()
Out[49]:
            temperature  year  day
1996-01-01     0.081774  1996    1
1996-01-02     0.694968  1996    2
1996-01-03     0.478050  1996    3
1996-01-04     0.123844  1996    4
1996-01-05     0.426150  1996    5

现在，我们可以重塑这个数据帧：

In [50]: daily.pivot(index='year', columns='day', values='temperature')
Out[50]:
day        1         2      ...          365       366
year                        ...
1996  0.081774  0.694968    ...     0.679461  0.700833
1997  0.043134  0.981707    ...     0.009357       NaN
1998  0.257077  0.297290    ...     0.701941       NaN
...        ...       ...    ...          ...       ...
2008  0.047145  0.750354    ...     0.996396  0.761159
2009  0.348667  0.827057    ...     0.881424       NaN
2010  0.269743  0.872655    ...          NaN       NaN

[15 rows x 366 columns]

回复于 2024-05-18T14:11:51+08:00

我就是这样做的 . 非常简单：用你想要的确切形状创建一个新的df，然后用你想要的东西填充它 .

from datetime import datetime
import numpy as np
import pandas as pd

# This is my re-creation of the data you have. (I'm calling it df1.)
# It's essential that your date-time be in datetime.datetime format, not strings
byear = 1996  # arbitrary
eyear = 2005  # arbitrary
obs_n = 50000 # arbitrary
start_time = datetime.timestamp(datetime(byear,1,1,0,0,0,0))
end_time = datetime.timestamp(datetime(eyear,12,31,23,59,59,999999))
obs_times = np.linspace(start_time,end_time,num=obs_n)
index1 = pd.Index([datetime.fromtimestamp(i) for i in obs_times])
df1 = pd.DataFrame(data=np.random.rand(obs_n)*20,index=index1,columns=['temp'])
# ^some random data

# Here is the new empty dataframe (df2) where you will put your daily averages. 
index2 = pd.Index(range(byear,eyear+1))
columns2 = range(1,367) # change to 366 if you want to assume 365-day years
df2 = pd.DataFrame(index=index2,columns=columns2)

# Some quick manipulations that allow the two dfs' indexes to talk to one another.
df1['year'] = df1.index.year # a new column with the observation's year as an integer
df1['day'] = df1.index.dayofyear # a new column with the day of the year as integer
df1 = df1.reset_index().set_index(['year','day'])

# Now get the averages for each day and assign them to df2.
for year in index2:
    for day in columns2[:365]: # for all but the last entry in the range
        df2.loc[year,day] = df1.loc[(year,day),'temp'].mean()
    if (year,366) in df1.index: # then if it's a leap year...
        df2.loc[year,366] = df1.loc[(year,366),'temp'].mean()

如果你不希望最后的df在第366天有任何空值，那么你可以删除最后的if语句，然后重写 columns2 = range(1,366) ，然后df2将拥有所有非空值（假设至少有在观察到的时间段内每天进行一次测量） .

回复于 2024-05-18T14:11:51+08:00

假设您已经从高频数据中获得每日平均值（包含 pd.DateTimeIndex ），原因如下：

daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()

IIUC，您希望将每日平均值转换为 DataFrame ，每行具有相同数量的列以捕获年度数据 . 当瞄准相同数量的列时，你提到闰年是一个潜在的问题 .

我可以想象两种方法：

选择每个 row 的天数 - 可能为365.为每个 row 选择365个连续每日数据点的滚动块，并为每个块对齐 index .
选择多年的数据，填补闰年的空白，并按 MM-DD 或一年中的天数对齐 .

以20年半的每日随机 data 作为模拟日平均温度：

index = pd.date_range(start=date(1995, 1,1), end=date(2015, 6, 30), freq='D')
df = pd.DataFrame(index=index, data=np.random.random(size=len(index)) * 30, columns=['temperature'])

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7486 entries, 1995-01-01 to 2015-06-30
Freq: D
Data columns (total 1 columns):
temperature    7486 non-null float64
dtypes: float64(1)
memory usage: 117.0 KB
None

df.head()

            temperature
1995-01-01     4.119212
1995-01-02    27.107131
1995-01-03    26.704931
1995-01-04     7.430203
1995-01-05     4.230398

df.tail()

            temperature
2015-06-26    10.902779
2015-06-27     8.494378
2015-06-28    17.800131
2015-06-29    19.543815
2015-06-30    16.390435

这是第一种方法的解决方案：

使用 .groupby(pd.TimeGrouper('365D')) 连续365天选择块，并将每个平均每个平均值的 groupby 对象返回 pd.DataFrame ，每个序列从 0 到 364 运行 integer index ：

aligned = df.groupby(pd.TimeGrouper(freq='365D')).apply(lambda x: pd.DataFrame(x.squeeze().tolist())) # .squeeze() converts single columns `DataFrame` to pd.Series

要对齐21个数据块，只需转置 pd.DataFrame ，它们将在 columns, with the start date of each sequence in the index . This operation will produce an extra index , and the last index中对齐 integer index将有一些丢失的数据 . 清理两者：

aligned.dropna().reset_index(-1, drop=True)

得到 [20 x 365] DataFrame 如下：

DatetimeIndex: 20 entries, 1995-01-01 to 2013-12-27
Freq: 365D
Columns: 365 entries, 0 to 364
dtypes: float64(365)
memory usage: 57.2 KB

                  0          1          2          3          4          5    \
1995-01-01  29.456090  25.313968   4.146206   5.347690  25.767425  11.978152   
1996-01-01  25.585481  26.846486   8.336905  16.749842   6.247542  17.723733   
1996-12-31  23.410462  10.168599   5.601917  11.996500   8.650726  23.362815   
1997-12-31   7.586873  23.882106  22.145595   3.287160  21.642547   1.949321   
1998-12-31  14.691420   3.611475  28.287327  25.347787  13.291708  20.571616   
1999-12-31  25.713866  17.588570  18.562117  19.420944  12.406293  11.870750   
2000-12-30   5.099561  17.894763  21.168223   4.786461  24.521417  21.443607   
2001-12-30  11.791223   8.352493  12.731769   0.459697  20.680396  27.554783   
2002-12-30   3.785876   0.359850  20.828764  15.376991  14.086626   0.477615   
2003-12-30  23.633243  12.726250   8.197824  16.355956   8.094145   1.410746   
2004-12-29   1.139949   4.161267   9.043062  14.109888  13.538735   1.566002   
2005-12-29  25.504224  19.346419   3.300641  26.933084  23.634321  18.323450   
2006-12-29  10.535785   9.168498  27.222106  11.962343  10.004678  23.893257   
2007-12-29  27.482856   6.910670   6.033291  12.673530  26.362971   4.492178   
2008-12-28  11.152316  25.233664  22.124299  11.012285   1.992814  25.542204   
2009-12-28  23.131021  16.363467   1.242393  10.387653   4.858851  26.553950   
2010-12-28  13.134843   9.195658  19.075850  28.539387   3.075934   8.089347   
2011-12-28  28.860275  10.121573   0.663906  19.687892  29.376377  11.488446   
2012-12-27   7.644073  19.649330  25.497595   6.592940   8.879444  17.733670   
2013-12-27  11.713996   2.602284   3.835302  22.244623  27.279810  14.144943   

                  6          7          8          9      ...            355  \
1995-01-01   8.210005   8.129146  28.798472  25.646924    ...      24.177163   
1996-01-01   0.481487  16.772357   3.934185  22.640157    ...      23.340931   
1996-12-31  10.813812  16.276504   3.422665  14.916229    ...      13.817015   
1997-12-31  19.184753  28.628326  22.134871  12.721064    ...      23.905483   
1998-12-31   2.839492   7.889141  17.951959  25.233585    ...      28.002751   
1999-12-31   6.958672  26.335427  23.361470   5.911806    ...       7.778412   
2000-12-30   8.405042  25.229016  19.746462  15.332004    ...       5.703830   
2001-12-30   0.558788  15.457327  20.987186  25.452723    ...      29.771372   
2002-12-30  19.002685  26.455754  25.468178  25.383786    ...      14.238987   
2003-12-30  22.984328  15.934398  25.361599  12.221306    ...       1.189949   
2004-12-29  22.121901  21.421103  26.175702  16.040881    ...      19.945408   
2005-12-29   2.557901  15.193412  27.049389   4.825570    ...       7.629859   
2006-12-29   8.582602  26.037375   0.933591  13.469771    ...      29.453932   
2007-12-29  29.437921  26.470153   9.917871  16.875801    ...       5.702116   
2008-12-28   3.809633  10.583385  18.029571   0.440077    ...      11.337894   
2009-12-28  24.406696  28.294553  19.929563   4.683991    ...      25.697446   
2010-12-28  29.765551  16.716723   6.467946  10.998447    ...      26.988863   
2011-12-28  28.962746  11.407137   9.957111   4.502521    ...      14.606937   
2012-12-27   1.374502   5.571244  11.212960   9.949830    ...      23.345868   
2013-12-27  26.373866   4.781510  16.828510  10.280078    ...       0.552726   

                  356        357        358        359        360        361  \
1995-01-01  13.511951  10.126835  28.121730  23.275360  11.785242  27.907039   
1996-01-01  13.362737  14.336780  24.114908  28.479688   8.509069  17.408937   
1996-12-31  19.192674   1.146844  27.499688   7.090407   2.777819  22.826814   
1997-12-31  21.502186  10.495148  21.786895  12.229181   8.068271   6.522108   
1998-12-31  21.338355  11.978265   9.186161  21.053924   3.033370  29.934703   
1999-12-31   5.960120  20.325684   0.915052  15.059979  12.194240  20.138567   
2000-12-30  11.883186   2.764768  27.324304  29.630706  21.852058  20.416199   
2001-12-30   7.802891  25.384479   9.044486   8.809446   7.606603   6.051890   
2002-12-30   7.362494   8.940783   5.259984   7.035818  24.094134   7.197113   
2003-12-30  25.596902   9.756372   6.345198   1.520188  22.752717   3.470268   
2004-12-29  26.789064   9.708466  18.287838  21.134643  29.862135  19.926086   
2005-12-29  26.398394  24.717514  16.606042  28.189245  24.574806  14.297410   
2006-12-29   8.795342  18.019536  16.579878  20.368811  22.052442  26.393676   
2007-12-29   8.696240  25.901889  16.410934  15.274897  14.365867  10.523388   
2008-12-28  18.581513  25.974784  21.025297  10.521118   5.864974   2.373023   
2009-12-28  14.437944  21.717456   4.017870  14.024522   0.959989  17.215403   
2010-12-28  11.426540  13.751451   4.664761  15.373878   7.731613   7.269089   
2011-12-28   1.952897   9.406866  28.957258  20.239517  11.156958  29.238761   
2012-12-27   7.588643  21.186675  17.348911   1.354323  13.918083   3.034123   
2013-12-27  22.916065   2.089675  22.832061  14.787841  25.697875  14.087893   

                  362        363        364  
1995-01-01  13.107523  10.740551  20.511825  
1996-01-01  25.016219  17.885332   2.438875  
1996-12-31  24.692327   0.221760   6.749919  
1997-12-31  24.856169   0.930019  22.603652  
1998-12-31  18.361414  13.587695  25.161495  
1999-12-31   0.512120  26.482288   1.035197  
2000-12-30  15.401012  28.334219   5.965014  
2001-12-30  10.292213  10.951915   8.270319  
2002-12-30  21.945734  27.076438   6.795688  
2003-12-30  14.788929  19.456459  11.216835  
2004-12-29   7.086443  25.463503  17.549196  
2005-12-29  12.252487  29.081547  25.507369  
2006-12-29   0.012617   0.086186  17.421958  
2007-12-29   4.191633  21.588891   7.516187  
2008-12-28  26.194288  20.500256  24.876032  
2009-12-28  28.445254  27.338754   7.849899  
2010-12-28  28.888573  26.801262  23.117027  
2011-12-28  19.871547  20.324514  18.369134  
2012-12-27  15.907752   9.417700   4.922940  
2013-12-27  21.132385  20.707216   5.288128  

[20 rows x 365 columns]

如果您只想收集数据的年份并按日期对齐，那么非闰年的缺失日期大约为60（而不是366），您可以：

df.groupby(pd.TimeGrouper(freq='A')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T).reset_index(-1, drop=True).iloc

                  0          1          2          3          4          5    \
1995-12-31   1.245796  28.487530   0.574299  10.033485  19.221512   8.718728   
1996-12-31  12.258653   3.864652  25.237088  13.982809  24.494746  13.822292   
1997-12-31  22.239412   4.796824  21.389404  11.151171  25.577368   1.754948   
1998-12-31  24.968287   2.089894  25.888487  28.291714  19.115844  24.426285   
1999-12-31   9.285363  19.339405  26.012193   3.243394  25.176499   8.766770   
2000-12-31  26.996573  26.404391   1.793644  21.314488  13.118279  26.703532   
2001-12-31  16.303829  14.021771  20.828238  11.427195   3.099290  18.730795   
2002-12-31  14.614617  10.694258   5.226033  24.900849  17.395822  22.154202   
2003-12-31  10.564132   8.267639   7.778573  26.704936   5.671499   0.470963   
2004-12-31  22.649623  15.725867  18.445629   7.529507  11.868134  10.965534   
2005-12-31   2.406615   9.709624  23.284616  11.479254  23.814725   1.656826   
2006-12-31  19.164459  23.177769  16.091672  28.936777  28.636072   4.838555   
2007-12-31  12.371377   3.417582  21.067689  25.493921  25.410295  15.526614   
2008-12-31  29.080385   4.653984  16.567333  24.248921  27.338538   9.353291   
2009-12-31  29.608734   6.046593  22.738628  22.631714  26.061903  21.217846   
2010-12-31  27.458254  15.146497  18.917073   8.473955  26.782767  10.891648   
2011-12-31  25.433759   8.959650  14.343507  16.249726  17.031174  12.944418   
2012-12-31  22.940797   4.791280  11.765939  25.925645   3.649440  27.483407   
2013-12-31  11.684391  27.701678  27.423083  27.656086   9.374896  14.250936   
2014-12-31  23.660098  27.768960  25.753294   3.014606  23.330226  17.570492   

                  6          7          8          9      ...            356  \
1995-12-31  17.079137  26.100763  12.376462  12.315219    ...      16.910185   
1996-12-31  26.718277  10.349412  12.940624   9.453769    ...      19.235435   
1997-12-31  20.201528  22.895552   1.443243  20.584140    ...      29.665815   
1998-12-31  21.493163  16.724328   5.946833  15.230762    ...       2.617883   
1999-12-31   9.776013  13.381424  11.028295   1.905501    ...       7.200409   
2000-12-31   9.773097  14.565345  22.578398   0.688273    ...      18.119020   
2001-12-31   1.095308  14.817514  25.652418   8.327481    ...      15.385689   
2002-12-31  29.744794  15.545211   6.373948  13.451261    ...       7.446414   
2003-12-31  14.971959  25.948332  21.596976   5.355589    ...      23.676867   
2004-12-31   0.604113   2.858745   0.120340  19.365223    ...       0.336213   
2005-12-31   6.260722   9.819337  19.573953  11.132919    ...      26.107100   
2006-12-31  10.341241  15.126506   3.349634  23.619127    ...      15.508680   
2007-12-31  20.033540  22.103483   7.674852   1.263726    ...      15.148461   
2008-12-31  28.233973  27.982105  17.037928   5.389418    ...       8.773618   
2009-12-31   4.400039   7.284556  11.825382   4.201001    ...       6.734423   
2010-12-31  26.086305  26.275027   8.069376  19.200344    ...      19.056528   
2011-12-31  29.215028   0.985623   4.813478   7.752540    ...      14.395423   
2012-12-31   4.690336   9.618306  25.492041  10.400292    ...       8.853903   
2013-12-31   8.227096  11.013431   0.996911  15.276574    ...      26.227540   
2014-12-31  23.440591  16.544698   2.263684   3.919315    ...      24.987387   

                  357        358        359        360        361        362  \
1995-12-31  24.791125  21.443534  21.092439   8.289222   9.745293  20.084046   
1996-12-31   2.632656   2.102163  24.828437  18.104255   7.951859   3.266873   
1997-12-31  11.246534  14.086539  29.635519  19.518642  24.086108   6.041870   
1998-12-31  29.961162   9.924863   9.401790  25.597344  13.885467  16.537406   
1999-12-31   3.057125  15.241720   8.472388   3.248545  11.302522  19.283612   
2000-12-31  22.999729  17.518504  10.058249   2.953903  10.167712  17.309525   
2001-12-31  18.267445  23.205300  25.658591  19.915797  10.704525  26.604965   
2002-12-31  11.497110   3.641206   9.693428  24.571510   6.438652  29.280098   
2003-12-31  23.931401  19.967615   0.307896   0.385782   0.579257   7.534806   
2004-12-31  21.321146   9.224362   1.703842   6.180944  28.173925   5.178336   
2005-12-31  17.990409  28.746179   2.524899  10.555224  25.487723  19.877390   
2006-12-31   9.748760  29.069966   1.717175   3.283069   9.615215  25.787787   
2007-12-31  29.772930  20.892030  16.597493  20.079373  17.320327   9.583089   
2008-12-31  22.787891  26.636413  13.872783  29.305847  21.287553   1.263788   
2009-12-31   1.574188  23.172773   0.967153   1.928999  12.201354   0.125939   
2010-12-31  20.566125   0.429552   4.413156  16.106451  27.745684  18.280928   
2011-12-31   9.348584   2.604338  23.397221   7.378340  16.757224  29.364973   
2012-12-31   4.704570   7.278321  19.034622  24.597784  13.694635  15.912901   
2013-12-31  21.657446  14.110146  23.976991   8.203509  20.083490   4.471119   
2014-12-31  14.465823   9.105391  15.984162   6.796756   8.232619  18.761280   

                  363        364        365  
1995-12-31  28.165022   9.735041        NaN  
1996-12-31  11.644543   4.139818   5.420238  
1997-12-31   2.500165  18.290531        NaN  
1998-12-31  23.856333  10.064951        NaN  
1999-12-31   3.090008  26.203395        NaN  
2000-12-31  22.216599  27.942821   0.791318  
2001-12-31  25.682003   4.766435        NaN  
2002-12-31  19.785159  28.972659        NaN  
2003-12-31  15.692168  21.388069        NaN  
2004-12-31   9.079675   7.392328  12.583179  
2005-12-31  18.202333  21.895494        NaN  
2006-12-31  20.951937  26.220226        NaN  
2007-12-31  23.603166  28.165377        NaN  
2008-12-31  20.532933   9.401494  25.296916  
2009-12-31   5.879644  10.377044        NaN  
2010-12-31   0.436284  20.875852        NaN  
2011-12-31  13.205290   6.832805        NaN  
2012-12-31  23.253155  17.760731  23.270751  
2013-12-31  19.807798   2.453238        NaN  
2014-12-31  12.817601  11.756561        NaN  

[20 rows x 366 columns]

回复于 2024-05-18T14:11:51+08:00

对齐多年的每日数据

3 回答

相关问题