Pandas: Time Series Applications

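pandas can resample data simply and efficiently through frequency conversion (for example, turning per-second data into five-minute data). This is common in financial applications, but not limited to them. For details, see the Time Series section of the official documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries. The official API reference covers further usage; this article tries to describe the time-related functions in as much detail as it can.

This article draws on: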

http://pyzh.readthedocs.io/en/latest/python-pandas.html

http://blog.csdn.net/ly_ysys629/article/details/73822716

http://www.codeweblog.com/pandas-%E6%97%B6%E9%97%B4%E5%BA%8F%E5%88%97%E6%93%8D%E4%BD%9C/

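1. Date components in the Python standard library

The Python standard library includes data types for dates and times. The datetime, time, and calendar modules are the ones used most often.

datetime stores date and time down to the microsecond, and datetime.timedelta represents the difference between two datetime objects.

Adding one or more timedelta objects to a datetime, or subtracting them from it, produces a new object: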

from datetime import datetime
from datetime import timedelta

now = datetime.now()
now

datetime constructor: datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

datetime.datetime(2017, 6, 27, 15, 56, 56, 167000)

delta = now - datetime(2017,6,27,10,10,10,10)

delta

datetime.timedelta(0, 20806, 166990)

print(delta.days)
print(delta.seconds)
print(delta.microseconds)

244
23259
604460

(These values come from a later run on 2018-02-26, so they differ from the timedelta shown above.)

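Data types in the datetime module

Type	Description
date	Stores a calendar date (year, month, day) in the Gregorian calendar
time	Stores a time of day as hours, minutes, seconds, and microseconds
datetime	Stores both date and time
timedelta	The difference between two datetime values (days, seconds, microseconds)

Subtracting one datetime object from another yields a timedelta object, which represents a span of time.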
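The arithmetic described above can be sketched as follows. The dates are made up for illustration; any datetime values behave the same way.

```python
from datetime import datetime, timedelta

# Illustrative values only.
start = datetime(2017, 6, 27, 10, 10, 10)

# Adding a timedelta returns a new datetime; `start` itself is unchanged.
later = start + timedelta(days=12, hours=2)

# Subtracting two datetimes returns a timedelta.
delta = later - start

print(later)                  # 2017-07-09 12:10:10
print(delta.days)             # 12
print(delta.seconds)          # 7200 (the part of the span shorter than one day)
print(delta.total_seconds())  # 1044000.0
```

Note that delta.seconds only holds the remainder below one day; total_seconds() gives the whole span.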

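Converting between strings and time objects in the standard library

A datetime object and its string representation can be converted in both directions. str() works, but the strftime() method (datetime to string) and datetime.strptime() (string to datetime) are preferred because they let you specify the format. Used together, they cover both directions.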

str(now)
now.strftime('%Y-%m-%d')
datetime.strptime('2010-01-01','%Y-%m-%d')

2018-02-26 16:37:49.604470
2018-02-26
2010-01-01 00:00:00
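A minimal round-trip sketch, assuming a made-up timestamp and format pattern: strftime turns a datetime into text, and strptime parses it back with the same pattern.

```python
from datetime import datetime

# Illustrative value; the pattern below is an arbitrary choice.
stamp = datetime(2010, 1, 1, 8, 30)

# datetime -> str: strftime formats according to the pattern.
text = stamp.strftime('%Y-%m-%d %H:%M')

# str -> datetime: strptime parses using the same pattern.
parsed = datetime.strptime(text, '%Y-%m-%d %H:%M')

print(text)             # 2010-01-01 08:30
print(parsed == stamp)  # True
```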

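Format codes such as %Y each carry a specific meaning, but writing them out is tedious. The parser.parse() function of the third-party dateutil package infers the format automatically instead. It can parse almost any format (which can also cause trouble).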

>>> from dateutil.parser import parse
>>> parse('01-02-2010',dayfirst=True)
datetime.datetime(2010, 2, 1, 0, 0)
>>> parse('01-02-2010')
datetime.datetime(2010, 1, 2, 0, 0)
>>> parse('55')
datetime.datetime(2055, 6, 17, 0, 0)

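2. Common pandas date components

The pandas Timestamp

The most basic date/time object in pandas is Timestamp, a subclass of the standard library's datetime. It is highly compatible with datetime objects, and pd.to_datetime() performs the conversion (usually from datetime to Timestamp):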

import numpy as np
import pandas as pd

print(pd.to_datetime(now))
print(pd.to_datetime(np.nan))

2018-02-26 16:47:39.324446
NaT

The most basic time series type in pandas is a Series whose index elements are Timestamps.

dates = [datetime(2011,1,1),datetime(2011,1,2),datetime(2011,1,3)]
ts = pd.Series(np.random.randn(3),index=dates)
print(ts)
print(type(ts))
print(ts.index) 
print(ts.index[0])

2011-01-01    0.112769
2011-01-02   -0.387973
2011-01-03    0.713669

dtype: float64
<class 'pandas.core.series.Series'>

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

2011-01-01 00:00:00

(Recall that the time index was mentioned in chapter 02.) Arithmetic between time series aligns automatically on time.
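To illustrate the alignment, here is a small sketch with made-up values: one series is added to a subset of itself, and the date missing from the subset comes out as NaN.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2011-01-01', periods=3)   # daily by default
ts = pd.Series([1.0, 2.0, 3.0], index=dates)

# Keep only the first and last day, then add to the full series.
partial = ts[::2]
total = ts + partial

print(total)
# Dates present in both series are summed; 2011-01-02 exists only in `ts`,
# so the result for that date is NaN.
```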

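Indexing, selection, and subsetting

A time series is just a Series with a special kind of index, so ordinary indexing operations still work. What sets it apart is the indexing optimized for time series, such as indexing with date strings in various formats: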

print(ts['20110101'])
print(ts['2011-01-01'])
print(ts['01/01/2011'])

0.112769454552
0.112769454552
0.112769454552

For longer series, you can also pass just a year, or a year and month, to select a slice:

print(ts)

# In the author's Python 3 environment this raises an error (ts has no data in 2012)
#print(ts['2012'])

print(ts['2011-1-2':'2012-12'])

2011-01-01    0.112769
2011-01-02   -0.387973
2011-01-03    0.713669
dtype: float64

2011-01-02   -0.387973
2011-01-03    0.713669
dtype: float64

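Besides string slicing, there is also an instance method: ts.truncate(after='2011-01-03').

Note that the string timestamps used for slicing need not exist in the index; for example, ts.truncate(before='3055') is given as valid, though dates past the year 2262 fall outside the range of nanosecond-resolution Timestamps and may be rejected by some pandas versions.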
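The three-row series above is too short to show year or month selection, so here is a sketch on a hypothetical two-year daily series. It uses .loc for the string lookups, which is a choice made for this example rather than something the text above requires.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering 2011 and 2012 (2012 is a leap year).
idx = pd.date_range('2011-01-01', '2012-12-31', freq='D')
long_ts = pd.Series(np.arange(len(idx)), index=idx)

year_2012 = long_ts.loc['2012']                    # every day in 2012
feb_2011 = long_ts.loc['2011-02']                  # every day in February 2011
window = long_ts.loc['2011-12-30':'2012-01-02']    # both endpoints included

# truncate drops everything after (or before) the given date.
head = long_ts.truncate(after='2011-01-03')

print(len(year_2012), len(feb_2011), len(window), len(head))  # 366 28 4 3
```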

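Date ranges, frequencies, and shifting

[ ] Generating date ranges

pd.date_range() generates a DatetimeIndex of a specified length. You can pass both a start and an end date, or a single date together with a periods count. The endpoints are inclusive.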

print(pd.date_range('20100101','20100110'))
print(pd.date_range(start='20100101',periods=10))
print(pd.date_range(end='20100110',periods=10))

DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')

DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')

DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-09', '2010-01-10'],
              dtype='datetime64[ns]', freq='D')

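By default, date_range generates daily time points. This can be changed with the freq argument; for example, "BM" stands for business month end (pandas 2.2 and later name this alias "BME").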

pd.date_range('20100101','20100601',freq='BM')

DatetimeIndex(['2010-01-29', '2010-02-26', '2010-03-31', '2010-04-30',
               '2010-05-31'],
              dtype='datetime64[ns]', freq='BM')

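[ ] Frequencies and date offsets

A frequency in pandas is made up of a base frequency and a multiplier. The base frequency is usually written as a string alias, such as "BM" in the example above. Each base frequency has a corresponding object called a date offset. You can create a frequency by instantiating a date offset, but in most cases the string aliases mentioned above are enough: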

pd.date_range('00:00','12:00',freq='1h20min')

DatetimeIndex(['2018-02-26 00:00:00', '2018-02-26 01:20:00',
               '2018-02-26 02:40:00', '2018-02-26 04:00:00',
               '2018-02-26 05:20:00', '2018-02-26 06:40:00',
               '2018-02-26 08:00:00', '2018-02-26 09:20:00',
               '2018-02-26 10:40:00', '2018-02-26 12:00:00'],
              dtype='datetime64[ns]', freq='80T')

The available aliases can be looked up with help() or in the documentation, so they are not listed here.
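As a sketch of the relationship between aliases and date offset objects, the snippet below builds the same 80-minute range twice, once from offset objects and once from the string alias. The date is fixed (an assumption for reproducibility) so the result does not depend on the day the code runs.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# An offset object equivalent to the '1h20min' alias: 1 hour + 20 minutes.
step = Hour(1) + Minute(20)

by_offset = pd.date_range('2018-02-26 00:00', '2018-02-26 12:00', freq=step)
by_alias = pd.date_range('2018-02-26 00:00', '2018-02-26 12:00', freq='1h20min')

print(len(by_offset))              # 10
print(by_offset.equals(by_alias))  # True
```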

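[ ] Shifting (leading and lagging) data

Shifting means moving data forward or backward along the time axis. Both Series and DataFrame have a .shift() method for plain shifts, which leaves the index unchanged: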

dates = [datetime(2011,1,1),datetime(2011,1,2),datetime(2011,1,3)]
ts = pd.Series(np.random.randn(3),index=dates)

print(ts)
print(ts.shift(2))
print(ts.shift(-2))

2011-01-01   -0.392333
2011-01-02    0.694372
2011-01-03   -1.038893
dtype: float64

2011-01-01         NaN
2011-01-02         NaN
2011-01-03   -0.392333
dtype: float64

2011-01-01   -1.038893
2011-01-02         NaN
2011-01-03         NaN
dtype: float64

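The shifts in the example above introduce NA values. Another way to shift is to move the index and keep the data unchanged. This requires an extra freq argument that specifies the frequency to shift by: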

print(ts.shift(2,freq='D'))
print(ts.shift(2,freq='3D'))

2011-01-03   -0.392333
2011-01-04    0.694372
2011-01-05   -1.038893
Freq: D, dtype: float64

2011-01-07   -0.392333
2011-01-08    0.694372
2011-01-09   -1.038893
Freq: D, dtype: float64
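A common use of shift is comparing each value with the previous one. The sketch below uses made-up prices to compute a day-over-day change, and also shows the freq variant moving the index labels instead of the values.

```python
import pandas as pd

dates = pd.date_range('2011-01-01', periods=3)
prices = pd.Series([100.0, 110.0, 99.0], index=dates)   # made-up values

# shift(1) moves each value one row later, so prices / prices.shift(1)
# divides each day's value by the previous day's value.
pct = prices / prices.shift(1) - 1

# With freq, the index labels move and the values stay where they are.
moved = prices.shift(1, freq='D')

print(pct)              # first row is NaN: there is no previous day
print(moved.index[0])   # 2011-01-02 00:00:00
```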

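[ ] Periods and period arithmetic

The period concept used in this section differs from the timestamp used earlier: it refers to a span of time. In practice, though, it is used in much the same way. The pd.Period constructor still takes a timestamp plus a freq argument. freq gives the length of the period, and the timestamp locates the period on the Gregorian timeline.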

p = pd.Period(2010,freq='M')
print(p)
print(p + 2)

2010-01
2010-03

In the example above, I passed the Period constructor a timestamp at year resolution and a "Month" freq, so pandas interpreted 2010 as 2010-01.

The period_range function creates regular ranges of periods:

pd.period_range('2010-01','2010-05',freq='M')

PeriodIndex(['2010-01', '2010-02', '2010-03', '2010-04', '2010-05'], dtype='period[M]', freq='M')

The PeriodIndex class holds a sequence of periods and can be used as an axis index in any pandas data structure:

pd.Series(np.random.randn(5),index=pd.period_range('201001','201005',freq='M'))

2010-01    0.173198
2010-02    0.713815
2010-03   -2.374650
2010-04   -0.456464
2010-05   -1.146443
Freq: M, dtype: float64

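[ ] Period frequency conversion

Period and PeriodIndex objects can both be converted to another frequency with their .asfreq(freq, how=...) method, where how selects the start or the end of the original period.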

>>> p = pd.Period('2007',freq='A-DEC')
>>> p.asfreq('M',how='start')
Period('2007-01', 'M')
>>> p.asfreq('M',how='end')
Period('2007-12', 'M')
>>> ts = pd.Series(np.random.randn(1),index=[p])
>>> ts
2007   -0.112347
Freq: A-DEC, dtype: float64
>>> ts.asfreq('M',how='start')
2007-01   -0.112347
Freq: M, dtype: float64

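[ ] Converting between timestamps and periods

Series and DataFrame objects indexed by timestamps or by periods have a pair of methods, .to_period() and .to_timestamp(how='start'), for converting the index from one type to the other. Because converting a period to a timestamp means choosing one end of the period, there is an extra how argument, which defaults to 'start':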

ts = pd.Series(np.random.randn(5),index=pd.period_range('201001','201005',freq='M'))
print(ts)
print(ts.to_timestamp())
print(ts.to_timestamp(how='end'))
print(ts.to_timestamp().to_period())
print(ts.to_timestamp().to_period('M'))

2010-01    1.149064
2010-02    0.644664
2010-03    1.182031
2010-04    0.547196
2010-05   -0.674974
Freq: M, dtype: float64

2010-01-01    1.149064
2010-02-01    0.644664
2010-03-01    1.182031
2010-04-01    0.547196
2010-05-01   -0.674974
Freq: MS, dtype: float64

2010-01-31    1.149064
2010-02-28    0.644664
2010-03-31    1.182031
2010-04-30    0.547196
2010-05-31   -0.674974
Freq: M, dtype: float64

2010-01    1.149064
2010-02    0.644664
2010-03    1.182031
2010-04    0.547196
2010-05   -0.674974
Freq: M, dtype: float64

2010-01    1.149064
2010-02    0.644664
2010-03    1.182031
2010-04    0.547196
2010-05   -0.674974
Freq: M, dtype: float64
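The conversions above can be checked with fixed values, as in the following sketch. One assumption to flag: recent pandas versions place how='end' at the last instant of the period rather than at midnight of the last day, so the sketch calls normalize() to compare dates only.

```python
import pandas as pd

periods = pd.period_range('2010-01', '2010-05', freq='M')
by_period = pd.Series(range(5), index=periods)

# Period index -> timestamp index, anchored at the start of each month.
by_start = by_period.to_timestamp(how='start')

# Anchored at the end of each month; normalize() drops the time of day.
by_end = by_period.to_timestamp(how='end')
end_dates = by_end.index.normalize()

# Timestamp index -> period index again.
round_trip = by_start.to_period('M')

print(by_start.index[0])                 # 2010-01-01 00:00:00
print(end_dates[1])                      # 2010-02-28 00:00:00
print(round_trip.index.equals(periods))  # True
```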

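[ ] Resampling and frequency conversion

Resampling is the process of converting a time series from one frequency to another. pandas objects have a .resample(freq, ...) method for this. Older pandas versions accepted the aggregation and fill options directly, as in .resample(freq, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0); current versions instead return a Resampler object, on which you call an aggregation or fill method such as .sum() or .ffill(). resample is mostly used when the frequency changes, in which case the operation is either upsampling or downsampling. The differences between the two show up in the arguments and the method applied.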

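print(ts)
# Upsampling: monthly to daily, forward-filling the new rows
print(ts.resample('D').ffill())
# Downsampling: to fiscal years ending in January, summing each year
print(ts.resample('A-JAN').sum())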
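Here is a self-contained sketch of both directions, using the Resampler-style calls of current pandas versions and the per-second to five-minute case mentioned in the introduction. The readings are made up, and the index is built with a Timedelta step to avoid depending on a particular alias spelling.

```python
import numpy as np
import pandas as pd

# Ten minutes of made-up per-second readings: 0, 1, ..., 599.
idx = pd.date_range('2012-01-01', periods=600, freq=pd.Timedelta(seconds=1))
per_second = pd.Series(np.arange(600), index=idx)

# Downsampling: group the seconds into 5-minute bins and sum each bin.
five_min = per_second.resample('5min').sum()

# Upsampling: go from 5-minute to 1-minute spacing, forward-filling the gaps.
one_min = five_min.resample('1min').ffill()

print(five_min)       # two bins: 44850 and 134850
print(len(one_min))   # 6
```

Downsampling needs an aggregation (here sum) because many rows collapse into one; upsampling needs a fill rule (here ffill) because new rows have no data of their own.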

[ ] Time zone conversion:

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts.tz_localize('US/Eastern')

2012-03-06 00:00:00-05:00    0.276882
2012-03-07 00:00:00-05:00   -0.326992
2012-03-08 00:00:00-05:00   -0.076750
2012-03-09 00:00:00-05:00    2.181456
2012-03-10 00:00:00-05:00    0.031686
Freq: D, dtype: float64
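tz_localize attaches a zone to naive timestamps; moving to a different zone afterwards is done with tz_convert. A minimal sketch, assuming the data was recorded in UTC and using the zone name 'America/New_York' (both are choices made for this example):

```python
import pandas as pd

rng = pd.date_range('2012-03-06 00:00', periods=5, freq='D')
naive = pd.Series(range(5), index=rng)

# Attach a time zone to the naive timestamps (the clock times stay the same).
utc = naive.tz_localize('UTC')

# Express the same instants in another zone; the data values are unchanged.
eastern = utc.tz_convert('America/New_York')

print(eastern.index[0])   # 2012-03-05 19:00:00-05:00
```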

Converting between time span representations:

rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
print(ts.to_period())

2012-01-31   -0.435240
2012-02-29    0.817496
2012-03-31   -0.307508
2012-04-30   -2.080925
2012-05-31   -0.155187
Freq: M, dtype: float64

2012-01   -0.435240
2012-02    0.817496
2012-03   -0.307508
2012-04   -2.080925
2012-05   -0.155187
Freq: M, dtype: float64

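Note: when called without arguments, to_period() infers the frequency from the index (M here, because the index has month-end frequency). to_period and to_timestamp convert in opposite directions.

*Appendix: datetime format codes

Code	Description
%Y	4-digit year
%y	2-digit year
%m	2-digit month [01, 12]
%d	2-digit day [01, 31]
%H	Hour (24-hour clock) [00, 23]
%I	Hour (12-hour clock) [01, 12]
%M	2-digit minute [00, 59]
%S	Second [00, 61] (allows for leap seconds)
%w	Weekday as an integer [0 (Sunday), 6]
%F	Shorthand for %Y-%m-%d, e.g. 2017-06-27
%D	Shorthand for %m/%d/%y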