Pandas advanced tutorial: time processing

Posted by powlouk on Mon, 11 Oct 2021 02:24:05 +0200

Brief introduction

Time is one of the most common data types in data processing. Besides NumPy's two native types, datetime64 and timedelta64, pandas also consolidates time-series functionality from a number of other Python libraries, such as scikits.timeseries.
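
The interactive examples throughout this post assume the usual imports (they are not shown in the original snippets):

import datetime

import numpy as np
import pandas as pd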

Time classification

There are four time types in pandas:

  1. Date times: a specific date and time, with optional time zone support. Similar to datetime.datetime in the standard library.
  2. Time deltas: an absolute duration. Similar to datetime.timedelta in the standard library.
  3. Time spans: a span of time defined by a point in time and its associated frequency.
  4. Date offsets: a relative duration that respects calendar arithmetic. Similar to dateutil.relativedelta.relativedelta.

The following table summarizes them:

| Type | Scalar class | Array class | pandas data type | Main creation method |
|---|---|---|---|---|
| Date times | Timestamp | DatetimeIndex | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range |
| Time deltas | Timedelta | TimedeltaIndex | timedelta64[ns] | to_timedelta or timedelta_range |
| Time spans | Period | PeriodIndex | period[freq] | Period or period_range |
| Date offsets | DateOffset | None | None | DateOffset |

Let's look at an example:

In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]: 
2000-01-01    0
2000-01-02    1
2000-01-03    2
Freq: D, dtype: int64

The null value for all of these types is NaT (not a time):

In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT

In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT

In [26]: pd.Period(pd.NaT)
Out[26]: NaT

# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False
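
Because NaT does not compare equal to itself, test for it with pd.isna rather than ==:

pd.isna(pd.NaT)   # True
pd.NaT is pd.NaT  # True - an identity check also works, since NaT is a singleton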

Timestamp

Timestamp is the most basic time type. We can create it as follows:

In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')

In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')

In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')

DatetimeIndex

Timestamps used as an index are automatically converted to a DatetimeIndex:

In [33]: dates = [
   ....:     pd.Timestamp("2012-05-01"),
   ....:     pd.Timestamp("2012-05-02"),
   ....:     pd.Timestamp("2012-05-03"),
   ....: ]
   ....: 

In [34]: ts = pd.Series(np.random.randn(3), dates)

In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex

In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)

In [37]: ts
Out[37]: 
2012-05-01    0.469112
2012-05-02   -0.282863
2012-05-03   -1.509059
dtype: float64

date_range and bdate_range

You can also use date_range to create a DatetimeIndex:

In [74]: start = datetime.datetime(2011, 1, 1)

In [75]: end = datetime.datetime(2012, 1, 1)

In [76]: index = pd.date_range(start, end)

In [77]: index
Out[77]: 
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')

date_range generates calendar days, while bdate_range generates business days (weekdays):

In [78]: index = pd.bdate_range(start, end)

In [79]: index
Out[79]: 
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')

Both functions accept start, end, and periods parameters; any two of them are enough (date_range can also take start, end, and periods together to generate evenly spaced timestamps):

In [84]: pd.bdate_range(end=end, periods=20)
In [83]: pd.date_range(start, end, freq="W")
In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)

origin

Use the origin parameter of to_datetime to change the reference date from which numeric values are counted:

In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

By default origin='unix', i.e. the reference point is 1970-01-01 00:00:00:

In [68]: pd.to_datetime([1, 2, 3], unit="D")
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)

format

Use the format parameter to specify how the string should be parsed:

In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[51]: Timestamp('2010-11-12 00:00:00')

In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[52]: Timestamp('2010-11-12 00:00:00')
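
to_datetime also accepts list-likes; with errors="coerce", entries that do not match the format become NaT (a small sketch, not from the original post):

pd.to_datetime(["2010/11/12", "not a date"], format="%Y/%m/%d", errors="coerce")
# DatetimeIndex(['2010-11-12', 'NaT'], dtype='datetime64[ns]', freq=None)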

Period

A Period represents a span of time and is usually created together with a freq:

In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')

In [32]: pd.Period("2012-05", freq="D")
Out[32]: Period('2012-05-01', 'D')

Arithmetic can be performed directly on a Period:

In [345]: p = pd.Period("2012", freq="A-DEC")

In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')

In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')

In [348]: p = pd.Period("2012-01", freq="2M")

In [349]: p + 2
Out[349]: Period('2012-05', '2M')

In [350]: p - 1
Out[350]: Period('2011-11', '2M')

Note that arithmetic on a Period only works with offsets and timedeltas that match its freq; an incompatible offset raises an error (see the sketch after the following examples).

In [352]: p = pd.Period("2014-07-01 09:00", freq="H")

In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')

In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')

In [355]: p + np.timedelta64(7200, "s")
Out[355]: Period('2014-07-01 11:00', 'H')
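
If the offset is not compatible with the Period's frequency, pandas raises an IncompatibleFrequency error (a ValueError subclass). A minimal sketch:

p = pd.Period("2014-07-01 09:00", freq="H")
try:
    p + pd.offsets.Minute(5)  # 5 minutes is not a multiple of the hourly freq
except ValueError as exc:
    print(type(exc).__name__)  # IncompatibleFrequency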

Periods used as an index are automatically converted to a PeriodIndex:

In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]

In [39]: ts = pd.Series(np.random.randn(3), periods)

In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex

In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')

In [42]: ts
Out[42]: 
2012-01   -1.135632
2012-02    1.212112
2012-03   -0.173215
Freq: M, dtype: float64

You can use the pd.period_range method to create a PeriodIndex:

In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")

In [360]: prng
Out[360]: 
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='period[M]', freq='M')

You can also construct one directly with PeriodIndex:

In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')

DateOffset

DateOffset represents a frequency (offset) object. It resembles Timedelta, which represents an absolute duration, but DateOffset follows calendar rules: a Timedelta day is always exactly 24 hours, whereas a DateOffset day may last 23, 24, or 25 hours across a daylight saving time transition.

# This particular day contains a daylight saving time transition
In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

# Respects absolute time
In [145]: ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

# Respects calendar time
In [146]: ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

In [147]: friday = pd.Timestamp("2018-01-05")

In [148]: friday.day_name()
Out[148]: 'Friday'

# Add 2 business days (Friday --> Tuesday)
In [149]: two_business_days = 2 * pd.offsets.BDay()

In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')

In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')

In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'

Each DateOffset has an associated frequency string. The table below lists the available date offsets and their frequency strings:

| Date Offset | Frequency string | Description |
|---|---|---|
| DateOffset | None | Generic offset class |
| BDay or BusinessDay | 'B' | Business day (weekday) |
| CDay or CustomBusinessDay | 'C' | Custom business day |
| Week | 'W' | One week, optionally anchored on a day of the week |
| WeekOfMonth | 'WOM' | The x-th day of the y-th week of each month |
| LastWeekOfMonth | 'LWOM' | The x-th day of the last week of each month |
| MonthEnd | 'M' | Calendar month end |
| MonthBegin | 'MS' | Calendar month begin |
| BMonthEnd or BusinessMonthEnd | 'BM' | Business month end |
| BMonthBegin or BusinessMonthBegin | 'BMS' | Business month begin |
| CBMonthEnd or CustomBusinessMonthEnd | 'CBM' | Custom business month end |
| CBMonthBegin or CustomBusinessMonthBegin | 'CBMS' | Custom business month begin |
| SemiMonthEnd | 'SM' | The 15th (or other day_of_month) and calendar month end |
| SemiMonthBegin | 'SMS' | The 15th (or other day_of_month) and calendar month begin |
| QuarterEnd | 'Q' | Calendar quarter end |
| QuarterBegin | 'QS' | Calendar quarter begin |
| BQuarterEnd | 'BQ' | Business quarter end |
| BQuarterBegin | 'BQS' | Business quarter begin |
| FY5253Quarter | 'REQ' | Retail quarter (aka 52-53 week) |
| YearEnd | 'A' | Calendar year end |
| YearBegin | 'AS' or 'BYS' | Calendar year begin |
| BYearEnd | 'BA' | Business year end |
| BYearBegin | 'BAS' | Business year begin |
| FY5253 | 'RE' | Retail year (aka 52-53 week) |
| Easter | None | Easter holiday |
| BusinessHour | 'BH' | Business hour |
| CustomBusinessHour | 'CBH' | Custom business hour |
| Day | 'D' | One absolute day |
| Hour | 'H' | One hour |
| Minute | 'T' or 'min' | One minute |
| Second | 'S' | One second |
| Milli | 'L' or 'ms' | One millisecond |
| Micro | 'U' or 'us' | One microsecond |
| Nano | 'N' | One nanosecond |
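
A frequency string and its offset object are generally interchangeable wherever a freq is accepted; for instance (a small sketch using the imports assumed above):

# 'M' is the frequency string for MonthEnd, so these two calls are equivalent
pd.date_range("2021-01-01", periods=3, freq="M")
pd.date_range("2021-01-01", periods=3, freq=pd.offsets.MonthEnd())
# DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31'], dtype='datetime64[ns]', freq='M')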

DateOffset also provides two methods, rollforward() and rollback(), to move a timestamp to the nearest valid offset date:

In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")

In [154]: ts.day_name()
Out[154]: 'Saturday'

# BusinessHour's valid offset dates are Monday through Friday
In [155]: offset = pd.offsets.BusinessHour(start="09:00")

# Bring the date to the closest offset date (Monday)
In [156]: offset.rollforward(ts)
Out[156]: Timestamp('2018-01-08 09:00:00')

# Date is brought to the closest offset date first and then the hour is added
In [157]: ts + offset
Out[157]: Timestamp('2018-01-08 10:00:00')
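
rollback() works in the opposite direction; for the same Saturday timestamp it should return the previous business-hour close (assuming BusinessHour's default end of 17:00):

offset.rollback(ts)
# Timestamp('2018-01-05 17:00:00')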

These operations preserve the hour, minute, and other time components. To reset the time to midnight (00:00:00), call normalize():

In [158]: ts = pd.Timestamp("2014-01-01 09:00")

In [159]: day = pd.offsets.Day()

In [160]: day.apply(ts)
Out[160]: Timestamp('2014-01-02 09:00:00')

In [161]: day.apply(ts).normalize()
Out[161]: Timestamp('2014-01-02 00:00:00')

In [162]: ts = pd.Timestamp("2014-01-01 22:00")

In [163]: hour = pd.offsets.Hour()

In [164]: hour.apply(ts)
Out[164]: Timestamp('2014-01-01 23:00:00')

In [165]: hour.apply(ts).normalize()
Out[165]: Timestamp('2014-01-01 00:00:00')

In [166]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
Out[166]: Timestamp('2014-01-02 00:00:00')

As index

Datetimes can be used as an index, and doing so enables some very convenient indexing behaviour.
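
The Series ts used in the lookups below is not constructed in this excerpt; judging from the output (business-month-end frequency over 2011), something along these lines would produce a comparable one:

rng = pd.date_range("2011-01-01", "2012-01-01", freq="BM")
ts = pd.Series(np.random.randn(len(rng)), index=rng)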

A date string can be used directly to look up the corresponding data:

In [99]: ts["1/31/2011"]
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]: 
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts["10/31/2011":"12/31/2011"]
Out[101]: 
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

Obtain annual data:

In [102]: ts["2011"]
Out[102]: 
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

Get data for a month:

In [103]: ts["2011-6"]
Out[103]: 
2011-06-30    1.071804
Freq: BM, dtype: float64

A DataFrame likewise accepts datetimes and datetime strings as arguments to loc:
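
The DataFrame dft is also not built in the excerpt; given the minute-frequency index in the output, a construction like the following (hypothetical) would match:

dft = pd.DataFrame(
    np.random.randn(100000, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100000, freq="T"),
)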

In [105]: dft
Out[105]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft.loc["2013"]
Out[106]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

Slicing with partial date strings also works:

In [107]: dft["2013-1":"2013-2"]
Out[107]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

Slice and match exactly

Consider the following Series object, whose index has minute resolution:

In [120]: series_minute = pd.Series(
   .....:     [1, 2, 3],
   .....:     pd.DatetimeIndex(
   .....:         ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
   .....:     ),
   .....: )
   .....: 

In [121]: series_minute.index.resolution
Out[121]: 'minute'

If the string is less precise than a minute (i.e. coarser), a Series is returned:

In [122]: series_minute["2011-12-31 23"]
Out[122]: 
2011-12-31 23:59:00    1
dtype: int64

If the string is precise to the minute or finer, a scalar is returned rather than a slice:

In [123]: series_minute["2011-12-31 23:59"]
Out[123]: 1

In [124]: series_minute["2011-12-31 23:59:00"]
Out[124]: 1

Similarly, for an index with second resolution, a string less precise than a second returns a Series, while a string precise to the second returns a scalar, as sketched below.
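
A quick sketch with a second-resolution index (the names here are illustrative, not from the original post):

series_second = pd.Series(
    [1, 2, 3],
    pd.DatetimeIndex(
        ["2011-12-31 23:59:59", "2012-01-01 00:00:00", "2012-01-01 00:00:01"]
    ),
)
series_second.index.resolution        # 'second'
series_second["2011-12-31 23:59"]     # coarser than a second -> returns a Series
series_second["2011-12-31 23:59:59"]  # exact second -> returns the scalar 1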

Operations on time series

Shifting

Use the shift method to shift a time series forward or backward:
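
The index rng used here is not defined in the excerpt; judging from the output, a short daily range such as the following would reproduce it:

rng = pd.date_range("2012-01-01", "2012-01-03", freq="D")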

In [275]: ts = pd.Series(range(len(rng)), index=rng)

In [276]: ts = ts[:5]

In [277]: ts.shift(1)
Out[277]: 
2012-01-01    NaN
2012-01-02    0.0
2012-01-03    1.0
Freq: D, dtype: float64

By specifying freq, the index is shifted instead of the data, so no missing values are introduced:

In [278]: ts.shift(5, freq="D")
Out[278]: 
2012-01-06    0
2012-01-07    1
2012-01-08    2
Freq: D, dtype: int64

In [279]: ts.shift(5, freq=pd.offsets.BDay())
Out[279]: 
2012-01-06    0
2012-01-09    1
2012-01-10    2
dtype: int64

In [280]: ts.shift(5, freq="BM")
Out[280]: 
2012-05-31    0
2012-05-31    1
2012-05-31    2
dtype: int64

Frequency conversion

The frequency of a time series can be converted by calling asfreq:

In [281]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())

In [282]: ts = pd.Series(np.random.randn(3), index=dr)

In [283]: ts
Out[283]: 
2010-01-01    1.494522
2010-01-06   -0.778425
2010-01-11   -0.253355
Freq: 3B, dtype: float64

In [284]: ts.asfreq(pd.offsets.BDay())
Out[284]: 
2010-01-01    1.494522
2010-01-04         NaN
2010-01-05         NaN
2010-01-06   -0.778425
2010-01-07         NaN
2010-01-08         NaN
2010-01-11   -0.253355
Freq: B, dtype: float64

asfreq also accepts a fill method to apply after the frequency change:

In [285]: ts.asfreq(pd.offsets.BDay(), method="pad")
Out[285]: 
2010-01-01    1.494522
2010-01-04    1.494522
2010-01-05    1.494522
2010-01-06   -0.778425
2010-01-07   -0.778425
2010-01-08   -0.778425
2010-01-11   -0.253355
Freq: B, dtype: float64

Resampling

A given time series can be resampled by calling the resample method:

In [286]: rng = pd.date_range("1/1/2012", periods=100, freq="S")

In [287]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [288]: ts.resample("5Min").sum()
Out[288]: 
2012-01-01    25103
Freq: 5T, dtype: int64

resample can accept various statistical methods, such as sum, mean, std, sem, max, min, median, first, last, ohlc.

In [289]: ts.resample("5Min").mean()
Out[289]: 
2012-01-01    251.03
Freq: 5T, dtype: float64

In [290]: ts.resample("5Min").ohlc()
Out[290]: 
            open  high  low  close
2012-01-01   308   460    9    205

In [291]: ts.resample("5Min").max()
Out[291]: 
2012-01-01    460
Freq: 5T, dtype: int64

This article has been included in http://www.flydean.com/15-python-pandas-time/

