Python for Data Analysis, Second Edition, Chapter 11 (Time Series) learning notes

Posted by polywog on Mon, 24 Jan 2022 06:24:04 +0100

1, Date and time data: types and tools

Python's standard-library modules for date, time, and calendar data are datetime, time, and calendar

**Data types in the datetime module** (`from datetime import *`):

| Type | Description |
| --- | --- |
| date | Stores a calendar date (year, month, day) using the Gregorian calendar |
| time | Stores a time of day as hours, minutes, seconds, and microseconds |
| datetime | Stores both a date and a time |
| timedelta | Represents the difference between two datetime values (days, seconds, microseconds) |
| tzinfo | Base type for storing time zone information |

datetime format specification (ISO C89 compatible):

| Code | Description |
| --- | --- |
| %Y | Four-digit year |
| %y | Two-digit year |
| %m | Two-digit month [01, 12] |
| %d | Two-digit day [01, 31] |
| %H | Hour (24-hour clock) [00, 23] |
| %I | Hour (12-hour clock) [01, 12] |
| %M | Two-digit minute [00, 59] |
| %S | Second [00, 61] (60 and 61 account for leap seconds) |
| %w | Weekday as an integer [0 (Sunday), 6] |
| %U | Week number of the year [00, 53]; Sunday is the first day of the week, and days before the first Sunday of the year are "week 0" |
| %W | Week number of the year [00, 53]; Monday is the first day of the week, and days before the first Monday of the year are "week 0" |
| %z | UTC time zone offset as +HHMM or -HHMM; empty if the object is time zone naive |
| %F | Shortcut for %Y-%m-%d (e.g., 2021-06-24) |
| %D | Shortcut for %m/%d/%y (e.g., 06/24/21) |

Locale-specific datetime formatting options:

| Code | Description |
| --- | --- |
| %a | Abbreviated weekday name |
| %A | Full weekday name |
| %b | Abbreviated month name |
| %B | Full month name |
| %c | Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM') |
| %p | Locale equivalent of AM or PM |
| %x | Locale-appropriate formatted date (e.g., in the US '05/01/2012' means May 1, 2012) |
| %X | Locale-appropriate time (e.g., '04:24:12 PM') |
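A few of the codes above combined in strftime format strings, as a minimal sketch:

```python
from datetime import datetime

dt = datetime(2012, 5, 1, 16, 20, 57)
date_part = dt.strftime('%Y-%m-%d')   # four-digit year, zero-padded month/day
clock_12h = dt.strftime('%I:%M')      # 12-hour clock
weekday = dt.strftime('%w')           # 0 is Sunday; May 1, 2012 was a Tuesday
print(date_part, clock_12h, weekday)
```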
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
from datetime import datetime
now = datetime.now()
now

'''datetime.datetime(2021, 6, 24, 16, 47, 28, 879956)'''

now.year, now.month, now.day

'''(2021, 6, 24)'''

# timedelta represents the time difference between two datetime objects
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

'''datetime.timedelta(days=926, seconds=56700)'''

delta.days

'''926'''

delta.seconds

'''56700'''

# Adding a timedelta to (or subtracting one from) a datetime produces a new datetime object
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

'''datetime.datetime(2011, 1, 19, 0, 0)'''

start - 2 * timedelta(12)

'''datetime.datetime(2010, 12, 14, 0, 0)'''
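timedelta also accepts keyword arguments, which read more clearly than the bare positional day count used above; a small sketch:

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)
# days, hours, minutes, etc. can be mixed; timedelta normalizes them
delta = timedelta(days=1, hours=12)
shifted = start + delta
seconds = delta.total_seconds()  # flatten the whole span into seconds
print(shifted, seconds)
```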

Converting between strings and datetime

# Format datetime objects (and pandas Timestamp objects) with str() or with strftime() and a format string
stamp = datetime(2011, 1, 3)
str(stamp)

''''2011-01-03 00:00:00''''

stamp.strftime('%Y-%m-%d')

''''2011-01-03''''

# datetime.strptime() converts a string to a date using format codes
# It is a good way to parse dates when the format is known in advance
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

'''datetime.datetime(2011, 1, 3, 0, 0)'''

datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

'''[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]'''

# For common date formats, use the parse method from dateutil's parser module
# dateutil can parse most human-readable date representations
from dateutil.parser import parse
parse('2011-01-03')

'''datetime.datetime(2011, 1, 3, 0, 0)'''

parse('Jan 31, 1997 10:45 PM')

'''datetime.datetime(1997, 1, 31, 22, 45)'''

# Pass dayfirst=True to parse dates where the day appears before the month
parse('6/12/2011', dayfirst=True)

'''datetime.datetime(2011, 12, 6, 0, 0)'''
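When the day/month order is known in advance, an explicit format removes the ambiguity that dayfirst resolves heuristically; a sketch:

```python
import pandas as pd

# '6/12/2011' is ambiguous; the format string makes the intent explicit
day_first = pd.to_datetime('6/12/2011', format='%d/%m/%Y')
month_first = pd.to_datetime('6/12/2011', format='%m/%d/%Y')
print(day_first, month_first)
```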

# pd.to_datetime() can convert many different date formats
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)

'''DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)'''

# pd.to_datetime() can also handle values that are considered missing values (None, empty strings, etc.)
idx = pd.to_datetime(datestrs + [None])
idx

'''DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)'''

idx[2]  # NaT (Not a Time) is pandas' null value for timestamp data

'''NaT'''

pd.isnull(idx)

'''array([False, False,  True])'''
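Once NaT entries are flagged, they can be filtered out with the same boolean tools; a minimal sketch:

```python
import pandas as pd

idx = pd.to_datetime(['2011-07-06 12:00:00', '2011-08-06 00:00:00', None])
# pd.notnull is the complement of pd.isnull; use it as a boolean mask
clean = idx[pd.notnull(idx)]
# DatetimeIndex.dropna() gives the same result in one call
same = idx.dropna()
print(clean)
```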

2, Time series basics

The most basic time series type in pandas is a Series indexed by timestamps;

outside of pandas, those timestamps are often represented as Python strings or datetime objects

from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

# Basic time series in pandas: Series indexed by timestamp
ts = pd.Series(np.random.randn(6), index=dates)
ts

'''
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# These datetime objects are actually placed in a DatetimeIndex
ts.index

'''
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
'''
# Like other Series, arithmetic between time series with different indexes automatically aligns on the dates
ts + ts[::2]

'''
2011-01-02   -0.409415
2011-01-05         NaN
2011-01-07   -1.038877
2011-01-08         NaN
2011-01-10    3.931561
2011-01-12         NaN
dtype: float64
'''

# pandas stores timestamps using NumPy's datetime64 dtype at nanosecond resolution
ts.index.dtype

'''dtype('<M8[ns]')'''

# Each scalar value in DatetimeIndex is the Timestamp object of pandas
stamp = ts.index[0]
stamp

'''Timestamp('2011-01-02 00:00:00')'''

Indexing, selection, subsetting

# Label-based indexing and selection on a time series behave like any other pandas Series
stamp = ts.index[2]
ts[stamp]

'''-0.5194387150567381'''

# You can pass any string interpretable as a date; '20110110' works too
ts['1/10/2011']

'''1.9657805725027142'''
# For longer time series, a string interpretable as a year or year-month selects a slice of data
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

'''
2000-01-01    0.092908
2000-01-02    0.281746
2000-01-03    0.769023
2000-01-04    1.246435
2000-01-05    1.007189
                ...   
2002-09-22    0.930944
2002-09-23   -0.811676
2002-09-24   -1.830156
2002-09-25   -0.138730
2002-09-26    0.334088
Freq: D, Length: 1000, dtype: float64
'''

longer_ts['2001']

'''
2001-01-01    1.599534
2001-01-02    0.474071
2001-01-03    0.151326
2001-01-04   -0.542173
2001-01-05   -0.475496
                ...   
2001-12-27    0.057874
2001-12-28   -0.433739
2001-12-29    0.092698
2001-12-30   -1.397820
2001-12-31    1.457823
Freq: D, Length: 365, dtype: float64
'''

longer_ts['2001-05']

'''
2001-05-01   -0.622547
2001-05-02    0.936289
2001-05-03    0.750018
2001-05-04   -0.056715
2001-05-05    2.300675
                ...   
2001-05-27    0.235477
2001-05-28    0.111835
2001-05-29   -1.251504
2001-05-30   -2.949343
2001-05-31    0.634634
Freq: D, Length: 31, dtype: float64
'''
ts

'''
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# Slicing with datetime objects works as well
ts[datetime(2011, 1, 7):]

'''
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
'''

# Timestamps not present in the series can also be used for slicing, i.e. a range query
ts['1/6/2011':'1/11/2011']

'''
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
'''

Slicing with strings, datetime objects, or timestamps in this way produces views on the source time series, just like slicing NumPy arrays; no data is copied, and modifications to the slice are reflected in the original data
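Because a slice may share memory with the source series, calling .copy() on it is the safe way to get independent data when you intend to modify it; a minimal sketch:

```python
import pandas as pd
from datetime import datetime

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7)]
ts_demo = pd.Series([1.0, 2.0, 3.0], index=dates)
# The slice may share memory with ts_demo; .copy() guarantees independent data
snippet = ts_demo['2011-01-05':].copy()
snippet[:] = 0.0
print(ts_demo['2011-01-05'])  # still 2.0: the copy shields the original
```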

# The equivalent instance method truncate can slice Series between two dates
ts.truncate(before='1/5/2011', after='1/10/2011')

'''
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
'''
# The above operations also apply to DataFrame
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])
long_df.loc['5-2001']

Time series with duplicate indexes

# Multiple observations can fall on the same timestamp, giving a time series with duplicate indexes
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

'''
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32
'''

# Check whether the index is unique via its is_unique property
dup_ts.index.is_unique

'''False'''
# When indexing a series with a duplicated index, whether you get a scalar or a Series slice depends on whether the timestamp is duplicated
dup_ts['1/3/2000']  # not duplicated

'''4'''

dup_ts['1/2/2000']  # duplicated

'''
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
'''
# To aggregate data with duplicate indexes, pass level=0 to groupby
grouped = dup_ts.groupby(level=0)
grouped.mean()

'''
2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
'''

grouped.count()

'''
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
'''

3, Date ranges, frequencies, and shifting

Generating date ranges

# pd.date_range() generates a DatetimeIndex of a given length at a particular frequency
# Daily timestamps are generated by default
index = pd.date_range('2012-04-01', '2012-06-01')
index

'''
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
'''
# If you pass only a start or end date, you must supply the number of periods to generate
pd.date_range(start='2012-04-01', periods=20)

'''
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')
'''

pd.date_range(end='2012-06-01', periods=20)

'''
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
'''
# Passing a frequency to freq generates a DatetimeIndex of dates falling within the specified range
pd.date_range('2000-01-01', '2000-12-01', freq='BM')

'''
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
'''
# pd.date_range() preserves the time of the start or end timestamp by default
# Passing normalize=True normalizes the timestamps to midnight, i.e. the time component is removed
pd.date_range('2012-05-02 12:56:31', periods=5)

'''
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
'''

pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)

'''
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')
'''
Base time series frequencies:

| Alias | Offset type | Description |
| --- | --- | --- |
| D | Day | Calendar day |
| B | BusinessDay | Business day |
| H | Hour | Hourly |
| T or min | Minute | Minutely |
| S | Second | Secondly |
| L or ms | Milli | Millisecond (1/1,000 of a second) |
| U | Micro | Microsecond (1/1,000,000 of a second) |
| M | MonthEnd | Last calendar day of the month |
| BM | BusinessMonthEnd | Last business day of the month |
| MS | MonthBegin | First calendar day of the month |
| BMS | BusinessMonthBegin | First business day of the month |
| W-MON, W-TUE, ... | Week | Weekly on the given day of the week (MON, TUE, WED, THU, FRI, SAT, or SUN) |
| WOM-1MON, WOM-2MON, ... | WeekOfMonth | Weekly dates in the first, second, third, or fourth week of the month (e.g., WOM-3FRI is the third Friday of each month) |
| Q-JAN, Q-FEB, ... | QuarterEnd | Quarterly dates anchored on the last calendar day of each quarter's final month, for a year ending in the given month (JAN, FEB, ..., DEC) |
| BQ-JAN, BQ-FEB, ... | BusinessQuarterEnd | Quarterly dates anchored on the last business day, for a year ending in the given month |
| QS-JAN, QS-FEB, ... | QuarterBegin | Quarterly dates anchored on the first calendar day, for a year ending in the given month |
| BQS-JAN, BQS-FEB, ... | BusinessQuarterBegin | Quarterly dates anchored on the first business day, for a year ending in the given month |
| A-JAN, A-FEB, ... | YearEnd | Annual dates anchored on the last calendar day of the given month (JAN, FEB, ..., DEC) |
| BA-JAN, BA-FEB, ... | BusinessYearEnd | Annual dates anchored on the last business day of the given month |
| AS-JAN, AS-FEB, ... | YearBegin | Annual dates anchored on the first calendar day of the given month |
| BAS-JAN, BAS-FEB, ... | BusinessYearBegin | Annual dates anchored on the first business day of the given month |
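Two of these aliases in action (note that recent pandas releases rename a few aliases, e.g. 'M' becomes 'ME', but 'W-MON' and 'MS' are unchanged); a sketch:

```python
import pandas as pd

# 'W-MON' generates weekly dates on Mondays; 'MS' gives month starts
mondays = pd.date_range('2000-01-01', periods=3, freq='W-MON')
month_starts = pd.date_range('2000-01-01', periods=3, freq='MS')
print(mondays)
print(month_starts)
```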

Frequencies and date offsets

# For each base frequency there is an object defining the date offset
# For example, hourly frequency is represented by the Hour class
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour

'''<Hour>'''

# Pass an integer to define a multiple of the offset
four_hours = Hour(4)
four_hours

'''<4 * Hours>'''

pd.date_range('2000-01-01', '2000-01-03 23:59', freq='4h')

'''
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
'''

# Multiple offsets can be combined by addition
Hour(2) + Minute(30)

'''<150 * Minutes>'''

pd.date_range('2000-01-01', periods=10, freq='1h30min')

'''
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
'''

# Pass freq='WOM-3FRI' to get the third Friday of each month within the date range
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
list(rng)

'''
[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
'''

Shifting dates forward and backward

# Shift refers to moving the data in the time series forward or backward in time
# It is implemented through the shift() method of Series or DataFrame
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

'''
2000-01-31   -0.066748
2000-02-29    0.838639
2000-03-31   -0.117388
2000-04-30   -0.517795
Freq: M, dtype: float64
'''

# Shifting data forward introduces missing values at the start
ts.shift(2)

'''
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -0.066748
2000-04-30    0.838639
Freq: M, dtype: float64
'''

# Shifting data backward will introduce missing values at the end
ts.shift(-2)

'''
2000-01-31   -0.117388
2000-02-29   -0.517795
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64
'''

shift() is often used to compute percentage changes in a time series or in the columns of a DataFrame, i.e. ts / ts.shift(1) - 1
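The formula above as a tiny sketch (pandas also offers the equivalent built-in pct_change() method):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0],
                   index=pd.date_range('2000-01-01', periods=3, freq='D'))
# Percent change between consecutive observations; the first entry has
# no predecessor, so it comes out as NaN
returns = prices / prices.shift(1) - 1
print(returns)
```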

# Passing freq='M' to shift moves the timestamps instead of the data
# Here each timestamp advances by two months
# Note the difference between shifting the data forward and shifting the timestamps forward
ts.shift(2, freq='M')

'''
2000-03-31   -0.066748
2000-04-30    0.838639
2000-05-31   -0.117388
2000-06-30   -0.517795
Freq: M, dtype: float64
'''

# With freq='D', each timestamp advances by 3 days
ts.shift(3, freq='D')

'''
2000-02-03   -0.066748
2000-03-03    0.838639
2000-04-03   -0.117388
2000-05-03   -0.517795
dtype: float64
'''

# With freq='90T', each timestamp advances by 90 minutes
ts.shift(1, freq='90T')

'''
2000-01-31 01:30:00   -0.066748
2000-02-29 01:30:00    0.838639
2000-03-31 01:30:00   -0.117388
2000-04-30 01:30:00   -0.517795
dtype: float64
'''

Shifting dates with offsets

# pandas date offsets can also be applied to datetime or Timestamp objects
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()

'''Timestamp('2011-11-20 00:00:00')'''

# Anchored offsets such as MonthEnd or BusinessMonthEnd roll the date forward to the next anchor date
now + MonthEnd()

'''Timestamp('2011-11-30 00:00:00')'''

now + MonthEnd(2)

'''Timestamp('2011-12-31 00:00:00')'''

# Anchored offsets can roll dates forward or backward explicitly via rollforward and rollback
offset = MonthEnd()
offset.rollforward(now)

'''Timestamp('2011-11-30 00:00:00')'''

offset.rollback(now)

'''Timestamp('2011-10-31 00:00:00')'''
# A creative use of date offsets with groupby
ts = pd.Series(np.random.randn(20),
               index=pd.date_range('1/15/2000', periods=20, freq='4d'))
ts

'''
2000-01-15   -0.116696
2000-01-19    2.389645
2000-01-23   -0.932454
2000-01-27   -0.229331
2000-01-31   -1.140330
2000-02-04    0.439920
2000-02-08   -0.823758
2000-02-12   -0.520930
2000-02-16    0.350282
2000-02-20    0.204395
2000-02-24    0.133445
2000-02-28    0.327905
2000-03-03    0.072153
2000-03-07    0.131678
2000-03-11   -1.297459
2000-03-15    0.997747
2000-03-19    0.870955
2000-03-23   -0.991253
2000-03-27    0.151699
2000-03-31    1.266151
Freq: 4D, dtype: float64
'''

# groupby applies offset.rollforward to each timestamp in the index
ts.groupby(offset.rollforward).mean()

'''
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
dtype: float64
'''

# The resample() method can achieve the same effect
ts.resample('M').mean()

'''
2000-01-31   -0.005833
2000-02-29    0.015894
2000-03-31    0.150209
Freq: M, dtype: float64
'''

4, Time zone handling

# Time zones are usually expressed as offsets from UTC; pandas obtains time zone information from the third-party pytz library
import pytz
pytz.common_timezones[-5:]

'''['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']'''

# pytz.timezone() gets pytz's time zone object
tz = pytz.timezone('America/New_York')
tz

'''<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>'''

Time zone localization and conversion

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64
'''

# The index's tz attribute returns its time zone (None if naive)
print(ts.index.tz)

'''None'''

# Date ranges can be generated with a time zone already set
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

'''
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
'''
ts

'''
2012-03-09 09:30:00   -0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00   -0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64
'''

# tz_localize() converts a naive time series to a time zone-localized one
ts_utc = ts.tz_localize('UTC')
ts_utc

'''
2012-03-09 09:30:00+00:00   -0.202469
2012-03-10 09:30:00+00:00    0.050718
2012-03-11 09:30:00+00:00    0.639869
2012-03-12 09:30:00+00:00    0.597594
2012-03-13 09:30:00+00:00   -0.797246
2012-03-14 09:30:00+00:00    0.472879
Freq: D, dtype: float64
'''

ts_utc.index

'''
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
'''

# Once localized, a time series can be converted to another time zone with tz_convert()
ts_utc.tz_convert('America/New_York')

'''
2012-03-09 04:30:00-05:00   -0.202469
2012-03-10 04:30:00-05:00    0.050718
2012-03-11 05:30:00-04:00    0.639869
2012-03-12 05:30:00-04:00    0.597594
2012-03-13 05:30:00-04:00   -0.797246
2012-03-14 05:30:00-04:00    0.472879
Freq: D, dtype: float64
'''

# tz_localize() can also localize directly to the New York time zone
ts_eastern = ts.tz_localize('America/New_York')
ts_eastern.tz_convert('UTC')

'''
2012-03-09 14:30:00+00:00   -0.202469
2012-03-10 14:30:00+00:00    0.050718
2012-03-11 13:30:00+00:00    0.639869
2012-03-12 13:30:00+00:00    0.597594
2012-03-13 13:30:00+00:00   -0.797246
2012-03-14 13:30:00+00:00    0.472879
dtype: float64
'''

ts_eastern.tz_convert('Europe/Berlin')

'''
2012-03-09 15:30:00+01:00   -0.202469
2012-03-10 15:30:00+01:00    0.050718
2012-03-11 14:30:00+01:00    0.639869
2012-03-12 14:30:00+01:00    0.597594
2012-03-13 14:30:00+01:00   -0.797246
2012-03-14 14:30:00+01:00    0.472879
dtype: float64
'''

# tz_localize() and tz_convert() are also instance methods of DatetimeIndex
ts.index.tz_localize('Asia/Shanghai')

'''
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)
'''

Operations with time zone-aware Timestamp objects

# Individual Timestamp objects can likewise be localized and converted between time zones
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')

'''Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')'''

# Pass time zone when creating Timestamp object
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
stamp_moscow

'''Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')'''

# A time zone-aware Timestamp stores its UTC value as nanoseconds since the Unix epoch (1970-01-01)
stamp_utc.value

'''1299902400000000000'''

# The underlying UTC nanosecond value is unchanged by time zone conversion
stamp_utc.tz_convert('America/New_York').value

'''1299902400000000000'''
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-12 01:30', tz='US/Eastern')
stamp

'''Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')'''

# Adding Hour() adds one hour of absolute (UTC) time, respecting DST transitions
stamp + Hour()

'''Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')'''

stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
stamp

'''Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')'''

stamp + 2 * Hour()

'''Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')'''
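To make the fall-back example above concrete, the two instants really are two hours apart in absolute time even though the wall-clock offset changes; a sketch:

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# Offset arithmetic is done in absolute (UTC) time, so crossing the
# fall-back DST boundary changes the wall-clock offset, not the elapsed time
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')
later = stamp + 2 * Hour()
elapsed = later.tz_convert('UTC') - stamp.tz_convert('UTC')
print(later, elapsed)
```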

Operations between different time zones

# When series with different time zones are combined, the timestamps are converted to UTC and the result is in UTC
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2012-03-07 09:30:00    0.522356
2012-03-08 09:30:00   -0.546348
2012-03-09 09:30:00   -0.733537
2012-03-12 09:30:00    1.302736
2012-03-13 09:30:00    0.022199
2012-03-14 09:30:00    0.364287
2012-03-15 09:30:00   -0.922839
2012-03-16 09:30:00    0.312656
2012-03-19 09:30:00   -1.128497
2012-03-20 09:30:00   -0.333488
Freq: B, dtype: float64
'''

ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index

'''
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
'''

5, Periods and period arithmetic

The Period class represents a time span, such as a number of days, months, quarters, or years

# This Period object represents the full span from January 1, 2007 to December 31, 2007
p = pd.Period(2007, freq='A-DEC')
p

'''Period('2007', 'A-DEC')'''

# Adding or subtracting integers shifts the period by its own frequency ('A-DEC')
p + 5

'''Period('2012', 'A-DEC')'''

p - 2

'''Period('2005', 'A-DEC')'''

# The difference between two periods with the same frequency is a multiple of that frequency
pd.Period('2014', freq='A-DEC') - p

'''<7 * YearEnds: month=12>'''
# pd.period_range() constructs regular sequences of periods
rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
rng

'''PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')'''

# The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure
pd.Series(np.random.randn(6), index=rng)

'''
2000-01   -0.514551
2000-02   -0.559782
2000-03   -0.783408
2000-04   -1.797685
2000-05   -0.172670
2000-06    0.680215
Freq: M, dtype: float64
'''
# pd.PeriodIndex() constructs a PeriodIndex, e.g. from an array of strings
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')
index

'''PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')'''

Period frequency conversion

# Create an annual period with December as the year-end month
p = pd.Period('2007', freq='A-DEC')
p

'''Period('2007', 'A-DEC')'''

# asfreq() converts Period and PeriodIndex objects to other frequencies
p.asfreq('M', how='start')

'''Period('2007-01', 'M')'''

p.asfreq('M', how='end')

'''Period('2007-12', 'M')'''
# Create an annual period with June as the year-end month
# Here the span from July 2006 through June 2007 counts as fiscal year 2007
p = pd.Period('2007', freq='A-JUN')
p

'''Period('2007', 'A-JUN')'''

# how='start' returns the first month of fiscal 2007
# Since June ends the fiscal year, that first month is 2006-07
p.asfreq('M', 'start')

'''Period('2006-07', 'M')'''

p.asfreq('M', 'end')

'''Period('2007-06', 'M')'''
# The above converts from low to high frequency; high-to-low conversion (e.g., month to year) also works
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')
# With 'A-JUN' the fiscal year ends in June, so 2007-08 belongs to fiscal 2008

'''Period('2008', 'A-JUN')'''

p = pd.Period('Aug-2007', 'M')
p.asfreq('A-SEP')
# With 'A-SEP' the fiscal year ends in September, so 2007-08 belongs to fiscal 2007

'''Period('2007', 'A-SEP')'''
# Whole PeriodIndex objects and period-indexed time series can be converted the same way
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2006    1.607578
2007    0.200381
2008   -0.834068
2009   -0.302988
Freq: A-DEC, dtype: float64
'''

ts.asfreq('M', how='start')

'''
2006-01    1.607578
2007-01    0.200381
2008-01   -0.834068
2009-01   -0.302988
Freq: M, dtype: float64
'''

ts.asfreq('B', how='end')

'''
2006-12-29    1.607578
2007-12-31    0.200381
2008-12-31   -0.834068
2009-12-31   -0.302988
Freq: B, dtype: float64
'''

Quarterly period frequencies

# Quarterly data is often reported relative to a fiscal year end, so a label like 2012Q4 can mean different things
# With freq='Q-JAN', the fiscal year ends in January
# So 2012Q4 covers the three months from November 2011 through January 2012
p = pd.Period('2012Q4', freq='Q-JAN')
p

'''Period('2012Q4', 'Q-JAN')'''

p.asfreq('D', 'start')

'''Period('2011-11-01', 'D')'''

p.asfreq('D', 'end')

'''Period('2012-01-31', 'D')'''

# Get the timestamp of 4 PM on the second-to-last business day of the quarter
# p.asfreq('B', 'e') - 1 gives the second-to-last business day of the quarter
# (p.asfreq('B', 'e') - 1).asfreq('T', 's') converts that day to minute frequency, starting at midnight
# Adding 16 * 60 minutes then moves it to 4 PM
p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

'''Period('2012-01-30 16:00', 'T')'''

# Use to_timestamp() to convert the period to a Timestamp
p4pm.to_timestamp()

'''Timestamp('2012-01-30 16:00:00')'''
# pd.period_range() is used to generate quarterly series
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)), index=rng)
ts

'''
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
'''

# Get the timestamp of 4:00 p.m. on the penultimate business day of the quarter
new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

'''
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32
'''

Convert timestamp to interval (and inverse conversion)

rng = pd.date_range('2000-01-01', periods=3, freq='M')
ts = pd.Series(np.random.randn(3), index=rng)
ts

'''
2000-01-31    1.663261
2000-02-29   -0.996206
2000-03-31    1.521760
Freq: M, dtype: float64
'''

# Series and DataFrames with a timestamp index can be converted to periods with the to_period() method
pts = ts.to_period()
pts

'''
2000-01    1.663261
2000-02   -0.996206
2000-03    1.521760
Freq: M, dtype: float64
'''
rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = pd.Series(np.random.randn(6), index=rng)
ts2

'''
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31   -0.654040
2000-02-01    2.089154
2000-02-02   -0.060220
2000-02-03   -0.167933
Freq: D, dtype: float64
'''

# After conversion to periods, the index is allowed to contain duplicate periods
ts2.to_period('M')

'''
2000-01    0.244175
2000-01    0.423331
2000-01   -0.654040
2000-02    2.089154
2000-02   -0.060220
2000-02   -0.167933
Freq: M, dtype: float64
'''
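Duplicate periods are still usable for selection; a self-contained sketch of the conversion above (same dates, but with integer values for clarity):

```python
import numpy as np
import pandas as pd

# A duplicated PeriodIndex still supports label-based selection:
# selecting one period returns every row that falls in it
rng = pd.date_range('2000-01-29', periods=6, freq='D')
s = pd.Series(np.arange(6), index=rng)
monthly = s.to_period('M')
print(monthly.loc['2000-01'])  # the three January rows
print(monthly.loc['2000-02'])  # the three February rows
```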
# freq='Q-JAN': January is the end month of the last quarter of each fiscal year
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts3 = pd.Series(np.arange(len(rng)), index=rng)
ts3

'''
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
'''

# to_timestamp() converts the periods to timestamps
# how defaults to 'start', using the first instant of each period as the index; how='end' is also available
ts3.to_timestamp()

'''
2010-08-01    0
2010-11-01    1
2011-02-01    2
2011-05-01    3
2011-08-01    4
2011-11-01    5
Freq: QS-NOV, dtype: int32
'''
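For comparison, a sketch of the how='end' variant mentioned above, which labels each quarter with its final instant instead:

```python
import numpy as np
import pandas as pd

# Same Q-JAN quarterly series as above, converted with how='end'
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts3 = pd.Series(np.arange(len(rng)), index=rng)
end_idx = ts3.to_timestamp(how='end').index
print(end_idx[0])   # the last instant of 2011Q3, i.e. on 2010-10-31
print(end_idx[-1])  # the last instant of 2012Q4, i.e. on 2012-01-31
```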

Generate PeriodIndex from array

data = pd.read_csv('examples/macrodata.csv')
data.head(5)

data.year.head()

'''
0    1959.0
1    1959.0
2    1959.0
3    1959.0
4    1960.0
Name: year, dtype: float64
'''

data.quarter.head()

'''
0    1.0
1    2.0
2    3.0
3    4.0
4    1.0
Name: quarter, dtype: float64
'''

index = pd.PeriodIndex(year=data.year, quarter=data.quarter,
                       freq='Q-DEC')
index

'''
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
'''

data.index = index
data.head()
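Without the macrodata.csv file at hand, the same kind of PeriodIndex can be sketched from plain year/quarter lists; building it from formatted strings sidesteps version differences in the PeriodIndex(year=..., quarter=...) keyword form:

```python
import pandas as pd

# Simulated year/quarter columns (stand-ins for data.year / data.quarter)
years = [1959, 1959, 1959, 1959, 1960]
quarters = [1, 2, 3, 4, 1]
idx = pd.PeriodIndex([f'{y}Q{q}' for y, q in zip(years, quarters)],
                     freq='Q-DEC')
print(idx)
```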

6, Resampling and frequency conversion

Resampling: the process of converting a time series from one frequency to another

Down sampling: aggregates higher frequencies to lower frequencies

Up sampling: converts low frequencies to high frequencies

However, not all resampling falls into these two categories; for example, converting W-WED (weekly, anchored on Wednesday) to W-FRI is neither

The resample() method on pandas objects performs frequency conversion; its API is similar to groupby()

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

'''
2000-01-01    0.631634
2000-01-02   -1.594313
2000-01-03   -1.519937
2000-01-04    1.108752
2000-01-05    1.255853
                ...   
2000-04-05   -0.423776
2000-04-06    0.789740
2000-04-07    0.937568
2000-04-08   -2.253294
2000-04-09   -1.772919
Freq: D, Length: 100, dtype: float64
'''

ts.resample('M').mean()

'''
2000-01-31   -0.165893
2000-02-29    0.078606
2000-03-31    0.223811
2000-04-30   -0.063643
Freq: M, dtype: float64
'''

# kind='period' makes the index in the result a PeriodIndex
ts.resample('M', kind='period').mean()

'''
2000-01   -0.165893
2000-02    0.078606
2000-03    0.223811
2000-04   -0.063643
Freq: M, dtype: float64
'''
Parameters of the resample() method

freq: the resampling frequency, as a string or DateOffset object (e.g. '5min', 'M', or Second(1))
axis: the axis to resample along; defaults to axis=0
fill_method: how to interpolate when upsampling, e.g. 'ffill' or 'bfill'; no interpolation by default
closed: in downsampling, which end of each interval is closed (inclusive), 'right' or 'left'
label: in downsampling, whether to label the aggregated result with the 'right' or 'left' bin edge (e.g. the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
loffset: a time adjustment for the bin labels (e.g. '-1s' / Second(-1) shifts the aggregate labels one second earlier)
limit: when forward- or backward-filling, the maximum number of periods to fill
kind: aggregate to periods ('period') or timestamps ('timestamp'); defaults to the kind of index the time series has
convention: when resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency; defaults to 'start'

Down sampling

Things to consider when using resample() to sample down:

  • Which side of each interval is closed
  • How to mark each aggregated box at the beginning or end of the interval
rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)
ts

'''
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32
'''

# The default closed='left' divides the index into intervals [00:00, 00:05), [00:05, 00:10), ...
# and aggregates the data within each interval
# By default label='left', so the left edge of each interval becomes the row label in the result
ts.resample('5min', closed='left').sum()

'''
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32
'''

ts.resample('5min', closed='right').sum()

'''
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
'''

ts.resample('5min', closed='right', label='right').sum()

'''
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
'''

# loffset='1s' shifts each label in the result one second later
ts.resample('5min', closed='right',
            label='right', loffset='1s').sum()

'''
2000-01-01 00:00:01     0
2000-01-01 00:05:01    15
2000-01-01 00:10:01    40
2000-01-01 00:15:01    11
Freq: 5T, dtype: int32
'''
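Note that loffset was deprecated and later removed in newer pandas; shifting the result's index afterwards produces the same labels. A sketch using only the minute-frequency series from above:

```python
import numpy as np
import pandas as pd

# Equivalent of loffset='1s' for pandas versions without loffset:
# resample normally, then shift the resulting index by one second
rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)
result = ts.resample('5min', closed='right', label='right').sum()
result.index = result.index + pd.Timedelta('1s')
print(result)
```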

Open-high-low-close (OHLC) resampling

# ohlc() computes the first (open), maximum (high), minimum (low), and last (close) value in each interval
ts.resample('5min').ohlc()
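On the 12-point minute series from above, the OHLC of each 5-minute bin can be checked by hand; a self-contained sketch:

```python
import numpy as np
import pandas as pd

# Same 12-minute series as above: values 0..11 at minute frequency
rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)
ohlc = ts.resample('5min').ohlc()
print(ohlc)
# first bin (00:00-00:04): open=0, high=4, low=0, close=4
```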

Upsampling and interpolation

# When upsampling, aggregation is not required, and asfreq() is used for frequency conversion
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

# When asfreq() is used for frequency conversion, missing values will be introduced when converting from low frequency to high frequency
df_daily = frame.resample('D').asfreq()
df_daily

# To avoid introducing missing values, forward-fill them with ffill()
frame.resample('D').ffill()

# Passing limit=2 to ffill() fills forward at most two rows
frame.resample('D').ffill(limit=2)

frame.resample('W-THU').ffill()

Resampling with Periods

frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',
                                           freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

# Downsample to annual frequency 'A-DEC' (calendar years ending in December)
annual_frame = frame.resample('A-DEC').mean()
annual_frame

# Upsample with ffill(); convention defaults to 'start'
# Q-DEC: quarterly, with December as the last month of the year
annual_frame.resample('Q-DEC').ffill()

# Upsample with ffill() and convention='end'
# The resulting index runs from the last quarter of the first year to the last quarter of the last year
annual_frame.resample('Q-DEC', convention='end').ffill()

When downsampling, the source frequency must be a subperiod of the target frequency (e.g. months within a year)
When upsampling, the source frequency must be a superperiod of the target frequency (e.g. a year containing months)
Work through the following operations carefully; sketching them on paper helps

annual_frame.resample('Q-MAR').ffill()

annual_frame.resample('Q-MAR', convention='end').ffill()

annual_frame.resample('Q-APR').ffill()

annual_frame.resample('Q-APR', convention='end').ffill()
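The effect of convention is easiest to see on a single Period, since it corresponds to the how argument of asfreq when moving from annual to quarterly frequency; a sketch:

```python
import pandas as pd

# convention='start' / 'end' in resample corresponds to
# how='start' / 'end' in Period.asfreq
p = pd.Period('2000', freq='A-DEC')
print(p.asfreq('Q-DEC', how='start'))  # 2000Q1
print(p.asfreq('Q-DEC', how='end'))    # 2000Q4
# With Q-MAR, quarters belong to a fiscal year ending in March:
print(p.asfreq('Q-MAR', how='start'))  # 2000Q4 (Jan-Mar 2000)
print(p.asfreq('Q-MAR', how='end'))    # 2001Q3 (Oct-Dec 2000)
```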

7, Moving window functions

Like other statistical functions, moving window functions automatically exclude missing data

close_px_all = pd.read_csv('examples/stock_px_2.csv',
                           parse_dates=True, index_col=0)
close_px = close_px_all[['AAPL', 'MSFT', 'XOM']]
close_px

# Resample to business-day frequency, forward-filling
close_px = close_px.resample('B').ffill()
close_px

%matplotlib inline
close_px.AAPL.plot()
# rolling() can be called on a Series or DataFrame with a window size (the 250 below)
# Here each point is the mean over a sliding window of 250 observations
close_px.AAPL.rolling(250).mean().plot()

# Passing min_periods=10 computes the standard deviation as soon as 10 observations are available,
# then over the first 11, and so on, until the window reaches the full 250 observations
appl_std250 = close_px.AAPL.rolling(250, min_periods=10).std()
appl_std250[5:12]

'''
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64
'''

appl_std250.plot()

# expanding() grows the window from the start of the series: the left edge stays fixed while the right edge advances; here the mean of the expanding window is taken
# e.g. the 2003-01-16 value is the mean of the original 01-15 and 01-16 values
# and the 2003-01-17 value is the mean of the original 01-15 through 01-17 values
expanding_mean = appl_std250.expanding().mean()
expanding_mean[5:12]

'''
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.076128
2003-01-17    0.088208
Freq: B, Name: AAPL, dtype: float64
'''

# Pass logy=True to use a logarithmic scale on the y axis
close_px.rolling(60).mean().plot(logy=True)

# Passing '20D' uses a fixed window of 20 calendar days; the 250 passed earlier counted 250 index positions (observations)
close_px.rolling('20D').mean().head()

# Passing '3D' uses a fixed 3-calendar-day window for the moving average
# So the 2003-01-06 mean equals the original value, since there is no data on the 4th or 5th
close_px.rolling('3D').mean().head()
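The offset-window behavior can be checked on a tiny series with a gap; by default the window is the half-open interval (t - 3 days, t], so an observation exactly three days earlier drops out:

```python
import pandas as pd

# A daily series with a gap: no data on Jan 3 and Jan 4
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02',
                                    '2020-01-05']))
m = s.rolling('3D').mean()
print(m)  # 1.0, 1.5, then 3.0 (the Jan 5 window contains only Jan 5)
```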

Exponentially weighted functions

# span=30 sets the decay factor alpha = 2/31, so successive weights are 2/31, (2/31) * (29/31), ...
aapl_px = close_px.AAPL['2006':'2007']
# Simple moving average
ma60 = aapl_px.rolling(30, min_periods=20).mean()
# Exponential weighted average
ewma60 = aapl_px.ewm(span=30).mean()
ma60.plot(style='k--', label='Simple MA')
ewma60.plot(style='k-', label='EW MA')
plt.legend()
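The EW weights can be verified by hand on a short series; with the default adjust=True, ewm(span=...).mean() is a weighted average of past observations with weights (1 - alpha)^k:

```python
import numpy as np
import pandas as pd

# With span=3, the decay factor is alpha = 2/(span+1) = 0.5
s = pd.Series([1.0, 2.0, 3.0])
ewm = s.ewm(span=3).mean()
alpha = 2 / (3 + 1)
w = (1 - alpha) ** np.arange(2, -1, -1)   # weights for x0, x1, x2
manual = (w * s.values).sum() / w.sum()   # normalized weighted average
print(ewm.iloc[-1], manual)               # both equal 17/7
```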

Binary moving window functions

Some statistics, such as correlation (corr()) and covariance, operate on two time series over the same moving window

spx_px = close_px_all['SPX']
spx_rets = spx_px.pct_change()
spx_rets

'''
2003-01-02         NaN
2003-01-03   -0.000484
2003-01-06    0.022474
2003-01-07   -0.006545
2003-01-08   -0.014086
                ...   
2011-10-10    0.034125
2011-10-11    0.000544
2011-10-12    0.009795
2011-10-13   -0.002974
2011-10-14    0.017380
Name: SPX, Length: 2214, dtype: float64
'''

returns = close_px.pct_change()
returns

# Note that returns.AAPL has 2292 rows while spx_rets has 2214,
# so their row indexes are not identical when computing the moving window
# (pandas aligns on the index first; the extra rows likely come from resampling close_px to business-day frequency, which fills in market holidays)
# corr() calculates the rolling correlation coefficient
corr = returns.AAPL.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()

corr = returns.rolling(125, min_periods=100).corr(spx_rets)
corr.plot()

User-defined moving window functions

# Using apply() on rolling() and related methods can use custom array functions in the moving window
from scipy.stats import percentileofscore
# The percentage of values in the window x that fall at or below 0.02
score_at_2percent = lambda x: percentileofscore(x, 0.02)
result = returns.AAPL.rolling(250).apply(score_at_2percent)
result.plot()
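The same pattern works without scipy; a sketch with a hypothetical threshold function (the 0.02 cutoff and the sample values are just illustrative choices):

```python
import numpy as np
import pandas as pd

# Custom window function: fraction of window values below a threshold
# raw=True passes a plain ndarray to the function, which is faster
s = pd.Series([0.01, 0.05, -0.02, 0.03, 0.00])
frac_below = s.rolling(3).apply(lambda x: (x < 0.02).mean(), raw=True)
print(frac_below)  # NaN until the window fills, then 2/3, 1/3, 2/3
```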

Topics: Python Data Analysis