Polynomial regression, local regression, kernel smoothing and smoothing spline regression models in R language

Posted by manalnor on Tue, 08 Mar 2022 19:24:26 +0100

Turn:

Polynomial regression, local regression, kernel smoothing and smoothing spline regression models in R language

Original link: http://tecdat.cn/?p=20531

In the standard linear model, we assume. When the linear assumption cannot be satisfied, other methods can be considered.

polynomial regression

The extension may assume some polynomial functions,

Similarly, in the standard linear model method (using the conditional normal distribution of GLM), the parameter  can be obtained by the least square method, where  is in.

Even if this polynomial model is not a real polynomial model, it may still be a good approximation. In fact, according to the ， stone Weierstrass theorem, if ， is continuous in an interval, there is a unified approximation through a polynomial function.

For illustration only, please consider the following data sets

 db = data.frame(x=xr,y=yr)
plot(db)

Standard regression line

reg = lm(y ~ x,data=db)
abline(reg,col="red")

Consider some polynomial regression. If the degree of polynomial function is large enough, any model can be obtained,

reg=lm(y~poly(x,5),data=db)

However, if the number of times is too large, too many "fluctuations" will be obtained,

reg=lm(y~poly(x,25),data=db)

And estimates may be unreliable: if we change a point, a (local) change may occur

 yrm=yr;yrm[31]=yr[31]-2 
lines(xr,predict(regm),col="red")

Local regression

In fact, if our interest is that there is a good approximation locally, why not use local regression?

This can be easily done using weighted regression, which we consider in the least squares formula

Here, I consider the linear model, but I can consider any polynomial model. In this case, the optimization problem is

It can be solved because

For example, if we want to make a prediction at a certain time, consider. Using this model, we can delete observations that are too far away,

A more general idea is to consider some kernel functions, give the weight function, and give some bandwidth of the neighborhood length (usually expressed as h),

This is actually the so-called nadaraya Watson function estimator.
In the previous case, we considered the unified core,

However, using this weight function with strong discontinuity is not the best choice. Try Gaussian kernel,

This can be used

 w=dnorm((xr-x0))
reg=lm(y~1,data=db,weights=w)

On our dataset, we can draw

 w=dnorm((xr-x0))
plot(db,cex=abs(w)*4)
lines(ul,vl0,col="red")
axis(3)
axis(2)
reg=lm(y~1,data=db,weights=w)
u=seq(0,10,by=.02)
v=predict(reg,newdata=data.frame(x=u))
lines(u,v,col="red",lwd=2)

Here, we need to perform local regression at point 2. The horizontal line below is regression (the size of the point is proportional to the width). The red curve is the evolution of local regression

Let's use animation to visualize the curve.

However, for some reasons, I can't easily install the package on Linux. We can use loops to generate some graphics

 name=paste("local-reg-",100+i,".png",sep="")
png(name,600,400)
 

for(i in 1:length(vx0)) graph (i)

Then I use

Of course, the local linear model can be considered,

 return(predict(reg,newdata=data.frame(x=x0)))}

Even quadratic (local) regression,

 lm(y~poly(x,degree=2), weights=w)

Of course, we can change the bandwidth

Note that in fact, we have to choose the weight function (the so-called kernel). However, there are (simple) ways to choose the "best" bandwidth h. The idea of cross validation is to consider

Is a prediction obtained using local regression.

We can try some real data.

library(XML)
 
data = readHTMLTable(html)

Organize data sets,

 plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,

We calculate the standard error to reflect the uncertainty.

 for(s in 1:8){reg=lm(mu~no,data=db, 
lines((s predict(reg)[1:12]

It is not a good assumption that all seasons should be considered completely independent.

 smooth(db$no,db$mu,kernel = "normal",band=5)

We can try to see curves with larger bandwidth.

db$mu[95]=7
  
plot(data$no,data$mu

lines(NW,col="red")

Spline smoothing

Next, the smoothing method in regression is discussed. Let's assume that it's some unknown function, but let's assume it's smooth enough. For example, suppose ， is continuous, exists, and is continuous, exists, and is also continuous, and so on. If {it is smooth enough, Taylor expansion can be used. Therefore, for

It can also be written as

The first part is just a polynomial.

Using the Riemann integral, it is observed that

So,

We have a linear regression model. A natural idea is to consider regression

Give some nodes.

plot(db)

If we consider a node and extend order 1,

 B=bs(xr,knots=c(3),Boundary.knots=c(0,10),degre=1)
 
lines(xr[xr<=3],predict(reg)[xr<=3],col="red")
lines(xr[xr>=3],predict(reg)[xr>=3],col="blue")

The prediction obtained with this spline can be compared with the regression on the subset (dotted line).

 lines(xr[xr<=3],predict(reg)[xr<=3


lm(yr~xr,subset=xr>=3)

This is different because here we have three parameters (about the regression of two subsets). When the model loses a degree of freedom. It is observed that it can be written equivalently

 lm(yr~bs(xr,knots=c(3),Boundary.knots=c(0,10)

The functions that appear in the regression are as follows

Now, if we regress these two components, we get

 matplot(xr,B

abline(v=c(0,2,5,10),lty=2)

If we add a node, we get

Forecast is

 lines(xr,predict(reg),col="red")

We can select more nodes

 lines(xr,predict(reg),col="red")

We can get a confidence interval

 polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3]))

points(db)

If we keep the two previously selected nodes, but consider Taylor's second-order expansion, we get

 matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

If we consider constants and the first part based on splines, we get

 B=cbind(1,B)
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

If we add the constant term, the first term and the second term, the part we get is on the left before the first node,

k=3
lines(xr,B[,1:k]%*%coefficients(reg)[1:k]

Through the three terms in the spline based matrix, we can get the part between the two nodes,

 lines(xr,B[,1:k]%*%coefficients(reg)[1:k]

Finally, when we sum them, this time is the right part after the last node,

k=5

This is the result of using quadratic spline regression with two (fixed) nodes. Confidence intervals can be obtained as before

 polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3]))

points(db)
lines(xr,P[,1],col="red")

Using the function, you can ensure the continuity of points.

Again, using linear splines, you can add continuity constraints,

 lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),


lines(c(1:94,96),predict(reg),col="red")

But we can also consider quadratic splines,

 abline(v=12*(0:8)+.5,lty=2)

lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),