Catalogue of series articles
Experiment 1 R language data structure, data import and data processing
Experiment 2 basic data processing
Experiment 3 data visualization
Experiment 5 comprehensive application
experimental data
1. systation dataset
systation.csv contains the longitude and latitude data of relevant lines of Shenyang metro station. The fields are described in the following table:
station | Station name | line | line number |
line_id | Station number | gps_lon | Longitude of station |
gps_lat | Dimension of station |
2. SY-20150401 data set
SY-20150401.csv provides all-in-one card swiping data. The fields are described in the following table:
V1 | Card number | V2 | Card swiping date |
V3 | Card swiping time | V4 | Card swiping site |
V5 | Card type | V6 | Ticket Price |
V7 | Is it preferential |
Note: after entering data, rename the field name to c("card_id", "date", "time", "station", "vehicle", "money", "property")
Tip: the following is the main content of this article
1, Experimental purpose
After learning some data processing methods of R, it is recommended to learn other third-party packages for data processing, such as data table.
Recommended learning packages: dplyr, tidyr and lubridate
2, Experimental content
Title 1 data processing [only SY-20150401.csv]
Count the flow in and out of each station every 5 minutes (Note: 00:00:01 in the first 5 minutes and 00:10:13 in the third 5 minutes). Since you may take the subway many times a day, query the latest departure time according to the card number and arrival time as the departure time.
Reference function lubridate::hms, lubridate:: period_to_seconds.
The final result of processing: dataframe (name trade.metro.in.out).
card.id | Card id | time.in | Arrival time |
line.in | Inbound route | station.in | Inbound station |
M5.in | What's the fifth minute | time.out | Outbound time |
line.out | Outbound line | station.out | Outbound site |
money | Ticket Price | M5.out | The 5th minute exit |
duration | Ride time (in seconds) |
The form submitted shall be in accordance with card ID sort. The output result displayed in the Notebook is head(trade.metro.in.out,10), and the local file shmetro is output_ line_ in_ out. csv.
Topic 2 Statistics of traffic between inbound and outbound
Make statistics through the dataframe (trade.metro.in.out) in topic 1 to count the traffic between inbound and outbound; then select the top 10 sites with the largest traffic and view the top 6 in the Notebook.
3, Implementation process and experimental results
Title 1 data processing [only SY-20150401.csv]
Count the flow in and out of each station every 5 minutes (Note: 00:00:01 in the first 5 minutes and 00:10:13 in the third 5 minutes). Since you may take the subway many times a day, query the latest departure time according to the card number and arrival time as the departure time.
The form submitted shall be in accordance with card ID sort. The output result displayed in the Notebook is head(trade.metro.in.out,10) and the local file shmetro is output_ line_ in_ out. csv.
Final result of processing: dataframe (name trade.metro.in.out)
card.id | Card id | time.in | Arrival time |
line.in | Inbound route | station.in | Inbound station |
M5.in | What's the fifth minute | time.out | Outbound time |
line.out | Outbound line | station.out | Outbound site |
money | Ticket Price | M5.out | The 5th minute exit |
duration | Ride time (in seconds) |
# Read data df <- read.csv("R\\data\\ex5\\SY-20150401.csv")
# Data dimension dim(df) ## [1] 1048575 7
# View data head(df) ## X2002062859 X2015. 4.1 X16. 41.37 Metro X0 of Industrial Exhibition Hall Station of line x2 non preferential ## 1 3103531702 2015 / 4 / 1 9:23:11 Metro Line 2 Financial Center Station 4 non preferential ## 2 3001771678 2015 / 4 / 1 16:41:23 Metro 0 Non discount at 13th Street Station of line 1 ## 3 400679513 2015 / 4 / 1 16:47:09 line 1 Yingbin Road Station subway 0 Non preferential ## 4 3102186320 2015 / 4 / 1 18:58:08 line 1 subway station on 7th Street 4 non preferential ## 5 740476054 2015 / 4 / 1 11:36:33 subway 0 of Shifu Square Station of line 2 non preferential ## 6 2301712030 2015 / 4 / 1 18:04:23 Kaifa Avenue Station Metro 0 Non preferential
# Field rename colnames(df) ## [1] "X2002062859" "X2015.4.1" "X16.41.37" ## [4] "Industrial Exhibition Hall Station of line x2", "subway", "X0" ## [7] "Non preferential" colnames(df) <- c('card.id','date','time','station','vehicle','money','property') colnames(df) ## [1] "card.id" "date" "time" "station" "vehicle" "money" "property"
# Take out the data with card swiping type of subway trade.metro <- df[df$vehicle == "metro",]
# The card swiping station (Financial Center Station of line 2) is divided into two fields line(2) and station according to line # install.packages("tidyr") library(tidyr) trade.metro <- separate(trade.metro, station, c('line', 'station'), sep = 'Line No') head(trade.metro) ## card.id date time line station vehicle money property ## 1 3103531702 2015 / 4 / 1 9:23:11 2 Financial Center Station subway 4 non discount ## 2 3001771678 2015 / 4 / 1 16:41:23 1 13th Street Station subway 0 Non discount ## 3 400679513 2015 / 4 / 1 16:47:09 1 Yingbin Road Station subway 0 Non discount ## 4 3102186320 2015 / 4 / 1 18:58:08 1 7th Street Station subway 4 non discount ## 5 740476054 2015 / 4 / 1 11:36:33 2 Shifu Square Station subway 0 Non discount ## 6 2301712030 2015 / 4 / 1 18:04:23 1 Kaifa Avenue station subway 0 Non preferential
# A package dedicated to data processing # mutate() creates a new variable mutate (the name of the data frame to be modified, the name of the new variable to be created = the value to be assigned to the new variable) # install.packages('dplyr') library(dplyr) # Date time processing # period_to_seconds converts the timestamp in seconds; hms (seconds, minutes, hours, days) is used to create an object of type hms # install.packages('lubridate') library(lubridate) # %>%The pipeline operation forwards the result of one value or expression to the next function call / expression # Count time by 5 minutes trade.metro <- trade.metro %>% mutate(M5 = ceiling(period_to_seconds(hms(time)) / 300)) head(trade.metro) ## card.id date time line station vehicle money property M5 ## 1 3103531702 2015 / 4 / 1 9:23:11 2 Financial Center Station subway 4 non discount 113 ## 2 3001771678 2015 / 4 / 1 16:41:23 1 13th Street Station subway 0 Non discount 201 ## 3 400679513 2015 / 4 / 1 16:47:09 1 Yingbin Road Station subway 0 Non discount 202 ## 4 3102186320 2015 / 4 / 1 18:58:08 1 subway station on 7th Street 4 non discount 228 ## 5 740476054 2015 / 4 / 1 11:36:33 2 Shifu Square Station subway 0 Non discount 140 ## 6 2301712030 2015 / 4 / 1 18:04:23 1 Kaifa Avenue station subway 0 Non discount 217
# Inbound data statistics: money=0 trade.metro.in <- trade.metro %>% filter(money == 0) %>% select( card.id, "time.in" = time, "line.in" = line, "station.in" = station, "M5.in" = M5 ) head(trade.metro.in) ## card.id time.in line.in station.in M5.in ## 1 3001771678 16:41:23 1 13th Street Station 201 ## 2 400679513 16:47:09 1 Yingbin Road Station 202 ## 3 740476054 11:36:33 2 Shifu Square Station 140 ## 4 2301712030 18:04:23 1 Kaifa Avenue Station 217 ## 5 2302067279 11:30:22 2 Medical University Station 139 ## 6 2202626963 18:59:08 1 Central Street Station 228
# Outbound data statistics, inbound money > 0 trade.metro.out <- trade.metro %>% filter(money > 0) %>% select( card.id, "time.out" = time, "line.out" = line, "station.out" = station, "M5.out" = M5 ) head(trade.metro.out) ## card.id time.out line.out station.out M5.out ## 1 3103531702 9:23:11 2 Financial Center Station 113 ## 2 3102186320 18:58:08 1 7th Street Station 228 ## 3 2802877677 18:58:49 1 Tiexi Square Station 228 ## 4 2101871077 19:00:57 1 Zhangshi station 229 ## 5 2503379038 11:32:21 2 Shifu Square Station 139 ## 6 2402957072 9:22:36 2 Financial Center Station 113
# Merge the inbound data and outbound data to calculate the ride duration, and take out the data with the ride duration greater than 0 trade.metro.in.out <- merge(trade.metro.in, trade.metro.out, by = "card.id") %>% mutate(duration = (period_to_seconds(hms(time.out)) - period_to_seconds(hms(time.in)))) %>% filter(duration > 0) head(trade.metro.in.out) ## card.id time.in line.in station.in M5.in time.out line.out ## 1 6032 14:17:33 2 University of Aeronautics and Astronautics station 172 15:34:24 2 ## 2 56702 8:45:15 1 Kaifa Avenue Station 106 14:09:12 1 ## 3 68597 17:26:19 2 municipal library station 210 17:30:53 2 ## 4 71082 10:30:22 2 Financial Center Station 127 10:59:12 2 ## 5 71082 9:37:14 2 Xinle site station 116 10:59:12 2 ## 6 71082 9:37:14 2 Xinle site station 116 9:59:40 2 ## station.out M5.out duration ## 1 University of Aeronautics and Astronautics station 187 4611 ## 2 Kaifa Avenue Station 170 19437 ## 3 Wulihe station 211 274 ## 4 Xinle ruins station 132 1730 ## 5 Xinle ruins station 132 4918 ## 6 Financial Center station 120 1346
# Query the latest outbound time according to the card number and inbound time as the outbound time # install.packages("data.table") library(data.table) trade.metro.in.out <- data.table(trade.metro.in.out) trade.metro.in.out[, duration_min := min(duration), by=list(card.id, M5.in)] trade.metro.in.out <- trade.metro.in.out %>% filter(duration==duration_min) %>% select(-duration_min) head(trade.metro.in.out) ## card.id time.in line.in station.in M5.in time.out line.out ## 1: 6032 14:17:33 2 University of Aeronautics and Astronautics station 172 15:34:24 2 ## 2: 56702 8:45:15 1 Kaifa Avenue Station 106 14:09:12 1 ## 3: 68597 17:26:19 2 municipal library station 210 17:30:53 2 ## 4: 71082 10:30:22 2 Financial Center Station 127 10:59:12 2 ## 5: 71082 9:37:14 2 Xinle site station 116 9:59:40 2 ## 6: 72845 8:46:10 1 13th Street Station 106 19:43:11 1 ## station.out M5.out duration ## 1: University of Aeronautics and Astronautics station 187 4611 ## 2: Kaifa Avenue Station 170 19437 ## 3: Wulihe station 211 274 ## 4: Xinle ruins station 132 1730 ## 5: Financial center station 120 1346 ## 6: 13th Street Station 237 39421
# According to card ID to sort and display the top 10 items trade.metro.in.out <- trade.metro.in.out[order(trade.metro.in.out$card.id),] head(trade.metro.in.out,10) ## card.id time.in line.in station.in M5.in time.out line.out ## 1: 6032 14:17:33 2 University of Aeronautics and Astronautics station 172 15:34:24 2 ## 2: 56702 8:45:15 1 Kaifa Avenue Station 106 14:09:12 1 ## 3: 68597 17:26:19 2 municipal library station 210 17:30:53 2 ## 4: 71082 10:30:22 2 Financial Center Station 127 10:59:12 2 ## 5: 71082 9:37:14 2 Xinle site station 116 9:59:40 2 ## 6: 72845 8:46:10 1 13th Street Station 106 19:43:11 1 ## 7: 796 7:53:16 1 Huaiyuanmen station 95 22:19:52 1 ## 8: 88873 14:01:19 1 Tiexi Square Station 169 14:17:02 1 ## 9: 90573 7:47:43 1 Yingbin Road Station 94 8:21:41 1 ## 10: 95504 5:55:08 1 South Market Station 72 6:29:52 1 ## station.out M5.out duration ## 1: University of Aeronautics and Astronautics station 187 4611 ## 2: Kaifa Avenue Station 170 19437 ## 3: Wulihe station 211 274 ## 4: Xinle ruins station 132 1730 ## 5: Financial center station 120 1346 ## 6: 13th Street Station 237 39421 ## 7: Huaiyuanmen station 268 51996 ## 8: Yuhong Square Station 172 943 ## 9: Pangjiang Street station 101 2038 ## 10: 7th Street Station 78 2084
# Export local file write.csv(trade.metro.in.out,"shmetro_line_in_out.csv",row.names = F)
Topic 2 Statistics of traffic between inbound and outbound
Statistics of inbound and outbound traffic: make statistics through the dataframe (trade.metro.in.out) of topic 1 to count the inbound and outbound traffic; Then select the top 10 sites with the largest traffic and view the top 6 in the Notebook.
# Press station in,station. Out grouping trade.metro.in.out <- data.table(trade.metro.in.out) a <- trade.metro.in.out[, .(passenger.flow = .N), keyby = .(station.in, station.out)] head(a, 10) ## station.in station.out passenger.flow ## 1: No. 7 Street Station No. 7 Street Station 4470 ## 2: No. 7 Street Station Santaizi station 54 ## 3: No. 7 Street Station century building station 18 ## 4: No.7 Street Station East Middle Street Station 165 ## 5: No. 7 Street Station University of traditional Chinese medicine Station 92 ## 6: No.7 Street Station Central Street Station 224 ## 7: No.7 Street station, Middle Street Station 147 ## 8: No. 7 Street station at Hong Square Station 442 ## 9: No.7 Street Station Yunfeng North Street Station 226 ## 10: No.7 Street Station Wulihe station 46
# Take out the top 10 according to the flow in and out of the station a <- a[order(a$passenger.flow, decreasing = TRUE), ] b <- a[1:10, ]
# Show the first 6 head(b, 6) ## station.in station.out passenger.flow ## 1: 13th Street Station 6345 13th Street Station ## 2: No. 7 Street Station No. 7 Street Station 4470 ## 3: University of Aeronautics and Astronautics station 3035 ## 4: Quanyun Road Station quanyun Road Station 2717 ## 5: Yingpan Street Station ## 6: Pangjiang Street Station Pangjiang Street Station 1996