
🚴 Unlocking Insights: A Data-Driven Analysis of Cyclistic Bike-Share Trends


🤔 PHASE 1: ASK

Introduction

This study analyses the differences in behaviour between casual riders and annual members of Cyclistic Bike-Share, using data science techniques to uncover usage patterns. The goal is to support Cyclistic’s strategy of increasing annual memberships, as they are more profitable. Insights from the analysis will guide a targeted marketing campaign to convert casual users into annual members, helping the company grow inclusively and sustainably.

Company Background

Cyclistic, based in Chicago, provides a bike-share program with a fleet of bicycles available for public use across the city. It offers single-ride passes, full-day passes, and annual memberships. Operating on a subscription-based model, Cyclistic caters to both casual riders and annual members, emphasising an inclusive and accessible transportation option for diverse user groups.

Business Task: The objective is to analyse Cyclistic’s historical bike trip data to identify differences in how casual riders and annual members use the service. The insights from this analysis will inform a new marketing strategy aimed at encouraging casual riders to transition into annual members.

Key Stakeholders

  • Lily Moreno: The director of marketing and your manager. Responsible for developing campaigns.
  • Cyclistic Marketing Analytics Team: Tasked with gathering, analysing, and reporting data to support marketing strategies.
  • Cyclistic Executive Team: Responsible for reviewing and approving the proposed marketing initiatives.

Key Questions

Assumptions and Limitations

Assumptions:

Limitations:


πŸ‘©β€πŸ³ PREPARE PHASE

Data Location and Organisation

Data Credibility and Bias

Licensing, Privacy, Security, and Accessibility

Data Integrity

Relevance to the Business Question

The data contains information about trips by casual users and annual members, which is directly relevant to the business question of understanding how the two types of users use Cyclistic bikes differently. Analysing this data can provide insights to inform a marketing strategy aimed at converting casual users into annual members.

Potential Issues

Potential issues include:

The data may not contain additional useful information, such as demographic details of users or specific reasons for choosing casual travel over a subscription.

Key Tasks

  1. Download the data and store it appropriately: The code below accomplishes this by downloading the zip archives and extracting the CSV files into a local directory.
  2. Identify how the data is organised: Explore the dataset to understand its structure and what each column represents.
# Define the base URL
base_url <- "https://divvy-tripdata.s3.amazonaws.com/"

# Define the list of zip files to download
zip_files <- c(
  "Divvy_Trips_2020_Q1.zip",
  "Divvy_Trips_2019_Q4.zip",
  "Divvy_Trips_2019_Q3.zip",
  "Divvy_Trips_2019_Q2.zip"
)

# Create a directory to store the files
if (!dir.exists("divvy_data")) {
  dir.create("divvy_data")
}

# Loop through the zip files and download/unzip them
for (file in zip_files) {
  file_url <- paste0(base_url, file)
  dest_file <- file.path("divvy_data", file)  # Save to the 'divvy_data' directory

  # Download the file
  download.file(file_url, destfile = dest_file, mode = "wb")

  # Unzip the file
  unzip(dest_file, exdir = "divvy_data")  # Extract to the 'divvy_data' directory

  # Remove the zip file after unzipping (optional)
  file.remove(dest_file)
}

print("Files downloaded and unzipped successfully!")
[1] "Files downloaded and unzipped successfully!"
# Downloading necessary packages
install.packages(c("tidyverse", "lubridate", "skimr", "janitor", "ggplot2"))

# Load the packages
library(tidyverse)  # For data manipulation and visualization
library(lubridate)  # For date-time handling
library(skimr)      # For summary statistics
library(janitor)    # For cleaning column names
library(ggplot2)    # For plots

library(dplyr)      # For data wrangling
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘snakecase’


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test
# Now that I have downloaded the files and packages, I need to create paths for the datasets

files <- list.files(path = "/content/divvy_data", pattern = "*.csv", full.names = TRUE)

length(files)
print(files)

4

[1] "/content/divvy_data/Divvy_Trips_2019_Q2.csv"
[2] "/content/divvy_data/Divvy_Trips_2019_Q3.csv"
[3] "/content/divvy_data/Divvy_Trips_2019_Q4.csv"
[4] "/content/divvy_data/Divvy_Trips_2020_Q1.csv"
# Here, I want to inspect the column names of each file.
library(dplyr)

# File paths
files <- c(
  "/content/divvy_data/Divvy_Trips_2019_Q2.csv",
  "/content/divvy_data/Divvy_Trips_2019_Q3.csv",
  "/content/divvy_data/Divvy_Trips_2019_Q4.csv",
  "/content/divvy_data/Divvy_Trips_2020_Q1.csv"
)

# Function to get column names from a CSV file
get_column_names <- function(file_path) {
  tryCatch({
    readr::read_csv(file_path, n_max = 1) %>% names()
  }, error = function(e) {
    message(paste("Error reading file", file_path, ":", e$message))
    return(NULL) # Return NULL if there is an error
  })
}


# Get column names for each file
all_columns <- lapply(files, get_column_names)


# Print column names of each file
for (i in seq_along(files)){
  cat("Columns for file: ", files[i], "\n")
  print(all_columns[[i]])
  cat("\n")
}
Rows: 1 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
dbl  (6): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 01 - R...
dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
num  (1): tripduration
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (6): trip_id, bikeid, tripduration, from_station_id, to_station_id, bir...
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Columns for file:  /content/divvy_data/Divvy_Trips_2019_Q2.csv
 [1] "01 - Rental Details Rental ID"
 [2] "01 - Rental Details Local Start Time"
 [3] "01 - Rental Details Local End Time"
 [4] "01 - Rental Details Bike ID"
 [5] "01 - Rental Details Duration In Seconds Uncapped"
 [6] "03 - Rental Start Station ID"
 [7] "03 - Rental Start Station Name"
 [8] "02 - Rental End Station ID"
 [9] "02 - Rental End Station Name"
[10] "User Type"
[11] "Member Gender"
[12] "05 - Member Details Member Birthday Year"

Columns for file:  /content/divvy_data/Divvy_Trips_2019_Q3.csv
 [1] "trip_id"           "start_time"        "end_time"
 [4] "bikeid"            "tripduration"      "from_station_id"
 [7] "from_station_name" "to_station_id"     "to_station_name"
[10] "usertype"          "gender"            "birthyear"

Columns for file:  /content/divvy_data/Divvy_Trips_2019_Q4.csv
 [1] "trip_id"           "start_time"        "end_time"
 [4] "bikeid"            "tripduration"      "from_station_id"
 [7] "from_station_name" "to_station_id"     "to_station_name"
[10] "usertype"          "gender"            "birthyear"

Columns for file:  /content/divvy_data/Divvy_Trips_2020_Q1.csv
 [1] "ride_id"            "rideable_type"      "started_at"
 [4] "ended_at"           "start_station_name" "start_station_id"
 [7] "end_station_name"   "end_station_id"     "start_lat"
[10] "start_lng"          "end_lat"            "end_lng"
[13] "member_casual"
# Here, I want to understand each file's structure and what each column represents


# Function to read and glimpse a CSV file
glimpse_file <- function(file_path) {
  tryCatch({
    df <- readr::read_csv(file_path)
    glimpse(df)
    return(df) # Return the dataframe if needed later
  }, error = function(e) {
    message(paste("Error reading file", file_path, ":", e$message))
    return(NULL) # Return NULL in case of error
  })
}

# Loop through the files and get a glimpse
for (file in files) {
  cat("Structure for file:", file, "\n")
  glimpse_file(file)
  cat("\n")
}
Structure for file: /content/divvy_data/Divvy_Trips_2019_Q2.csv


Rows: 1108163 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
num  (1): 01 - Rental Details Duration In Seconds Uncapped
dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 1,108,163
Columns: 12
$ `01 - Rental Details Rental ID`                    <dbl> 22178529, 22178530,…
$ `01 - Rental Details Local Start Time`             <dttm> 2019-04-01 00:02:2…
$ `01 - Rental Details Local End Time`               <dttm> 2019-04-01 00:09:4…
$ `01 - Rental Details Bike ID`                      <dbl> 6251, 6226, 5649, 4…
$ `01 - Rental Details Duration In Seconds Uncapped` <dbl> 446, 1048, 252, 357…
$ `03 - Rental Start Station ID`                     <dbl> 81, 317, 283, 26, 2…
$ `03 - Rental Start Station Name`                   <chr> "Daley Center Plaza…
$ `02 - Rental End Station ID`                       <dbl> 56, 59, 174, 133, 1…
$ `02 - Rental End Station Name`                     <chr> "Desplaines St & Ki…
$ `User Type`                                        <chr> "Subscriber", "Subs…
$ `Member Gender`                                    <chr> "Male", "Female", "…
$ `05 - Member Details Member Birthday Year`         <dbl> 1975, 1984, 1990, 1…

Structure for file: /content/divvy_data/Divvy_Trips_2019_Q3.csv


Rows: 1640718 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
num  (1): tripduration
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 1,640,718
Columns: 12
$ trip_id           <dbl> 23479388, 23479389, 23479390, 23479391, 23479392, 23…
$ start_time        <dttm> 2019-07-01 00:00:27, 2019-07-01 00:01:16, 2019-07-0…
$ end_time          <dttm> 2019-07-01 00:20:41, 2019-07-01 00:18:44, 2019-07-0…
$ bikeid            <dbl> 3591, 5353, 6180, 5540, 6014, 4941, 3770, 5442, 2957…
$ tripduration      <dbl> 1214, 1048, 1554, 1503, 1213, 310, 1248, 1550, 1583,…
$ from_station_id   <dbl> 117, 381, 313, 313, 168, 300, 168, 313, 43, 43, 511,…
$ from_station_name <chr> "Wilton Ave & Belmont Ave", "Western Ave & Monroe St…
$ to_station_id     <dbl> 497, 203, 144, 144, 62, 232, 62, 144, 195, 195, 84, …
$ to_station_name   <chr> "Kimball Ave & Belmont Ave", "Western Ave & 21st St"…
$ usertype          <chr> "Subscriber", "Customer", "Customer", "Customer", "C…
$ gender            <chr> "Male", NA, NA, NA, NA, "Male", NA, NA, NA, NA, NA, …
$ birthyear         <dbl> 1992, NA, NA, NA, NA, 1990, NA, NA, NA, NA, NA, NA, …

Structure for file: /content/divvy_data/Divvy_Trips_2019_Q4.csv


Rows: 704054 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
num  (1): tripduration
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 704,054
Columns: 12
$ trip_id           <dbl> 25223640, 25223641, 25223642, 25223643, 25223644, 25…
$ start_time        <dttm> 2019-10-01 00:01:39, 2019-10-01 00:02:16, 2019-10-0…
$ end_time          <dttm> 2019-10-01 00:17:20, 2019-10-01 00:06:34, 2019-10-0…
$ bikeid            <dbl> 2215, 6328, 3003, 3275, 5294, 1891, 1061, 1274, 6011…
$ tripduration      <dbl> 940, 258, 850, 2350, 1867, 373, 1072, 1458, 1437, 83…
$ from_station_id   <dbl> 20, 19, 84, 313, 210, 156, 84, 156, 156, 336, 77, 19…
$ from_station_name <chr> "Sheffield Ave & Kingsbury St", "Throop (Loomis) St …
$ to_station_id     <dbl> 309, 241, 199, 290, 382, 226, 142, 463, 463, 336, 50…
$ to_station_name   <chr> "Leavitt St & Armitage Ave", "Morgan St & Polk St", …
$ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
$ gender            <chr> "Male", "Male", "Female", "Male", "Male", "Female", …
$ birthyear         <dbl> 1987, 1998, 1991, 1990, 1987, 1994, 1991, 1995, 1993…

Structure for file: /content/divvy_data/Divvy_Trips_2020_Q1.csv


Rows: 426887 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 426,887
Columns: 13
$ ride_id            <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4…
$ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
$ started_at         <dttm> 2020-01-21 20:06:59, 2020-01-30 14:22:39, 2020-01-…
$ ended_at           <dttm> 2020-01-21 20:14:30, 2020-01-30 14:26:22, 2020-01-…
$ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Av…
$ start_station_id   <dbl> 239, 234, 296, 51, 66, 212, 96, 96, 212, 38, 117, 1…
$ end_station_name   <chr> "Clark St & Leland Ave", "Southport Ave & Irving Pa…
$ end_station_id     <dbl> 326, 318, 117, 24, 212, 96, 212, 212, 96, 100, 632,…
$ start_lat          <dbl> 41.9665, 41.9616, 41.9401, 41.8846, 41.8856, 41.889…
$ start_lng          <dbl> -87.6884, -87.6660, -87.6455, -87.6319, -87.6418, -…
$ end_lat            <dbl> 41.9671, 41.9542, 41.9402, 41.8918, 41.8899, 41.884…
$ end_lng            <dbl> -87.6674, -87.6644, -87.6530, -87.6206, -87.6343, -…
$ member_casual      <chr> "member", "member", "member", "member", "member", "…


πŸ” PROCESS PHASE

Guiding Questions

What tools are you choosing and why?

I chose Google Colab because it is a cloud-based platform that allows me to run R code without needing to install software on my computer. It supports many useful R packages such as:

These packages are essential for data cleaning, transformation, and visualisation. Google Colab also lets me save my work directly to Google Drive and makes it easy to share notebooks with others, which is convenient for collaboration and accessing my work from anywhere.

Have you ensured your data’s integrity?

Yes, I took steps to verify that the data was imported correctly and consistently from the original CSV files. I checked for:

I also ensured that no personal or sensitive data was included, in compliance with privacy rules.

What steps have you taken to ensure that your data is clean?

Data Cleaning Steps

To prepare the Divvy bike trip dataset for analysis, I implemented a data cleaning process involving these key steps:

  1. Data Consolidation and Standardization:

    • Merged Data: I combined data from multiple CSV files into a single dataframe using read.csv() and bind_rows().
    • Standardized Columns: I ensured consistent column names across all data sources using rename().
    • Adjusted Data Types: I used mutate() to ensure consistent data types for columns like ride_id.
    • Unified Categorical Values: I standardized the β€˜member_casual’ column to β€˜member’ and β€˜casual’ using tolower() and case_when().
  2. Handling Missing and Invalid Data:

    • Calculated Trip Duration: I calculated missing β€˜tripduration’ values using ymd_hms() and difftime().
    • Removed Invalid Rows: I filtered out rows with invalid timestamps and trips starting from β€œHQ QR” using filter().
    • Dropped Irrelevant Columns: I dropped columns with high numbers of missing values using select().

Data Verification

To verify the data quality after cleaning, I took these steps:

Have you documented your cleaning process so you can review and share those results?

Yes. All steps have been written and saved in an R script within Google Colab. I included comments (#) to explain each cleaning step, making it easy to review or share the process with the team or stakeholders.

Key Task: Check the Data for Errors

library(tidyverse)
#  changing the column names.

# Load each dataset
df_q2 <- read_csv("/content/divvy_data/Divvy_Trips_2019_Q2.csv")
df_q3 <- read_csv("/content/divvy_data/Divvy_Trips_2019_Q3.csv")
df_q4 <- read_csv("/content/divvy_data/Divvy_Trips_2019_Q4.csv")
df_q1_2020 <- read_csv("/content/divvy_data/Divvy_Trips_2020_Q1.csv")


# Rename columns in df_q2 to match other datasets
df_q2 <- df_q2 %>%
  rename(
    ride_id = `01 - Rental Details Rental ID`,
    started_at = `01 - Rental Details Local Start Time`,
    ended_at = `01 - Rental Details Local End Time`,
    bikeid = `01 - Rental Details Bike ID`,
    tripduration = `01 - Rental Details Duration In Seconds Uncapped`,
    from_station_id = `03 - Rental Start Station ID`,
    from_station_name = `03 - Rental Start Station Name`,
    to_station_id = `02 - Rental End Station ID`,
    to_station_name = `02 - Rental End Station Name`,
    member_casual = `User Type`,
    gender = `Member Gender`,
    birthyear = `05 - Member Details Member Birthday Year`
  )


# Rename columns in 2020 Q1 to align with the rest
df_q1_2020 <- df_q1_2020 %>%
  rename(
    # ride_id, started_at, ended_at, and member_casual already match
    from_station_name = start_station_name,
    from_station_id = start_station_id,
    to_station_name = end_station_name,
    to_station_id = end_station_id
  ) %>%
  mutate(
    bikeid = NA,
    tripduration = NA,
    gender = NA,
    birthyear = NA
  ) %>%
  select(
    ride_id, started_at, ended_at, bikeid, tripduration,
    from_station_id, from_station_name, to_station_id, to_station_name,
    member_casual, gender, birthyear
  )

# Ensure df_q3 and df_q4 have the same names as the others
df_q3 <- df_q3 %>%
  rename(
    ride_id = trip_id,
    started_at = start_time,
    ended_at = end_time,
    member_casual = usertype
    # the remaining columns already share the same names
  )

df_q4 <- df_q4 %>%
  rename(
    ride_id = trip_id,
    started_at = start_time,
    ended_at = end_time,
    member_casual = usertype
    # the remaining columns already share the same names
  )

# Ensure consistent column types across all dataframes
df_q2 <- df_q2 %>% mutate(ride_id = as.character(ride_id))
df_q3 <- df_q3 %>% mutate(ride_id = as.character(ride_id))
df_q4 <- df_q4 %>% mutate(ride_id = as.character(ride_id))
df_q1_2020 <- df_q1_2020 %>% mutate(ride_id = as.character(ride_id))

# Combine all into one DataFrame
df_all <- bind_rows(df_q2, df_q3, df_q4, df_q1_2020)

Rows: 1108163 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): 03 - Rental Start Station Name, 02 - Rental End Station Name, User...
dbl  (5): 01 - Rental Details Rental ID, 01 - Rental Details Bike ID, 03 - R...
num  (1): 01 - Rental Details Duration In Seconds Uncapped
dttm (2): 01 - Rental Details Local Start Time, 01 - Rental Details Local En...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1640718 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
num  (1): tripduration
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 704054 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): from_station_name, to_station_name, usertype, gender
dbl  (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
num  (1): tripduration
dttm (2): start_time, end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 426887 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# checking missing values / NA columns

# Create a summary table
summary_table <- data.frame(
  Column_Name = names(df_all),
  Rows_with_Values = sapply(df_all, function(x) sum(!is.na(x))),
  Rows_with_NA = sapply(df_all, function(x) sum(is.na(x)))
)

# Print the summary table
summary_table
A data.frame: 12 × 3

Column_Name        Rows_with_Values  Rows_with_NA
<chr>                         <int>         <int>
ride_id                     3879822             0
started_at                  3879822             0
ended_at                    3879822             0
bikeid                      3452935        426887
tripduration                3452935        426887
from_station_id             3879822             0
from_station_name           3879822             0
to_station_id               3879821             1
to_station_name             3879821             1
member_casual               3879822             0
gender                      2913440        966382
birthyear                   2932207        947615
# Now I need to fill the NA values in the tripduration column.
# I can compute them from the start and end times.


library(lubridate)

# Parse timestamps, suppress warnings
df_all <- df_all %>%
  mutate(
    started_at = suppressWarnings(ymd_hms(started_at)),
    ended_at = suppressWarnings(ymd_hms(ended_at))
  )

# Calculating tripduration where missing, but only when timestamps are valid
df_all <- df_all %>%
  mutate(
    tripduration = if_else(
      is.na(tripduration),
      as.numeric(difftime(ended_at, started_at, units = "secs")),
      tripduration
    )
  )

# Removing rows where timestamps are invalid
df_all <- df_all %>%
  filter(!is.na(started_at) & !is.na(ended_at))

# Dropping rows where to_station_id or to_station_name are NA
df_all <- df_all %>%
  filter(!is.na(to_station_id) & !is.na(to_station_name))

# Preview the result before dropping sparse columns
glimpse(df_all)

# Dropping columns with a high number of missing values (birthyear, gender, bikeid)
df_all <- df_all %>%
  select(-c(birthyear, gender, bikeid))
Rows: 3,879,803
Columns: 12
$ ride_id           <chr> "22178529", "22178530", "22178531", "22178532", "221…
$ started_at        <dttm> 2019-04-01 00:02:22, 2019-04-01 00:03:02, 2019-04-0…
$ ended_at          <dttm> 2019-04-01 00:09:48, 2019-04-01 00:20:30, 2019-04-0…
$ bikeid            <dbl> 6251, 6226, 5649, 4151, 3270, 3123, 6418, 4513, 3280…
$ tripduration      <dbl> 446, 1048, 252, 357, 1007, 257, 548, 383, 2137, 2120…
$ from_station_id   <dbl> 81, 317, 283, 26, 202, 420, 503, 260, 211, 211, 304,…
$ from_station_name <chr> "Daley Center Plaza", "Wood St & Taylor St", "LaSall…
$ to_station_id     <dbl> 56, 59, 174, 133, 129, 426, 500, 499, 211, 211, 232,…
$ to_station_name   <chr> "Desplaines St & Kinzie St", "Wabash Ave & Roosevelt…
$ member_casual     <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
$ gender            <chr> "Male", "Female", "Male", "Male", "Male", "Male", "M…
$ birthyear         <dbl> 1975, 1984, 1990, 1993, 1992, 1999, 1969, 1991, NA, …
# checking NA again

# Creating a summary table
summary_table <- data.frame(
  Column_Name = names(df_all),
  Rows_with_Values = sapply(df_all, function(x) sum(!is.na(x))),
  Rows_with_NA = sapply(df_all, function(x) sum(is.na(x)))
)

# Print the summary table
summary_table
A data.frame: 9 × 3

Column_Name        Rows_with_Values  Rows_with_NA
<chr>                         <int>         <int>
ride_id                     3879803             0
started_at                  3879803             0
ended_at                    3879803             0
tripduration                3879803             0
from_station_id             3879803             0
from_station_name           3879803             0
to_station_id               3879803             0
to_station_name             3879803             0
member_casual               3879803             0
# checking member_casual values
table(df_all$member_casual)

    casual   Customer     member Subscriber
     48479     857467     378407    2595450
# Standardising the member_casual column.
# First, force everything to lowercase (to avoid mismatches)
df_all <- df_all %>%
  mutate(member_casual = tolower(member_casual))

# Then, fix the labels properly
df_all <- df_all %>%
  mutate(member_casual = case_when(
    member_casual %in% c("subscriber", "member") ~ "member",
    member_casual %in% c("customer", "casual") ~ "casual",
    TRUE ~ member_casual # in case there are other unknowns
  ))
table(df_all$member_casual)
 casual  member
 905946 2973857
# Average and median tripduration
df_all %>%
  group_by(member_casual) %>%
  summarise(
    avg_trip_min = mean(tripduration) / 60,   # in minutes
    median_trip_min = median(tripduration) / 60,
    max_trip_min = max(tripduration) / 60,
    min_trip_min = min(tripduration) / 60
  )
A tibble: 2 × 5

member_casual  avg_trip_min  median_trip_min  max_trip_min  min_trip_min
<chr>                 <dbl>            <dbl>         <dbl>         <dbl>
casual             58.96409        25.666667      156450.4   -9.20000000
member             14.16328         9.816667      150943.9    0.01666667
# Looking for the negative-duration rows causing the problem
negative_duration_trips <- df_all %>%
  filter(tripduration < 0)

head(negative_duration_trips)  # Check the first few rows to understand the issue
A tibble: 6 × 10

ride_id           started_at           ended_at             tripduration  from_station_id  from_station_name  to_station_id  to_station_name  member_casual  ride_length
<chr>             <dttm>               <dttm>                      <dbl>            <dbl>  <chr>                      <dbl>  <chr>            <chr>                <dbl>
9461DFF13D8BA8AD  2020-02-28 10:09:43  2020-02-28 10:09:42            -1              675  HQ QR                        675  HQ QR            casual                  -1
794264334D168D05  2020-03-18 11:43:09  2020-03-18 11:42:58           -11              675  HQ QR                        675  HQ QR            casual                 -11
3CC0D87545CE223B  2020-03-19 09:43:05  2020-03-19 09:42:51           -14              675  HQ QR                        675  HQ QR            casual                 -14
32CAF6E2E7AB67CE  2020-03-12 10:00:46  2020-03-12 10:00:36           -10              675  HQ QR                        675  HQ QR            casual                 -10
13BCB74BAFDADB21  2020-03-27 14:59:51  2020-03-27 14:58:18           -93              675  HQ QR                        675  HQ QR            casual                 -93
4EF350E51E6037AF  2020-03-19 09:37:17  2020-03-19 09:37:04           -13              675  HQ QR                        675  HQ QR            casual                 -13
# Add a "ride_length" column to df_all (time difference between end and start, in seconds)
df_all$ride_length <- difftime(df_all$ended_at, df_all$started_at)

# Inspecting the structure of df_all
str(df_all)

# Checking if ride_length is a factor (less likely, but good check)
is.factor(df_all$ride_length)

# Converting ride_length to a simple numeric value (seconds) using the original method
df_all$ride_length <- as.numeric(as.character(df_all$ride_length))

# Verifying that ride_length in df_all is now numeric
is.numeric(df_all$ride_length)
# Converting "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(df_all$ride_length)
df_all$ride_length <- as.numeric(as.character(df_all$ride_length))

# Removing test rides from the "HQ QR" station and trips with negative duration
df_all_2 <- df_all[!(df_all$from_station_name == "HQ QR" | df_all$tripduration < 0), ]
# Removing trips longer than 100 minutes (6000 seconds)
df_all_2 <- df_all_2[df_all_2$ride_length <= 6000, ]


tibble [3,879,803 Γ— 10] (S3: tbl_df/tbl/data.frame)
 $ ride_id          : chr [1:3879803] "22178529" "22178530" "22178531" "22178532" ...
 $ started_at       : POSIXct[1:3879803], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
 $ ended_at         : POSIXct[1:3879803], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
 $ tripduration     : num [1:3879803] 446 1048 252 357 1007 ...
 $ from_station_id  : num [1:3879803] 81 317 283 26 202 420 503 260 211 211 ...
 $ from_station_name: chr [1:3879803] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
 $ to_station_id    : num [1:3879803] 56 59 174 133 129 426 500 499 211 211 ...
 $ to_station_name  : chr [1:3879803] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
 $ member_casual    : chr [1:3879803] "member" "member" "member" "member" ...
 $ ride_length      : 'difftime' num [1:3879803] 446 1048 252 357 ...
  ..- attr(*, "units")= chr "secs"

FALSE

TRUE


negative_duration_trips <- df_all_2 %>%
  filter(tripduration < 0)

head(negative_duration_trips)  # Checking negative values
A tibble: 0 × 10

ride_id  started_at  ended_at  tripduration  from_station_id  from_station_name  to_station_id  to_station_name  member_casual  ride_length
<chr>    <dttm>      <dttm>    <dbl>         <dbl>            <chr>              <dbl>          <chr>            <chr>          <dbl>


📊 ANALYZE PHASE

Guiding Questions

How should you organize your data to perform analysis?

I began by aggregating the trip data by member type, day of the week, and trip duration. I made sure the dataset was structured in a way that allowed easy comparisons between casual and member users. This included adding columns for ride length, start and end day, and ensuring datetime formats were consistent.
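As a minimal, self-contained sketch of that aggregation (the `trips` tibble below is illustrative stand-in data, not the real dataset; its column names follow the cleaned `df_all_2` frame):

```r
library(dplyr)
library(lubridate)

# Illustrative stand-in for the cleaned data (same column names as df_all_2)
trips <- tibble(
  member_casual = c("member", "member", "casual", "casual"),
  started_at = ymd_hms(c("2019-07-01 08:00:00", "2019-07-06 10:00:00",
                         "2019-07-06 11:00:00", "2019-07-07 12:00:00")),
  ride_length = c(600, 900, 3600, 5400)  # seconds
)

# Aggregate by user type and weekday: ride counts and mean duration in minutes
weekly_summary <- trips %>%
  mutate(day_of_week = wday(started_at, label = TRUE, week_start = 1)) %>%
  group_by(member_casual, day_of_week) %>%
  summarise(
    rides = n(),
    avg_ride_min = mean(ride_length) / 60,
    .groups = "drop"
  )

print(weekly_summary)
```

The same pattern applied to the full `df_all_2` frame produces the per-weekday comparison used in the summary below.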

Has your data been properly formatted?

Yes. I cleaned the dataset by removing duplicates, dropping rows with missing values (like station IDs), fixing inconsistent user-type values (e.g., combining β€œsubscriber” and β€œcustomer” into β€œmember” or β€œcasual”), and converting duration values to a usable numeric format.
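A hedged sketch of that cleaning pass; the raw labels are assumed to appear as "Subscriber" and "Customer" (capitalisation may differ in the source files):

```r
library(dplyr)

df_all_2 <- df_all_2 %>%
  distinct() %>%                                   # drop exact duplicate rows
  filter(!is.na(from_station_id),
         !is.na(to_station_id)) %>%                # drop rows missing station IDs
  mutate(
    member_casual = recode(member_casual,
                           "Subscriber" = "member",
                           "Customer"   = "casual"),
    tripduration  = as.numeric(tripduration)       # ensure a numeric duration
  )
```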

What surprises did you discover in the data?

At first, I found unexpected labels in the member_casual column like β€œsubscriber” and β€œcustomer,” which weren’t supposed to be there. Also, the large drop in member usage on weekends and the spike in casual usage during weekends were more pronounced than I anticipated.

I discovered a strong contrast between member and casual users: members dominate weekday riding with short, commute-length trips, while casual users concentrate on weekends with trips more than twice as long.

How will these insights help answer your business questions?

They pinpoint when and where casual riders are most active (weekend afternoons at lakefront leisure stations), which tells the marketing team when and where to promote membership offers.

Key Tasks Completed:

Deliverable:

Summary of My Analysis

Cyclistic Case Study – Member vs. Casual User Analysis


1. Analysis of β€œMember” Users

Total Trips – Weekdays (Mon–Fri)

Weekday Total: 2,417,932 trips


Total Trips – Weekends (Sat–Sun)

Weekend Total: 555,917 trips


Overall Total Trips (β€œMember”)

2,417,932 + 555,917 = 2,973,849 trips


Usage Percentage

Weekdays account for about 81.3% of member trips (2,417,932 of 2,973,849); weekends account for about 18.7%.

Percentage Variation (Weekend vs. Weekdays)

Average daily member trips drop from about 483,586 on weekdays (2,417,932 / 5) to about 277,959 on weekends (555,917 / 2).

Conclusion:

There is an approximate 42.53% decrease in the average daily usage by members on weekends compared to weekdays.


2. Analysis of β€œCasual” Users

Total Trips – Weekdays (Mon–Fri)

Weekday Total: 511,342 trips


Total Trips – Weekends (Sat–Sun)

Weekend Total: 390,833 trips


Overall Total Trips (β€œCasual”)

511,342 + 390,833 = 902,175 trips


Usage Percentage

Weekdays account for about 56.7% of casual trips (511,342 of 902,175); weekends account for about 43.3%.

Percentage Variation (Weekend vs. Weekdays)

Average daily casual trips rise from about 102,268 on weekdays (511,342 / 5) to about 195,417 on weekends (390,833 / 2).

Conclusion:

There is an approximate 91.08% increase in the average daily usage by casual users on weekends compared to weekdays.
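The weekday-versus-weekend arithmetic for both groups can be reproduced with a few lines of R (totals taken from the tallies above):

```r
# Percentage change in average daily rides, weekend vs. weekday
avg_change <- function(weekday_total, weekend_total) {
  weekday_daily <- weekday_total / 5   # Mon-Fri
  weekend_daily <- weekend_total / 2   # Sat-Sun
  100 * (weekend_daily - weekday_daily) / weekday_daily
}

avg_change(2417932, 555917)  # member: about -42.5% (weekend drop)
avg_change(511342, 390833)   # casual: about +91.1% (weekend jump)
```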


Summary of the Analysis

Trip Counts:

Members logged 2,973,849 trips in total; casual users logged 902,175.

Approximate Average Trip Duration:

Casual rides average about 1,790 seconds (~30 minutes); member rides average about 750 seconds (~12.5 minutes).
library(dplyr)
library(lubridate)

# Summarise the number of rides by day of week for member vs. casual users
df_all_2 <- df_all_2 %>%
  mutate(day_of_week = weekdays(started_at))

df_all_2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(
    number_of_rides = n(),
    .groups = 'drop'
  ) %>%
  arrange(day_of_week, member_casual)
A tibble: 14 Γ— 3
member_casualday_of_weeknumber_of_rides
<chr><chr><int>
casualFriday 122404
memberFriday 452787
casualMonday 103296
memberMonday 472196
casualSaturday 209541
memberSaturday 287956
casualSunday 181298
memberSunday 267968
casualThursday 102675
memberThursday 484176
casualTuesday 90510
memberTuesday 508445
casualWednesday 92457
memberWednesday500328
# Now, let's run the average ride time by each day for members vs casual users
aggregate(df_all_2$ride_length ~ df_all_2$member_casual + df_all_2$day_of_week, FUN = mean)
A data.frame: 14 Γ— 3
df_all_2$member_casualdf_all_2$day_of_weekdf_all_2$ride_length
<chr><chr><dbl>
casualFriday 1744.0536
memberFriday 727.4415
casualMonday 1788.1691
memberMonday 731.9720
casualSaturday 1906.8800
memberSaturday 826.1389
casualSunday 1873.2639
memberSunday 829.2541
casualThursday 1677.3405
memberThursday 733.4064
casualTuesday 1676.0194
memberTuesday 731.6989
casualWednesday1669.3694
memberWednesday 734.5123
# Ensure you're filtering on ride_length itself:
df_all_2 <- df_all_2[df_all_2$ride_length >= 0, ]

# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year ... before completing
# Add columns for date, month, day, and year
df_all_2$date <- as.Date(df_all_2$started_at)  # The default format is yyyy-mm-dd
df_all_2$month <- format(as.Date(df_all_2$date), "%m")
df_all_2$day <- format(as.Date(df_all_2$date), "%d")
df_all_2$year <- format(as.Date(df_all_2$date), "%Y")
df_all_2$day_of_week <- format(as.Date(df_all_2$date), "%A")
summary(df_all_2$ride_length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1     412     712    1479    1289 9387024
# Compare members and casual users
aggregate(df_all_2$ride_length ~ df_all_2$member_casual, FUN = mean)
aggregate(df_all_2$ride_length ~ df_all_2$member_casual, FUN = median)
aggregate(df_all_2$ride_length ~ df_all_2$member_casual, FUN = max)
aggregate(df_all_2$ride_length ~ df_all_2$member_casual, FUN = min)

A data.frame: 2 Γ— 2
df_all_2$member_casualdf_all_2$ride_length
<chr><dbl>
casual1790.3611
member 749.7632
A data.frame: 2 Γ— 2
df_all_2$member_casualdf_all_2$ride_length
<chr><dbl>
casual1457
member 589
A data.frame: 2 Γ— 2
df_all_2$member_casualdf_all_2$ride_length
<chr><dbl>
casual6000
member6000
A data.frame: 2 Γ— 2
df_all_2$member_casualdf_all_2$ride_length
<chr><dbl>
casual2
member1
# Extract Month and Year (or create a Month/Year column)
df_all_2 <- df_all_2 %>%
  mutate(
    start_month = month(started_at, label = TRUE, abbr = FALSE), # Month name
    start_year = year(started_at),
    year_month = floor_date(started_at, "month") # Date truncated to the start of the month
   )

# Count rides per Month/Year and user type
monthly_usage <- df_all_2 %>%
  group_by(member_casual, year_month) %>%
  summarise(ride_count = n(), .groups = 'drop') %>%
  arrange(member_casual, year_month) # Sort for line plots

# Count rides per Month (aggregating all years - for pure seasonality)
seasonal_usage <- df_all_2 %>%
  group_by(member_casual, start_month) %>%
  summarise(average_daily_rides_month = n() / length(unique(start_year)), # avg rides in this calendar month per year of data
            total_rides_month = n(),
           .groups = 'drop')
           # Print seasonal summary (optional)
print("--- Total Ride Count per Month (Aggregated) ---")
print(seasonal_usage %>% select(member_casual, start_month, total_rides_month) %>% arrange(member_casual, start_month))
cat("\n")
[1] "--- Total Ride Count per Month (Aggregated) ---"
# A tibble: 24 Γ— 3
   member_casual start_month total_rides_month
   <chr>         <ord>                   <int>
 1 casual        January                  7292
 2 casual        February                11539
 3 casual        March                   23023
 4 casual        April                   44686
 5 casual        May                     75940
 6 casual        June                   121491
 7 casual        July                   162829
 8 casual        August                 173325
 9 casual        September              120956
10 casual        October                 67261
# β„Ή 14 more rows
# Derive the start hour, then count rides per hour and user type
df_all_2 <- df_all_2 %>%
  mutate(start_hour = hour(started_at))

hourly_usage <- df_all_2 %>%
  group_by(member_casual, start_hour) %>%
  summarise(ride_count = n(), .groups = 'drop')

# Find the peak hour(s) for each user type
peak_hours <- hourly_usage %>%
  group_by(member_casual) %>%
  slice_max(order_by = ride_count, n = 1) # Gets the hour with the most rides

# Print the found peak hours
print("--- Peak Hours (Hour with most rides) ---")
print(peak_hours)
cat("\n")
[1] "--- Peak Hours (Hour with most rides) ---"
# A tibble: 2 Γ— 3
# Groups:   member_casual [2]
  member_casual start_hour ride_count
  <chr>              <int>      <int>
1 casual                17      82785
2 member                17     390808
# --- Analysis 3: Most Popular Start/End Stations ---

# Define how many top stations to show
top_n_stations <- 10

# Most popular START stations
top_start_stations <- df_all_2 %>%
  group_by(member_casual, from_station_name) %>%
  summarise(start_count = n(), .groups = 'drop') %>%
  group_by(member_casual) %>% # Group again to get top N for each type
  slice_max(order_by = start_count, n = top_n_stations) %>%
  arrange(member_casual, desc(start_count))

# Most popular END stations
top_end_stations <- df_all_2 %>%
  group_by(member_casual, to_station_name) %>%
  summarise(end_count = n(), .groups = 'drop') %>%
  group_by(member_casual) %>%
  slice_max(order_by = end_count, n = top_n_stations) %>%
  arrange(member_casual, desc(end_count))

# (Optional) Most popular START-END station pairs
top_station_pairs <- df_all_2 %>%
  group_by(member_casual, from_station_name, to_station_name) %>%
  summarise(pair_count = n(), .groups = 'drop') %>%
  # Filter out trips starting and ending at the same station (optional)
  # filter(from_station_name != to_station_name) %>%
  group_by(member_casual) %>%
  slice_max(order_by = pair_count, n = top_n_stations) %>%
  arrange(member_casual, desc(pair_count))


# Print the station results
print(paste("--- Top", top_n_stations, "Most Popular START Stations ---"))
print(top_start_stations)
cat("\n")

print(paste("--- Top", top_n_stations, "Most Popular END Stations ---"))
print(top_end_stations)
cat("\n")

print(paste("--- Top", top_n_stations, "Most Popular START-END Pairs ---"))
print(top_station_pairs)
cat("\n")
[1] "--- Top 10 Most Popular START Stations ---"
# A tibble: 20 Γ— 3
# Groups:   member_casual [2]
   member_casual from_station_name            start_count
   <chr>         <chr>                              <int>
 1 casual        Streeter Dr & Grand Ave            50001
 2 casual        Lake Shore Dr & Monroe St          37340
 3 casual        Shedd Aquarium                     20035
 4 casual        Millennium Park                    19802
 5 casual        Michigan Ave & Oak St              19746
 6 casual        Lake Shore Dr & North Blvd         18298
 7 casual        Theater on the Lake                14654
 8 casual        Dusable Harbor                     11926
 9 casual        Adler Planetarium                  11460
10 casual        Michigan Ave & Washington St       11288
11 member        Canal St & Adams St                51908
12 member        Clinton St & Madison St            46144
13 member        Clinton St & Washington Blvd       43556
14 member        Columbus Dr & Randolph St          31016
15 member        Franklin St & Monroe St            30960
16 member        Kingsbury St & Kinzie St           30887
17 member        Daley Center Plaza                 30586
18 member        Canal St & Madison St              26141
19 member        Michigan Ave & Washington St       24833
20 member        LaSalle St & Jackson Blvd          22866

[1] "--- Top 10 Most Popular END Stations ---"
# A tibble: 20 Γ— 3
# Groups:   member_casual [2]
   member_casual to_station_name              end_count
   <chr>         <chr>                            <int>
 1 casual        Streeter Dr & Grand Ave          62939
 2 casual        Lake Shore Dr & Monroe St        28651
 3 casual        Millennium Park                  23517
 4 casual        Lake Shore Dr & North Blvd       22263
 5 casual        Michigan Ave & Oak St            22019
 6 casual        Theater on the Lake              18047
 7 casual        Shedd Aquarium                   16033
 8 casual        Michigan Ave & Washington St     11560
 9 casual        Adler Planetarium                10048
10 casual        Dusable Harbor                    9001
11 member        Canal St & Adams St              48793
12 member        Clinton St & Washington Blvd     47608
13 member        Clinton St & Madison St          44251
14 member        Daley Center Plaza               30821
15 member        Kingsbury St & Kinzie St         30384
16 member        Franklin St & Monroe St          27648
17 member        Michigan Ave & Washington St     27593
18 member        Canal St & Madison St            25088
19 member        Clark St & Elm St                23421
20 member        Larrabee St & Kingsbury St       22019

[1] "--- Top 10 Most Popular START-END Pairs ---"
# A tibble: 20 Γ— 4
# Groups:   member_casual [2]
   member_casual from_station_name            to_station_name         pair_count
   <chr>         <chr>                        <chr>                        <int>
 1 casual        Lake Shore Dr & Monroe St    Streeter Dr & Grand Ave       9060
 2 casual        Lake Shore Dr & Monroe St    Lake Shore Dr & Monroe…       7590
 3 casual        Streeter Dr & Grand Ave      Streeter Dr & Grand Ave       7470
 4 casual        Michigan Ave & Oak St        Michigan Ave & Oak St         4301
 5 casual        Shedd Aquarium               Streeter Dr & Grand Ave       3441
 6 casual        Streeter Dr & Grand Ave      Millennium Park               3042
 7 casual        Streeter Dr & Grand Ave      Lake Shore Dr & Monroe…       2760
 8 casual        Millennium Park              Streeter Dr & Grand Ave       2653
 9 casual        Millennium Park              Millennium Park               2512
10 casual        Dusable Harbor               Streeter Dr & Grand Ave       2483
11 member        Canal St & Adams St          Michigan Ave & Washing…       3542
12 member        Michigan Ave & Washington St Clinton St & Washingto…       2716
13 member        Columbus Dr & Randolph St    Clinton St & Washingto…       2507
14 member        Michigan Ave & Washington St Canal St & Adams St           2370
15 member        Canal St & Madison St        Michigan Ave & Washing…       2262
16 member        Columbus Dr & Randolph St    Canal St & Adams St           2185
17 member        Clinton St & Washington Blvd Michigan Ave & Washing…       2184
18 member        Columbus Dr & Randolph St    State St & Randolph St        2130
19 member        Wacker Dr & Washington St    Michigan Ave & Washing…       1994
20 member        Loomis St & Lexington St     Morgan St & Polk St           1954

Divvy Project Phase 5 Share Image

πŸ“£ SHARE PHASE

Unpacking Cyclistic’s Dual Ecosystem

Guiding Questions

Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?

Yes β€” and the difference is crystal clear. We uncovered two distinct user profiles with contrasting behaviors:

What story does your data tell?

Cyclistic isn’t a single service β€” it’s two services in one.

These insights point to a dual-ecosystem inside Cyclistic β€” one utilitarian, the other recreational. To serve both, we must adapt our operations and marketing to their unique rhythms.

What story do peak hours tell?

Both user groups peak at 5 PM, reflecting the evening commute for members and late-afternoon leisure rides for casual users.

What day-of-week patterns emerged?

While Saturday is the ride champion for both groups, the type of ride couldn’t be more different: long joyrides for casuals vs. brief chores or errands for members. The Monday–Friday period is clearly dominated by the member crowd, emphasizing their workweek rhythm.

| Day | Casual Avg Ride (sec) | Member Avg Ride (sec) | Casual Total Rides | Member Total Rides |
|---|---|---|---|---|
| Saturday | 1906.9 | 826.1 | 209,541 | 287,956 |
| Sunday | 1873.3 | 829.3 | 181,298 | 267,968 |
| Friday | 1744.1 | 727.4 | 122,404 | 452,787 |
| Monday | 1788.2 | 732.0 | 103,296 | 472,196 |
| … (and other weekdays) | … | … | … | … |
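The day-of-week figures above come from a grouped summary; a minimal sketch using the columns defined earlier in this notebook:

```r
library(dplyr)

day_summary <- df_all_2 %>%
  mutate(weekday = weekdays(started_at)) %>%
  group_by(member_casual, weekday) %>%
  summarise(
    avg_ride_sec = mean(as.numeric(ride_length)),  # average trip length in seconds
    total_rides  = n(),
    .groups = "drop"
  ) %>%
  mutate(avg_ride_min = round(avg_ride_sec / 60, 1)) %>%  # convert for readability
  arrange(weekday, member_casual)
```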

What is the best way to communicate with your audience?

For stakeholders in marketing and product, clarity is key. That’s why I’ve designed visualizations that tell the story at a glance:

These visuals translate data into strategy cues β€” where to focus campaigns, how to time promotions, and what product features matter.

Can data visualization help?

Absolutely β€” it’s essential. It bridges the gap between complex data and quick understanding. For instance:

Key Tasks Completed:

Deliverable: Visualizations and Key Findings

I’ll export my visuals (e.g., plots of weekday vs. weekend usage by user type, average trip duration by user type, heatmaps of popular stations) and summarize the main conclusions alongside them.
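A hedged sketch of that export step using `ggplot2::ggsave()`; the plot object names (`p_weekday`, `p_duration`, `p_stations`) are illustrative placeholders, not names from this notebook:

```r
library(ggplot2)

# Save each plot at a consistent size and resolution for the report
ggsave("rides_by_weekday.png", plot = p_weekday,  width = 8, height = 5, dpi = 300)
ggsave("avg_duration.png",     plot = p_duration, width = 8, height = 5, dpi = 300)
ggsave("top_stations.png",     plot = p_stations, width = 8, height = 6, dpi = 300)
```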

🧭 Final Insight:

β€œTwo riders pedal through the same city β€” one chasing the clock, the other chasing the moment.” To thrive, Cyclistic must embrace both. We now understand when, how, and why each type rides β€” and that empowers us to design smarter services, targeted campaigns, and better experiences.

# Let's visualize the number of rides by rider type
df_all_2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length),
            .groups = "drop") %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

Number of Rides by Day of Week and User Type

# Count rides per hour and user type
hourly_usage <- df_all_2 %>%
  group_by(member_casual, start_hour) %>%
  summarise(ride_count = n(), .groups = 'drop')

# Find the peak hour(s) for each user type
peak_hours <- hourly_usage %>%
  group_by(member_casual) %>%
  slice_max(order_by = ride_count, n = 1) # Gets the hour with the most rides

# Visualize the hourly distribution
ggplot(hourly_usage, aes(x = start_hour, y = ride_count, fill = member_casual)) +
  geom_col(position = "dodge") + # Bars side-by-side
  # Or use facet_wrap for separate plots:
  # geom_col() +
  # facet_wrap(~member_casual, ncol = 1, scales = "free_y") +
  labs(
    title = "Distribution of Rides by Hour of Day",
    subtitle = "Comparison between Member and Casual users",
    x = "Hour of Day (Ride Start)",
    y = "Number of Rides",
    fill = "User Type"
  ) +
  scale_x_continuous(breaks = seq(0, 23, by = 1)) + # Ticks for every hour
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Ride Count by Month



# --- Analysis 2: Seasonality and Monthly Trends ---

# Extract Month and Year (or create a Month/Year column)
df_all_2 <- df_all_2 %>%
  mutate(
    start_month = month(started_at, label = TRUE, abbr = FALSE), # Month name
    start_year = year(started_at),
    year_month = floor_date(started_at, "month") # Date truncated to the start of the month
   )

# Count rides per Month/Year and user type
monthly_usage <- df_all_2 %>%
  group_by(member_casual, year_month) %>%
  summarise(ride_count = n(), .groups = 'drop') %>%
  arrange(member_casual, year_month) # Sort for line plots

# Count rides per Month (aggregating all years - for pure seasonality)
seasonal_usage <- df_all_2 %>%
  group_by(member_casual, start_month) %>%
  summarise(average_daily_rides_month = n() / length(unique(start_year)), # avg rides in this calendar month per year of data
            total_rides_month = n(),
           .groups = 'drop')


# Visualize Monthly Trends Over Time
ggplot(monthly_usage, aes(x = year_month, y = ride_count, color = member_casual, group = member_casual)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Monthly Ride Trend Over Time",
    subtitle = "Comparison between Member and Casual users",
    x = "Month/Year",
    y = "Total Number of Rides",
    color = "User Type"
  ) +
  # year_month is POSIXct (datetime), not Date, so use scale_x_datetime
  scale_x_datetime(date_breaks = "3 months", date_labels = "%Y-%m") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Visualize Seasonality (Average Annual Pattern)
ggplot(seasonal_usage, aes(x = start_month, y = total_rides_month, color = member_casual, group = member_casual)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Average Ride Seasonality",
    subtitle = "Total rides per month (aggregated across all years)",
    x = "Month",
    y = "Total Number of Rides in Month",
    color = "User Type"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Monthly Ride Trend Over Time

Average Ride Seasonality



# --- Analysis 3: Most Popular Start/End Stations ---

# Define how many top stations to show
top_n_stations <- 10

# Most popular START stations
top_start_stations <- df_all_2 %>%
  group_by(member_casual, from_station_name) %>%
  summarise(start_count = n(), .groups = 'drop') %>%
  group_by(member_casual) %>% # Group again to get top N for each type
  slice_max(order_by = start_count, n = top_n_stations) %>%
  arrange(member_casual, desc(start_count))

# Most popular END stations
top_end_stations <- df_all_2 %>%
  group_by(member_casual, to_station_name) %>%
  summarise(end_count = n(), .groups = 'drop') %>%
  group_by(member_casual) %>%
  slice_max(order_by = end_count, n = top_n_stations) %>%
  arrange(member_casual, desc(end_count))

# (Optional) Most popular START-END station pairs
top_station_pairs <- df_all_2 %>%
  group_by(member_casual, from_station_name, to_station_name) %>%
  summarise(pair_count = n(), .groups = 'drop') %>%
  # Filter out trips starting and ending at the same station (optional)
  # filter(from_station_name != to_station_name) %>%
  group_by(member_casual) %>%
  slice_max(order_by = pair_count, n = top_n_stations) %>%
  arrange(member_casual, desc(pair_count))

# Visualisation of Top Stations (Example for Start Stations)
 ggplot(top_start_stations, aes(x = reorder(from_station_name, start_count), y = start_count, fill = member_casual)) +
   geom_col() +
   facet_wrap(~member_casual, scales = "free_y", ncol = 1) + # Separate plots
   coord_flip() + # Flip axes for better readability of names
   labs(
     title = paste("Top", top_n_stations, "Start Stations by User Type"),
     x = "Start Station",
     y = "Number of Rides Started",
     fill = "User Type"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Top Start Stations by User Type

Divvy Project Phase 6 Act Image

🎯 ACT PHASE

🚲 Cyclistic: A Dual-Track System – Insights & Strategic Recommendations

Guiding Conclusion: Our analysis reveals that Cyclistic serves two distinct user ecosystems with almost opposite patterns:

Conclusion: We don’t have β€œa user”—we have two. They need different pricing, different messaging, different bike placementβ€”and a different vision.

Applying Our Insights: Strategic Actions

Here’s what the numbers are screaming:

πŸ›  Product & Pricing Enhancements

πŸ“ˆ Data-Informed Campaigns & Targeted Messaging

πŸ”§ Dynamic Bike Redistribution

πŸ₯‡ My Top Three Recommendations for Stakeholders:

  1. Flexible Memberships: Casuals ride ~195k times/day on weekends. Monthly/seasonal offers tap into this segment’s potential without long-term pressure.
  2. Targeted Messaging for Leisure Users: Casual trips average ~30 minutes, more than double the members' ~12.5 minutes. They seek joyrides, not just commutes. Speak their language and promote campaigns like “City Circles”.
  3. Dynamic Bike Redistribution: Weekdays average ~484k member rides/day; weekends average ~195k casual rides/day. Don’t leave bikes in the wrong places. Let the data steer the fleet.

🧳 My Final Words for Stakeholders

This is more than data. It’s a compass. Cyclistic isn’t a one-size-fits-all service. It’s a dual-track systemβ€”and our strategies must reflect that. With smart segmentation, seasonal offers, and targeted rebalancing, we don’t just growβ€”we evolve.

Let’s turn insight into impact. πŸ’ΌπŸ’₯