Example Analysis

Analysis and visualizations of the palmerpenguins dataset

This analysis uses the palmerpenguins dataset to explore how species, body mass, and bill dimensions relate. It offers a concise, reproducible example of data wrangling, visualization, and interpretation for the web.

Research question

What are the morphological differences in bill length, depth, and body mass among penguin species, and how do these traits vary across islands?

Intended audience

This page is aimed at students, educators, and data science enthusiasts learning data exploration and visualization in R.

Data source

I used the palmerpenguins dataset hosted by the palmerpenguins project:

  • Data (CSV): https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

Data dictionary

The data dictionary, containing full variable descriptions, can be found here: https://rpubs.com/rich_i/dd_pp

Key variables in this dataset include:

  • species: species of penguin (Adelie, Chinstrap, Gentoo)
  • island: island name
  • bill_length_mm: bill length (mm)
  • bill_depth_mm: bill depth (mm)
  • flipper_length_mm: flipper length (mm)
  • body_mass_g: body mass (g)
  • sex: sex of the penguin (male/female)
  • year: the year of study (2007, 2008, and 2009)

This is a quick reminder about the research question and the data source.

  • Question: morphological differences in species and variance by island.
  • Data: palmerpenguins CSV (linked above).

Data wrangling and inspection

We load four tidyverse packages (readr, dplyr, tidyr, and ggplot2) to import, clean, and reshape the data using functions such as read_csv, filter, select, mutate, arrange, group_by, summarize, and drop_na.

Note: Data and reproducibility

The data is loaded directly from the project’s CSV URL, allowing this page to render reproducibly without relying on the palmerpenguins package.
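One way to make the import step more explicit is to declare the column types instead of letting readr guess them, which is what the readr message in the output on this page suggests. The sketch below is only an illustration and not part of the original analysis: the names penguins_url and peng_typed are placeholders, and the types simply mirror the column specification that readr reports for this file.

```r
library(readr)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

# Declare every column type up front so the import does not depend on guessing
peng_typed <- read_csv(
    penguins_url,
    col_types = cols(
        species           = col_character(),
        island            = col_character(),
        bill_length_mm    = col_double(),
        bill_depth_mm     = col_double(),
        flipper_length_mm = col_double(),
        body_mass_g       = col_double(),
        sex               = col_character(),
        year              = col_double()
    )
)

# Rows and columns; the readr message on this page reports 344 rows and 8 columns
dim(peng_typed)
```

With col_types supplied, readr no longer prints the column specification message, so the rendered page stays a little quieter.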

# load libraries
library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)

# read data from raw CSV
peng <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Wrangle: keep key variables, drop rows with missing values,
# derive body_mass_kg and bill_ratio, and sort by species and island
peng_clean <- peng %>%
    select(species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex) %>%
    drop_na(bill_length_mm, bill_depth_mm, body_mass_g, sex) %>%
    mutate(body_mass_kg = body_mass_g / 1000,
           bill_ratio = bill_length_mm / bill_depth_mm) %>%
    # all three species are kept; this filter only guards against unexpected labels
    filter(species %in% c("Adelie", "Chinstrap", "Gentoo")) %>%
    arrange(species, island)

# Quick summary table by species
summary_by_species <- peng_clean %>%
    group_by(species) %>%
    summarize(n = n(),
              mean_mass_g = mean(body_mass_g),
              sd_mass_g = sd(body_mass_g),
              mean_bill_length = mean(bill_length_mm))

summary_by_species
# A tibble: 3 × 5
  species       n mean_mass_g sd_mass_g mean_bill_length
  <chr>     <int>       <dbl>     <dbl>            <dbl>
1 Adelie      146       3706.      459.             38.8
2 Chinstrap    68       3733.      384.             48.8
3 Gentoo      119       5092.      501.             47.6
Tip: Quick findings (preview)

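Overall, Gentoo penguins are the heaviest species on average (about 5.1 kg, versus about 3.7 kg for both Adelie and Chinstrap). Gentoo and Chinstrap penguins both have longer bills than Adelie penguins, with Chinstrap slightly longer on average (48.8 mm versus 47.6 mm).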
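The research question also asks how these traits vary across islands, which the species-level table does not show. A minimal sketch of one way to extend the summary is given below, grouping by species and island and averaging several traits at once with across(). The names penguins_url and summary_by_species_island are placeholders, and the block repeats the import and drop_na step so that it runs on its own.

```r
library(readr)
library(dplyr)
library(tidyr)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

summary_by_species_island <- read_csv(penguins_url, show_col_types = FALSE) %>%
    # same missing-value rule as the wrangling step on this page
    drop_na(bill_length_mm, bill_depth_mm, body_mass_g, sex) %>%
    group_by(species, island) %>%
    summarize(
        n = n(),
        # one mean per trait, named mean_<column>
        across(c(bill_length_mm, bill_depth_mm, body_mass_g),
               ~ mean(.x),
               .names = "mean_{.col}"),
        .groups = "drop"
    )

summary_by_species_island
```

Only species and island combinations that actually occur in the data appear as rows, so the table also shows which islands each species was recorded on.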

Visualizations

The three visualizations below each use a different geom_*() function and include a title, subtitle, caption, and readable axis labels. One of them is faceted.

1. Scatter: bill length vs bill depth (geom_point)

ggplot(peng_clean, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
    geom_point(alpha = 0.7) +
    labs(title = "Bill dimensions by species",
         subtitle = "Bill length vs bill depth",
         caption = "Data: Palmer Penguins (raw CSV)",
         x = "Bill length (mm)",
         y = "Bill depth (mm)") +
    theme_minimal()
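To put a number on the pattern in this scatterplot, one option is to compare the pooled correlation between bill length and bill depth with the correlation inside each species. The sketch below is an illustration rather than part of the original analysis. The names penguins_url, peng_bills, overall_r, and r_by_species are placeholders, and the import and drop_na step are repeated so that the block runs on its own.

```r
library(readr)
library(dplyr)
library(tidyr)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

peng_bills <- read_csv(penguins_url, show_col_types = FALSE) %>%
    drop_na(bill_length_mm, bill_depth_mm, body_mass_g, sex)

# Pearson correlation with all penguins pooled together
overall_r <- cor(peng_bills$bill_length_mm, peng_bills$bill_depth_mm)

# The same correlation computed separately within each species
r_by_species <- peng_bills %>%
    group_by(species) %>%
    summarize(r = cor(bill_length_mm, bill_depth_mm))

overall_r
r_by_species
```

If the pooled and per-species values differ in sign, the species grouping is driving the overall trend, which is one more reason to color the points by species.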

2. Boxplots: body mass by species (geom_boxplot) with faceting by island

ggplot(peng_clean, aes(x = species, y = body_mass_g, fill = species)) +
    geom_boxplot(alpha = 0.9) +
    facet_wrap(~ island) +
    labs(title = "Body mass distribution by species and island",
         subtitle = "Boxplots of body mass (g), faceted by island",
         caption = "Faceted by island to show local differences",
         x = "Species",
         y = "Body mass (g)") +
    theme_minimal() +
    theme(legend.position = "none")
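Where a species has no observations on an island, the shared x-axis leaves an empty slot in that panel. A possible variation, sketched below under the assumption that a per-panel axis is acceptable, is to pass scales = "free_x" to facet_wrap so that each panel lists only the species present there. The names penguins_url, peng_facets, and p_free are placeholders.

```r
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

peng_facets <- read_csv(penguins_url, show_col_types = FALSE) %>%
    drop_na(bill_length_mm, bill_depth_mm, body_mass_g, sex)

# scales = "free_x" lets each island panel keep only the species observed there
p_free <- ggplot(peng_facets, aes(x = species, y = body_mass_g, fill = species)) +
    geom_boxplot(alpha = 0.9) +
    facet_wrap(~ island, scales = "free_x") +
    labs(x = "Species",
         y = "Body mass (g)") +
    theme_minimal() +
    theme(legend.position = "none")

p_free
```

The trade-off is that box widths and positions are no longer directly comparable across panels, so the fixed-axis version above remains the better choice when the gaps themselves are the point.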

3. Bar chart: counts by species and sex (geom_bar)

ggplot(peng_clean, aes(x = species, fill = sex)) +
    geom_bar(position = "dodge") +
    labs(title = "Counts of penguins by species and sex",
         subtitle = "Simple count of observations in the dataset",
         caption = "Note: counts reflect available non-missing sex values",
         x = "Species",
         y = "Count") +
    theme_minimal()
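The counts behind this bar chart can also be shown as a small table. The sketch below is one possible way to do that, using count() from dplyr and pivot_wider() from tidyr to put one column per sex. The names penguins_url and sex_counts are placeholders, and the import and drop_na step are repeated so the block runs on its own.

```r
library(readr)
library(dplyr)
library(tidyr)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

sex_counts <- read_csv(penguins_url, show_col_types = FALSE) %>%
    drop_na(bill_length_mm, bill_depth_mm, body_mass_g, sex) %>%
    # one row per species and sex, with the count in column n
    count(species, sex) %>%
    # reshape to one row per species and one column per sex
    pivot_wider(names_from = sex, values_from = n)

sex_counts
```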

Note: Additional plot incorporated from: https://github.com/allisonhorst/palmerpenguins/blob/main/vignettes/intro.Rmd

4. Scatterplot: penguin flipper length versus body mass

ggplot(peng_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, 
                 shape = species),
             size = 2) +
  scale_color_manual(values = c("purple4","gray51","gold1")) 

The figures below are cartoon illustrations of the Palmer penguins by Allison Horst, taken from vignette("art").

Palmer Penguins

Meet the Penguins

Additional resources

  • Full dataset documentation: https://allisonhorst.github.io/palmerpenguins/articles/intro.html

Functions used

Below are the key functions used from each package for clarity and reproducibility:

  • readr: read_csv

  • dplyr: filter, select, arrange, mutate, group_by, summarize

  • tidyr: drop_na

  • ggplot2: geom_point, geom_boxplot, geom_bar, facet_wrap, labs

Summary and conclusions

This brief analysis shows that Gentoo penguins have the greatest body mass on average, while Chinstrap and Gentoo penguins both have markedly longer bills than Adelie penguins. Faceting by island highlights subtle local differences, suggesting possible ecological influences. The bill length–depth scatterplot reveals partial species separation with some overlap, pointing to the value of multivariate approaches like PCA. Overall, the page illustrates simple, reproducible steps for data wrangling, visualization, and quick exploratory analysis.
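As a pointer for readers who want to try the multivariate approach mentioned above, here is a minimal PCA sketch using prcomp() from base R on the four body measurements. It is not part of the original analysis. The names penguins_url, peng_num, pca, and scores are placeholders, and the choice to center and scale all four variables is an assumption made because they are measured in different units.

```r
library(readr)
library(dplyr)
library(tidyr)

# URL copied from the Data source section (placeholder variable name)
penguins_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

# Keep only rows where all four measurements are present
peng_num <- read_csv(penguins_url, show_col_types = FALSE) %>%
    drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# Center and scale so that grams and millimeters contribute comparably
pca <- prcomp(
    select(peng_num, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g),
    center = TRUE,
    scale. = TRUE
)

# Proportion of variance explained by each component
summary(pca)

# Component scores with the species label attached, ready for plotting
scores <- bind_cols(select(peng_num, species), as_tibble(pca$x))
head(scores)
```

Plotting PC1 against PC2 from scores, colored by species, would be a natural next step for checking how well the species separate.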

References

This page includes citations for the dataset and core software, listed in the bibliography section.