Gradient Phylogeny: Map Occurrence Data With RStudio

Jan 5, 2026 by Andrew McMorgan

Hey guys! Ever found yourself staring at a beautiful phylogenetic tree, wishing you could overlay some cool occurrence data onto it? You know, like mapping where different species in your tree have been spotted in museum collections or across geographical locations? It sounds awesome, right? Well, it is awesome, but getting it to work can sometimes feel like wrestling a grumpy badger. I've been there, and let me tell you, the error messages can be a real mood killer. But don't you worry, because today we're diving deep into how to add occurrence data as a gradient on a phylogeny using RStudio. We'll break down the common pitfalls and get your trees looking chef's kiss with all that juicy occurrence data visualized. So, grab your coffee, settle in, and let's make those phylogenies sing with information!

Understanding Your Data: Phylogeny and Occurrence

Alright, first things first, let's get our heads around what we're actually working with. You've got a phylogeny, which is basically a family tree for species or genes. It shows how different things are related through evolutionary history. Think of it like your own family tree, but way, way older and a lot more branchy. On the other hand, you have your occurrence data. This is where things get exciting! This data typically comes from museum collections, field surveys, or any source that tells you where a specific species or taxon has been found. It's usually in a CSV file, which is super handy. Each row might represent a unique observation, with columns for the species name (which needs to match your phylogeny!), latitude, longitude, or perhaps a count of occurrences in a particular region. The goal here is to visualize this occurrence data as a gradient on your phylogeny. Imagine each tip of your tree having a color intensity that represents, say, the number of times that species was recorded in a specific area, or the diversity of its geographical range. This adds a whole new layer of biological insight to your tree, showing evolutionary relationships alongside ecological or geographical patterns. It's like giving your phylogenetic tree a superpower – it doesn't just show relationships, it shows where and how often these evolutionary lineages have interacted with the world.

Why Visualize Occurrence Data on a Phylogeny?

So, why bother with this whole process, you ask? Great question, guys! Visualizing occurrence data as a gradient on a phylogeny isn't just about making pretty pictures (though they are pretty cool). It's about unlocking deeper biological narratives. For instance, if you have a species complex that's spread across different continents, mapping its occurrences can reveal patterns of geographical expansion or isolation. You might see that a particular clade is heavily concentrated in a specific region, suggesting an endemism hotspot or a period of rapid diversification linked to that locale. Furthermore, occurrence data can reflect ecological niche evolution. By plotting where species live, you can start to infer environmental factors that might have driven their diversification. Are certain lineages thriving in arid environments while others dominate tropical rainforests? Your occurrence data, mapped onto the phylogeny, can provide strong clues. This approach is invaluable for studying biogeography, understanding speciation events, and even informing conservation strategies. If a critically endangered species has a very restricted occurrence, seeing that clearly on its phylogenetic branch is a powerful conservation message. It helps us prioritize areas for protection and understand the evolutionary history that led to its current status. In short, combining phylogeny with occurrence data transforms a static evolutionary diagram into a dynamic map of life's history and present distribution. It allows us to ask and answer questions that neither dataset could answer alone, making it a truly powerful tool in the modern biologist's arsenal.

Common Challenges When Mapping Occurrence Data

Now, let's talk turkey, or rather, let's talk about the bugs in the system. Many of us who have tried to add occurrence data as a gradient on a phylogeny have hit a wall, and it usually involves a rather unfriendly error message.

One of the biggest headaches is mismatched names. Your occurrence data CSV might have species names like "Homo sapiens" (with a space), while your phylogeny's tip labels might be "Homo_sapiens" or "H_sapiens". R compares these strings exactly! Mismatched names, extra spaces, or different capitalization can lead to your data not being matched to any tip or, worse, if you bind values by position rather than by name, being assigned to the wrong tips on the tree.

Another common stumbling block is data structure. Sometimes, your occurrence data might have multiple entries per species (e.g., from different collection sites). How do you summarize this into a single gradient value for each tip? Do you average the occurrences? Sum them? Take the maximum? The method you choose to aggregate your occurrence data before mapping it can significantly impact the visualization and interpretation.

Then there's the issue of package compatibility and function arguments. Several R packages work with phylogenies (like ggtree, ape, phytools, castor), and each has its own way of handling trees, annotations, and gradients. What works seamlessly in one might throw cryptic errors in another, often due to subtle differences in how they expect the tree or annotation data to be formatted. Misinterpreting a function's arguments or using an outdated function can lead to a cascade of errors.

Finally, missing data is another culprit. If a species in your phylogeny doesn't have any corresponding occurrence data, how should that be represented? Should it be a blank spot, a default color, or should it throw an error? Handling these missing values gracefully is crucial for a clean and informative visualization. It's these little details, guys, that often turn a quick visualization task into an all-day debugging session!
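
To make the name-matching problem concrete, here is a minimal sketch in base R of how CSV names could be cleaned up before matching. The helper normalize_name is hypothetical (it is not part of any package), and it only handles stray whitespace and the space-versus-underscore difference; abbreviations like "H_sapiens" or capitalization differences would need their own lookup.

```r
# Hypothetical helper: convert "Genus species" style names to "Genus_species".
normalize_name <- function(x) {
  x <- trimws(x)                     # drop leading and trailing whitespace
  x <- gsub("[[:space:]]+", "_", x)  # collapse each run of inner whitespace into one underscore
  x
}

# Made-up examples of messy names as they might appear in a CSV
csv_names <- c("Homo sapiens", " Pan troglodytes", "Gorilla  gorilla ")
normalize_name(csv_names)
```

You would apply something like this to the species column before comparing it with the tree's tip labels.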

Step-by-Step Guide: Mapping Occurrence Data with `ggtree`

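Alright, let's roll up our sleeves and get this done using one of the most popular and flexible tools for phylogenetic visualization in R: the ggtree package. It's built on top of ggplot2, so if you're familiar with that, you're already halfway there! First, you need to make sure you have the necessary packages installed. Note that ggtree lives on Bioconductor rather than CRAN, so it is installed through BiocManager. If you don't have the packages yet, just run:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ggtree") # ggtree is distributed through Bioconductor, not CRAN
install.packages("ape") # For reading tree files
install.packages("dplyr") # For data manipulation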

Now, let's load them up:

library(ggtree)
library(ape)
library(dplyr)

Next, we need our data. Let's assume you have your phylogenetic tree in a Newick format file (e.g., my_tree.tre) and your occurrence data in a CSV file (e.g., occurrence_data.csv). The CSV should have at least two columns: one for the species name (let's call it species) and one for the value you want to map as a gradient (e.g., num_occurrences). Crucially, the species names in your CSV must match the tip labels in your tree.
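
If you don't have files at hand yet, a toy pair can be generated to try the workflow. This is only a sketch: the three-tip tree and the counts are invented for illustration, and the files are written to a temporary directory (the names toy_dir, tree_file, and csv_file are my own), so you would point read.tree and read.csv at those paths instead of the working directory.

```r
# Toy inputs for trying the workflow; the counts are invented for illustration.
toy_dir   <- tempdir()
tree_file <- file.path(toy_dir, "my_tree.tre")
csv_file  <- file.path(toy_dir, "occurrence_data.csv")

# A three-tip tree in Newick format, with underscores in the tip labels
writeLines("((Homo_sapiens:1,Pan_troglodytes:1):1,Gorilla_gorilla:2);", tree_file)

# One row per species: the name column plus the value to map as a gradient
toy_occ <- data.frame(
  species = c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"),
  num_occurrences = c(120, 45, 18)
)
write.csv(toy_occ, csv_file, row.names = FALSE)

# Read the CSV back to confirm the layout
read.csv(csv_file)
```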

Let's load the tree and the data. We'll use read.tree from the ape package for the tree and read.csv for the data.

# Load the tree
tree <- read.tree("my_tree.tre")

# Load the occurrence data
occ_data <- read.csv("occurrence_data.csv")

# Inspect the data (optional but recommended!)
print(head(tree$tip.label))
print(head(occ_data))
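
Eyeballing head() output only gets you so far, so it can help to check the two name sets against each other directly. Below is a minimal base R sketch; compare_labels is a hypothetical helper, and the two character vectors are made-up stand-ins for tree$tip.label and the species column of occ_data.

```r
# Hypothetical helper: report names that appear on only one side.
compare_labels <- function(tip_labels, species) {
  list(
    tips_without_data = setdiff(tip_labels, species),  # tips with no matching CSV row
    data_without_tips = setdiff(species, tip_labels)   # CSV names absent from the tree
  )
}

# Stand-ins for tree$tip.label and occ_data$species
tip_labels <- c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla")
species    <- c("Homo_sapiens", "Pan_troglodytes", "Pongo_abelii")

res <- compare_labels(tip_labels, species)
res
```

With the real objects, the call would be compare_labels(tree$tip.label, occ_data$species). Two empty vectors mean the names line up.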

Before we can plot, we need to ensure our data is aligned with the tree. The order of rows in the CSV doesn't have to match the order of tips in the tree, because ggtree's %<+% operator matches annotation rows to tips by name. What it does need is a data frame whose first column contains the tip labels. Let's prepare our occ_data.

# Make sure species names in occ_data match tree tip labels exactly.
# Let's assume 'species' column in occ_data needs to be renamed to match the tree's tip labels.
# If your tree tip labels are like 'Genus_species', and your CSV has 'Genus species',
# you might need to clean them up. For this example, let's assume they match or can be easily coerced.

# Rename the species column to 'label' and move it to the first column;
# ggtree's %<+% operator matches the first column against the tip labels
occ_data <- occ_data %>%
  rename(label = species) %>% # Adjust 'species' if your column name is different
  relocate(label)

# Now, ensure the data frame has the correct structure for ggtree
# We want to associate 'num_occurrences' with each 'label'
# If you have multiple occurrences per species and want to summarize, do it here.
# For simplicity, let's assume occ_data already has one row per species with a summarized value.

# If you have multiple entries per species and want to sum occurrences:
# occ_data <- occ_data %>%
#   group_by(label) %>%
#   summarise(num_occurrences = sum(num_occurrences), .groups = "drop")
# This keeps the column name 'num_occurrences' that the plotting code below expects.

# Let's proceed assuming occ_data is ready with a 'label' column and a 'num_occurrences' column.
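
The choice of summary and the handling of tips without data are easier to see on a toy example. The sketch below uses base R only, so it runs without extra packages; the records are invented, and the object names (raw, by_sum, tips, aligned) are my own. It compares three summaries, then left-joins the result onto the full set of tip labels so that a tip with no records gets an explicit NA rather than silently disappearing.

```r
# Toy raw records: several rows per species (values invented for illustration).
raw <- data.frame(
  label = c("Homo_sapiens", "Homo_sapiens", "Pan_troglodytes"),
  num_occurrences = c(10, 30, 5)
)

# One row per species under three different summaries
by_sum  <- aggregate(num_occurrences ~ label, data = raw, FUN = sum)
by_mean <- aggregate(num_occurrences ~ label, data = raw, FUN = mean)
by_max  <- aggregate(num_occurrences ~ label, data = raw, FUN = max)

# Stand-in for data.frame(label = tree$tip.label)
tips <- data.frame(label = c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla"))

# Left join: keep every tip, filling NA where no occurrence rows exist
aligned <- merge(tips, by_sum, by = "label", all.x = TRUE)
aligned
```

Whether an NA should stay NA (no data) or become 0 (searched for but never recorded) is a biological judgment call, not a coding one. If you keep the NAs, ggplot2's continuous color scales take an na.value argument that controls how they are drawn.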

Now for the plotting magic! We'll use ggtree to plot the tree, attach occ_data to it with the %<+% operator, and then draw the occurrence data as a gradient using geom_tippoint with a color mapping.

# Plot the tree and add the occurrence data as a gradient
p <- ggtree(tree, layout = "rectangular") %<+% occ_data + # Attach occ_data by tip label; other layouts include "fan" and "circular"
     geom_tippoint(aes(color = num_occurrences), size = 4) + # Map color to num_occurrences
     scale_color_gradient(low = "blue", high = "red", name = "Occurrences") + # Define the gradient colors and legend title
     theme_tree2() # Tree theme that adds an x-axis scale for branch lengths

# Display the plot
print(p)

In this code, the %<+% operator attaches occ_data to the tree by matching its first column against the tip labels; without it, the plot has no num_occurrences column to draw from. geom_tippoint is used to draw points at the tips of the tree. The aes(color = num_occurrences) tells ggtree to color these points based on the num_occurrences column from our occ_data. scale_color_gradient defines the color scale, mapping low values to blue and high values to red. You can customize low, high, and the name of the legend. theme_tree2() adds an x-axis scale so branch lengths can be read off the plot. Remember to replace `