Data Science and Statistical Inference in R: A First Attempt
Our journey in completing a hypothesis test from start to finish.
Hello!
I recently embarked on a long, gruelling journey (i.e. a term in university) to learn about statistical inference through the lens of a data scientist. We familiarized ourselves with the data science workflow while building skills working in R, a programming language for statistical computation. My team put together this project to put our knowledge to the test and showcase what we learned.
Check out our full project on GitHub, and the incredible group of contributors that made it happen: github.com/dliviya/STAT201-Dwarf_Star_Class..
The Dataset
We will be using the "Star dataset to predict star types" dataset from Kaggle.
The original dataset contains six features plus the target variable:
| Variable | Description |
| --- | --- |
| Absolute Temperature (K) | Temperature of the star in Kelvin |
| Relative Luminosity (L/Lo) | Luminosity of the star divided by the average luminosity of the sun (3.828 x 10^26 Watts) |
| Relative Radius (R/Ro) | Radius of the star divided by the average radius of the sun (6.9551 x 10^8 m) |
| Absolute Magnitude (Mv) | Absolute magnitude of the star |
| Star Color | Color of the star |
| Spectral Class (O, B, A, F, G, K, M) | Type of main sequence star, if it is a main sequence star |
| Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, SuperGiants, HyperGiants) | Type of star |
Analysis Question
In this project, we will determine if there is a difference in luminosity and temperature between red, brown, and white dwarf stars.
Stars can be classified into four different groups: white dwarfs, main sequence, giants, and supergiants. In the dataset, we also find two other types of stars: red dwarfs and brown dwarfs. Despite sharing similar names, dwarfs are not typically classified together. Naturally, we would want to explore the differences between the three dwarf star types.
We can do so by completing the following steps, adapted from the standard data science workflow:
- Cleaning and Wrangling Data
- Exploratory Data Analysis
- Hypothesis Testing
- Interpreting Results
Cleaning and Wrangling Our Data
As always, we first load the R packages that we will be using.
Among these are the tidyverse and infer packages, which are popular for data science and statistical inference, along with digest and broom, which help keep everything tidy and easy to work with.
library(tidyverse)
library(infer)
library(gridExtra)
library(testthat)
library(digest)
library(broom)
Next, we read our CSV file into a variable we will be calling stars. Note that we renamed the CSV file after downloading it from Kaggle.
stars <- read.csv("6-class.csv")
glimpse(stars)
glimpse gives a compact preview of each column and its first few values, and reveals that there are 240 rows in total.
Since we are treating the target variable Star.type as a category of stars, we need to change its type from int to factor using the as_factor function.
stars <- mutate(stars, Star.type = as_factor(Star.type))
Finally, we filter out the giant stars (types 3, 4, and 5) to isolate the dwarfs. Summarising the resulting data set into dwarf_count reveals that each star type has exactly 40 samples.
dwarfs <- stars %>%
filter(Star.type != 5) %>%
filter(Star.type != 4) %>%
filter(Star.type != 3)
dwarf_count <- dwarfs %>%
group_by(Star.type)%>%
summarise(n = n())
Exploratory Data Analysis
EDA refers to the initial investigation of the dataset to determine its main characteristics, spot any potential anomalies, and make sure that the question we want answered actually pertains to this dataset. This will help us determine what tests to conduct and check that our dataset fulfills the assumptions/conditions needed for that test to work properly.
In our case, we found earlier that there are 240 data points in our dataset. This is considered a large sample, which means the Central Limit Theorem takes effect: the sampling distribution of the mean is approximately normal even if the underlying data are not. Approximate normality is a condition of many statistical methods.
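As a quick illustration (not part of the original analysis), here is a small base-R sketch of the Central Limit Theorem at work: even when a population is clearly skewed, the means of repeated samples pile up in a roughly normal, bell-shaped curve.

```r
# Sketch: Central Limit Theorem with a skewed (exponential) population.
set.seed(123)
population <- rexp(10000, rate = 1)    # strongly right-skewed, mean near 1
sample_means <- replicate(1000, mean(sample(population, size = 40)))

# The sampling distribution of the mean is approximately normal and
# centred near the population mean, even though the raw data are not.
hist(sample_means, breaks = 30, main = "Sampling distribution of the mean")
```

Samples of size 40 per group, as in our dataset, are generally considered large enough for this effect to kick in.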
Boxplots
Visualizations are a good way to add to our analysis. One way we can begin is by creating boxplots to show the variation in luminosity and temperature across star types. This will give us a rough idea of the differences between each star type and tell us if any groups are vastly different from the rest.
Here we are using ggplot and specifying geom_boxplot.
options(repr.plot.width = 10, repr.plot.height = 8)
L_box <- ggplot(dwarfs, aes(x = Star.type, y = Luminosity.L.Lo.)) + geom_boxplot() +
ggtitle("Figure 1b: Luminosity vs Star Type") + xlab("Star Type") + ylab("Luminosity (Lo)")
T_box <- ggplot(dwarfs, aes(x = Star.type, y = Temperature..K.)) + geom_boxplot() +
ggtitle("Figure 1a: Temp. vs Star Type") + xlab("Star Type") + ylab("Temperature (K)")
grid.arrange(T_box, L_box, ncol=2, widths = c(4,4))
Above we can see the distributions of Luminosity vs. Star Type and Temperature vs. Star Type.
Without going too deeply into statistical terms, we can clearly see that Type 2 looks very different from Type 0 and Type 1 in Figure 1a. Specifically, its interquartile range, represented by the box, is much wider than the other two. Figure 1b shows a similar trend, with a wider box for Type 1.
To further examine the within-group distributions, we will once again use filter to separate the data into three parts, one for each star type.
type_0 <- filter(stars, Star.type == 0) #brown dwarfs
type_1 <- filter(stars, Star.type == 1) #red dwarfs
type_2 <- filter(stars, Star.type == 2) #white dwarfs
Histograms
Another way to view our data is to use ggplot to plot histograms by specifying geom_histogram.
# Brown dwarf histogram
options(repr.plot.width = 10, repr.plot.height = 8)
L_0 <- ggplot(type_0, aes(x = Luminosity.L.Lo.))+ xlab("Luminosity (L/Lo)") + geom_histogram(bins = 20) + ggtitle("Figure 2b: Luminosity distribution of Brown Dwarf")
T_0 <- ggplot(type_0, aes(x = Temperature..K.))+ xlab("Temperature (K)") + geom_histogram(bins = 20) + ggtitle("Figure 2a: Temp. distribution of Brown Dwarf")
grid.arrange(T_0, L_0, ncol=2,widths = c(4,4))
# White dwarf histogram
options(repr.plot.width = 10, repr.plot.height = 8)
L_2 <- ggplot(type_2, aes(x = Luminosity.L.Lo.))+ xlab("Luminosity (L/Lo)") + geom_histogram(bins = 20)+ ggtitle("Figure 4b: Luminosity distribution of White Dwarf")
T_2 <- ggplot(type_2, aes(x = Temperature..K.))+ xlab("Temperature (K)") + geom_histogram(bins = 20)+ ggtitle("Figure 4a: Temp. distribution of White Dwarf")
grid.arrange(T_2, L_2, ncol=2)
Paying attention to the temperature histograms, we can see that the distribution of the brown dwarf is skewed, with more data points to the right of the graph, while the white dwarf is skewed to the left. This suggests that the two dwarfs may have different means. The red dwarf is omitted here, but the process is the same.
Brown Dwarf Distribution
White Dwarf Distribution
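For completeness, the omitted red dwarf histograms follow the exact same pattern. A sketch (assuming the type_1 data frame and the libraries loaded earlier; the figure numbers 3a/3b are assumed to fill the gap between Figures 2 and 4):

```r
# Red dwarf histogram (same pattern as the other two types; assumes type_1,
# ggplot2, and gridExtra from earlier in the script)
options(repr.plot.width = 10, repr.plot.height = 8)
L_1 <- ggplot(type_1, aes(x = Luminosity.L.Lo.)) + xlab("Luminosity (L/Lo)") +
  geom_histogram(bins = 20) + ggtitle("Figure 3b: Luminosity distribution of Red Dwarf")
T_1 <- ggplot(type_1, aes(x = Temperature..K.)) + xlab("Temperature (K)") +
  geom_histogram(bins = 20) + ggtitle("Figure 3a: Temp. distribution of Red Dwarf")
grid.arrange(T_1, L_1, ncol = 2)
```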
Great - we now have a sense of the distribution within our three groups. For the second part of our EDA we will compute some parameter estimates.
Computing Parameter Estimates
Parameters are descriptive measures of our population in question, which would be all the stars in the universe. Because our goal is to determine whether there exists a difference between the means, it would be really nice to know the true means of all the brown, red, and white dwarf stars in the universe. Unfortunately, we almost never have the resources to obtain that kind of data, which is why we compute parameter estimates, in other words, sample statistics.
The process for each dwarf type is identical; the only difference is whether we pass type_0, type_1, or type_2 to the summarise function. Although we could have done this in a loop, we simply repeated the process for the sake of clarity in our report.
brown_dwarf_stats <- type_0 %>%
summarise(mean_lum = mean(Luminosity.L.Lo.),
sd_lum = sd(Luminosity.L.Lo.),
mean_temp = mean(Temperature..K.),
sd_temp = sd(Temperature..K.)) %>%
add_column(Type = "Brown", .before = "mean_lum")
red_dwarf_stats <- type_1 %>%
summarise(mean_lum = mean(Luminosity.L.Lo.),
sd_lum = sd(Luminosity.L.Lo.),
mean_temp = mean(Temperature..K.),
sd_temp = sd(Temperature..K.)) %>%
add_column(Type = "Red", .before = "mean_lum")
white_dwarf_stats <- type_2 %>%
summarise(mean_lum = mean(Luminosity.L.Lo.),
sd_lum = sd(Luminosity.L.Lo.),
mean_temp = mean(Temperature..K.),
sd_temp = sd(Temperature..K.)) %>%
add_column(Type = "White", .before = "mean_lum")
These three code blocks produce three different tables, each with one row and five columns. Using the rbind function, we can combine them into a single table.
parameter_estimates <- rbind(brown_dwarf_stats, red_dwarf_stats, white_dwarf_stats)
parameter_estimates
Above is a summary of all the statistics for each star type.
For the luminosity feature, the mean and standard deviation of the brown dwarfs are smaller than those of the red and white dwarfs by a power of 10.
For the temperature feature, the mean of the white dwarfs is larger than that of the red and brown dwarfs by a power of 10.
These differences are consistent with our alternative hypothesis and suggest that the method we are going to use next is appropriate.
Hypothesis Testing: ANOVA
ANOVA (Analysis of Variance) is a statistical method used for hypothesis testing that determines whether there exists a difference between the means of three or more groups. It works by comparing the variance between the groups to the variance within them, producing a ratio known as the F-value, along with an important probability known as the p-value.
- null hypothesis: the commonly accepted fact. In this case, the null hypothesis is that the means of all three dwarf star groups are equal to each other.
- alternative hypothesis: the opposite of the null hypothesis. In this case, that there is a difference in luminosity and temperature between red, brown, and white dwarf stars.
- F-value: the variation between the sample means divided by the variation within the samples.
- p-value: the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true.
It is important to note that ANOVA only tells us whether at least one mean differs from the others. The null hypothesis states that all three means are equal to each other; to reject it, it is not necessary for all three to differ from each other.
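To make the F-value concrete, here is a small self-contained sketch (with made-up data, not our stars) that computes it by hand from the between-group and within-group variation:

```r
# Hand-computed F-value for three hypothetical groups of 40 observations each
set.seed(42)
groups <- list(a = rnorm(40, mean = 10),
               b = rnorm(40, mean = 12),
               c = rnorm(40, mean = 11))
k <- length(groups)          # number of groups
n <- sum(lengths(groups))    # total observations
grand_mean <- mean(unlist(groups))

# Between-group variation (df = k - 1)
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))
# Within-group variation (df = n - k)
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))
f_value
```

This is exactly the ratio that aov computes for us in the next section.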
Temperature
To begin, we select the columns that we want: Temperature..K., a numeric variable and the target of the current test, and Star.type, an explanatory factor specifying the three dwarf star groups.
dwarfs_temp <- dwarfs %>%
select(Temperature..K., Star.type)
This snippet performs the actual ANOVA test and prints the results in a tidy table.
anova_temp <- aov(Temperature..K.~Star.type, data = dwarfs_temp)%>%
tidy()
print("Table 10: ANOVA results for Temperature")
print(anova_temp)
Finally, we filter and select the information we want: the statistic (our F-value) and the p.value.
f_stat_temp <- anova_temp %>%
filter(term == "Star.type") %>%
select(statistic) %>%
as.numeric()
anova_pval_temp <- anova_temp %>%
filter(term == "Star.type") %>%
select(p.value) %>%
as.numeric()
print(f_stat_temp)
print(anova_pval_temp)
Now we have completed the ANOVA test on the temperature feature and obtained the F-value = 188.91, and p-value = 2.74e-37.
Luminosity
To perform the ANOVA test on the luminosity feature, we repeat the same steps as above, this time using Luminosity.L.Lo. instead of Temperature..K.. The following code block condenses all the steps from the previous section so you can get a better sense of the workflow.
dwarfs_lumi <- dwarfs %>%
select(Luminosity.L.Lo., Star.type)
anova_lumi <- aov(Luminosity.L.Lo.~Star.type, data = dwarfs_lumi)%>%
tidy()
f_stat_lumi <- anova_lumi %>%
filter(term == "Star.type") %>%
select(statistic) %>%
as.numeric()
anova_pval_lumi <- anova_lumi%>%
filter(term == "Star.type") %>%
select(p.value) %>%
as.numeric()
print(f_stat_lumi)
print(anova_pval_lumi)
Having completed the ANOVA test on the luminosity feature, we obtain an F-value of 5.09 and a p-value of 0.0076.
Results
Run this code to display the results obtained from both tests together...
table <- data.frame(Type = c("F-Value", "p-Value"),
Temperature_Result = c(f_stat_temp, anova_pval_temp),
Luminosity_Result = c(f_stat_lumi, anova_pval_lumi))
table
And voilà - the final results of our hypothesis tests! All that remains now is interpreting these results.
Interpreting Results
Temperature Results
The F-value is 188.91 and the p-value is 2.74e-37. An F-value significantly greater than 1 means that the variation between the sample means is large relative to the variation within the samples. From a statistical standpoint, this indicates that at least one of the mean temperatures is indeed different across the three groups.
Luminosity Results
The F-value obtained for luminosity is 5.09, and the p-value is 0.0076. We get a much smaller F-value this time, but it is still clearly larger than 1. Even though the difference in luminosity across the groups is not as great as what we found for temperature, we can still confidently say that at least one mean value differs from the others.
Our Conclusion
The results of our ANOVA tests indicate that there exist differences between the three types of dwarfs in terms of temperature and luminosity. It is common in statistics to select a p-value of 0.05 as the "cut-off" point of statistical significance. Since both our p-values are smaller than 0.05, we reject the null hypothesis: the probability of observing results like ours if the null hypothesis were true is so slim that we favour the alternative hypothesis instead.
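The decision rule boils down to a one-line comparison (shown here with the p-values reported above standing in for the variables computed earlier):

```r
# Reject the null hypothesis when the p-value falls below the cut-off
alpha <- 0.05
anova_pval_temp <- 2.74e-37   # from the temperature test above
anova_pval_lumi <- 0.0076     # from the luminosity test above
anova_pval_temp < alpha
anova_pval_lumi < alpha
```

Both comparisons come back TRUE, so we reject the null hypothesis for both features.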
We are now relatively confident that the three types of dwarfs do not have the same average temperature and luminosity.
Wrapping Up
And that's it - we have now successfully completed a full hypothesis test and report using R.
The steps we took are definitely not foolproof, but you can read up on our potential sources of error, future directions for this project, and more statistical mumbo-jumbo in our full project on GitHub.
Thanks for joining me!