Data Science and Statistical Inference in R: A First Attempt

Our journey in completing a hypothesis test from start to finish.

Hello!

I recently embarked on a long, gruelling journey (i.e. a term in university) to learn about statistical inference through the lens of a data scientist. We familiarized ourselves with the data science workflow while building skills working in R, a programming language for statistical computation. My team put together this project to put our knowledge to the test and showcase what we learned.

Check out our full project on GitHub, and the incredible group of contributors that made it happen: github.com/dliviya/STAT201-Dwarf_Star_Class..

The Dataset

We will be using the "Star dataset to predict star types" dataset from Kaggle.
The original dataset contains six features along with the target variable, Star Type:

  • Absolute Temperature (K): Temperature of the star in Kelvin
  • Relative Luminosity (L/Lo): Luminosity of the star divided by the average luminosity of the sun (3.828 x 10^26 Watts)
  • Relative Radius (R/Ro): Radius of the star divided by the average radius of the sun (6.9551 x 10^8 m)
  • Absolute Magnitude (Mv): Absolute magnitude of the star
  • Star Color: Color of the star
  • Spectral Class (O, B, A, F, G, K, M): Type of main sequence star, if it is a main sequence star
  • Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Supergiants, Hypergiants): Type of star

Analysis Question

In this project, we will determine if there is a difference in luminosity and temperature between red, brown, and white dwarf stars.

Stars can be classified into four different groups: white dwarfs, main sequence, giants, and supergiants. In the dataset, we also find two other types of stars: red dwarfs and brown dwarfs. Despite sharing similar names, dwarfs are not typically classified together. Naturally, we would want to explore the differences between the three dwarf star types.

[Figure: the data science workflow]

We can do so by completing the following steps, adapted from the workflow above:

  • Cleaning and Wrangling Data
  • Exploratory Data Analysis
  • Hypothesis Testing
  • Interpreting Results

Cleaning and Wrangling Our Data

As always, we first load the R packages that we will be using.

Among these are the tidyverse and infer packages, which are popular for data science and statistical inference. We also load gridExtra to arrange plots side by side, and digest and broom to help keep everything tidy and easy to work with.

library(tidyverse)
library(infer)
library(gridExtra)
library(testthat)
library(digest)
library(broom)

Next, we read our CSV file into a variable we will be calling stars. Note that we renamed the CSV file after downloading it from Kaggle.

stars <- read.csv("6-class.csv")
glimpse(stars)

glimpse prints a compact preview of every column and reveals that there are 240 rows in total.

Since we are treating the target variable Star.type as a category of stars, we need to change its type from int to factor using the as_factor function.

stars <- mutate(stars, Star.type = as_factor(Star.type))

Finally, we filter out the non-dwarf stars (types 3, 4, and 5) to isolate the dwarfs. Try using glimpse once again on the resulting summary table dwarf_count and you will find that each dwarf type has exactly 40 samples.

dwarfs <- stars %>%
    filter(Star.type != 5) %>%
    filter(Star.type != 4) %>%
    filter(Star.type != 3)

dwarf_count <- dwarfs %>%
    group_by(Star.type) %>%
    summarise(n = n())
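
As a side note, the three chained filter calls above can be collapsed into a single one using %in%. This is just an equivalent alternative, not what our original notebook used; because Star.type is now a factor, %in% compares against the level labels.

# Equivalent one-step filter: keep only the dwarf types (0, 1, and 2)
dwarfs <- stars %>%
    filter(Star.type %in% c(0, 1, 2))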

Exploratory Data Analysis

EDA refers to the initial investigation of the dataset to determine its main characteristics, spot any potential anomalies, and make sure that the question we want answered actually pertains to this dataset. This will help us determine what tests to conduct and check that our dataset fulfills the assumptions/conditions needed for that test to work properly.

In our case, we found earlier that there are 240 data points in the full dataset, 40 for each of the three dwarf types we kept. These sample sizes are large enough for the Central Limit Theorem to take effect: the sampling distributions of the group means are approximately normal, which satisfies the normality condition that many statistical methods rely on.
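
Although it was not part of our original analysis, one way to see this in action is to resample the data and look at the distribution of the resulting means. The sketch below uses the infer package we already loaded; the seed is an arbitrary choice for reproducibility.

# Illustrative sketch: bootstrap distribution of the mean dwarf temperature.
# With 120 dwarf observations, this distribution should look roughly bell-shaped.
set.seed(2022)  # arbitrary seed, purely for reproducibility
boot_means <- dwarfs %>%
    specify(response = Temperature..K.) %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "mean")
visualize(boot_means)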


Boxplots

Visualizations are a good way to add to our analysis. One way we can begin is by creating boxplots to show the variation in luminosity and temperature across star types. This will give us a rough idea of the differences between each star type, and tell us if any group is vastly different from the rest.

Here we are using ggplot and specifying geom_boxplot.

options(repr.plot.width = 10, repr.plot.height = 8) 
L_box <- ggplot(dwarfs, aes(x = Star.type, y = Luminosity.L.Lo.)) + geom_boxplot() + 
        ggtitle("Figure 1b: Luminosity vs Star Type") + xlab("Star Type") + ylab("Luminosity (Lo)")
T_box <- ggplot(dwarfs, aes(x = Star.type, y = Temperature..K.)) + geom_boxplot() +  
        ggtitle("Figure 1a: Temp. vs Star Type") +  xlab("Star Type") + ylab("Temperature (K)")
grid.arrange(T_box, L_box, ncol=2, widths = c(4,4))

[Figure 1: boxplots of temperature (1a) and luminosity (1b) by star type]

Above we can see the distributions of Luminosity vs. Star Type and Temperature vs. Star Type.

Without going too deeply into statistical terms, we can clearly see that Type 2 looks very different from Type 0 and Type 1 in Figure 1a. Specifically, its interquartile range, represented by the box, is much wider than that of the other two. Figure 1b shows a similar trend, with a wider Type 1.

To examine the within-group distribution of each star type, we will once again use filter to split the data into three parts, one for each dwarf type.

type_0 <- filter(stars, Star.type == 0) #brown dwarfs
type_1 <- filter(stars, Star.type == 1) #red dwarfs
type_2 <- filter(stars, Star.type == 2) #white dwarfs

Histograms

Another way to view our data is to use ggplot to plot a histogram by specifying geom_histogram.

# Brown dwarf histogram
options(repr.plot.width = 10, repr.plot.height = 8) 
L_0 <- ggplot(type_0, aes(x = Luminosity.L.Lo.))+ xlab("Luminosity (L/Lo)") + geom_histogram(bins = 20) + ggtitle("Figure 2b: Luminosity distribution of Brown Dwarf")
T_0 <- ggplot(type_0, aes(x = Temperature..K.))+ xlab("Temperature (K)") + geom_histogram(bins = 20) + ggtitle("Figure 2a: Temp. distribution of Brown Dwarf")
grid.arrange(T_0, L_0, ncol=2,widths = c(4,4))

# White dwarf histogram
options(repr.plot.width = 10, repr.plot.height = 8) 
L_2 <- ggplot(type_2, aes(x = Luminosity.L.Lo.))+ xlab("Luminosity (L/Lo)") + geom_histogram(bins = 20)+ ggtitle("Figure 4b: Luminosity distribution of White Dwarf")
T_2 <- ggplot(type_2, aes(x = Temperature..K.))+ xlab("Temperature (K)") + geom_histogram(bins = 20)+ ggtitle("Figure 4a: Temp. distribution of White Dwarf")
grid.arrange(T_2, L_2, ncol=2)

Paying attention to the temperature histograms, we can see that the brown dwarf distribution is skewed, with more data points toward the right of the graph, while the white dwarf distribution is skewed toward the left. This suggests that the two dwarf types may have different means. The red dwarf is omitted here, but the process is the same (a sketch of those panels follows the figures below).

[Figure 2: Brown dwarf temperature (2a) and luminosity (2b) distributions]

[Figure 4: White dwarf temperature (4a) and luminosity (4b) distributions]
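
If you would like the red dwarf panels as well, the same pattern applies. The block below simply mirrors the two blocks above; the L_1/T_1 names and the "Figure 3" labels are our guesses at what the original notebook would have used.

# Red dwarf histogram (same pattern as the brown and white dwarf blocks)
options(repr.plot.width = 10, repr.plot.height = 8)
L_1 <- ggplot(type_1, aes(x = Luminosity.L.Lo.)) + xlab("Luminosity (L/Lo)") + geom_histogram(bins = 20) + ggtitle("Figure 3b: Luminosity distribution of Red Dwarf")
T_1 <- ggplot(type_1, aes(x = Temperature..K.)) + xlab("Temperature (K)") + geom_histogram(bins = 20) + ggtitle("Figure 3a: Temp. distribution of Red Dwarf")
grid.arrange(T_1, L_1, ncol = 2)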

Great - we now have a sense of the distribution within our three groups. For the second part of our EDA we will compute some parameter estimates.

Computing Parameter Estimates

Parameters are descriptive measures of the population in question, which here would be all the stars in the universe. Because our goal is to determine whether there exists a difference between the means, it would be really nice to know the true means of all the brown, red, and white dwarfs in the universe. Unfortunately, we almost never have the resources to obtain that kind of data, which is why we compute parameter estimates, also known as sample statistics.

The process for each dwarf type is identical, with the only difference being whether we pass type_0, type_1, or type_2 to the function summarise. Although we could have done this in a for-loop, we simply repeated the process three times for the sake of clarity in our report (a tidier one-pass alternative is sketched after the combined table below).

brown_dwarf_stats <- type_0 %>% 
    summarise(mean_lum = mean(Luminosity.L.Lo.), 
              sd_lum = sd(Luminosity.L.Lo.),
              mean_temp = mean(Temperature..K.),
              sd_temp = sd(Temperature..K.)) %>%
    add_column(Type = "Brown", .before = "mean_lum")
red_dwarf_stats <- type_1 %>% 
    summarise(mean_lum = mean(Luminosity.L.Lo.), 
              sd_lum = sd(Luminosity.L.Lo.),
              mean_temp = mean(Temperature..K.),
              sd_temp = sd(Temperature..K.)) %>%
    add_column(Type = "Red", .before = "mean_lum")
white_dwarf_stats <- type_2 %>% 
    summarise(mean_lum = mean(Luminosity.L.Lo.), 
              sd_lum = sd(Luminosity.L.Lo.),
              mean_temp = mean(Temperature..K.),
              sd_temp = sd(Temperature..K.)) %>%
    add_column(Type = "White", .before = "mean_lum")

These three code blocks produce three different tables, each with one row and four columns. Using the rbind function, we can combine these.

parameter_estimates <- rbind(brown_dwarf_stats, red_dwarf_stats, white_dwarf_stats)
parameter_estimates

[Table: mean and standard deviation of luminosity and temperature for each dwarf type]

Above is a summary of all the statistics for each star type.
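
Incidentally, the whole table could also be built in a single pass with group_by instead of three separate blocks. This is just a sketch of the alternative, not what we used in the report; the only difference is that it keeps the numeric Star.type codes instead of the colour labels.

# One-pass alternative: group by star type and summarise all groups at once
parameter_estimates_alt <- dwarfs %>%
    group_by(Star.type) %>%
    summarise(mean_lum  = mean(Luminosity.L.Lo.),
              sd_lum    = sd(Luminosity.L.Lo.),
              mean_temp = mean(Temperature..K.),
              sd_temp   = sd(Temperature..K.))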

For the luminosity feature, the mean and standard deviation of the brown dwarfs are smaller than those of the red and white dwarfs by an order of magnitude.

For the temperature feature, the mean of the white dwarfs is larger than that of the red and brown dwarfs by an order of magnitude.

These differences are consistent with our alternate hypothesis and suggest that the test we run next has a real chance of detecting an effect.

Hypothesis Testing: ANOVA

ANOVA (Analysis of Variance) is a statistical method used for hypothesis testing that determines whether there exists a difference between the means of three or more groups. It works by comparing the variation between the group means to the variation within the groups, producing a ratio known as the F-value along with an important probability known as the p-value.

  • null hypothesis: The commonly accepted fact. In this case, the null hypothesis is that the means of all three dwarf star groups are equal to each other.
  • alternate hypothesis: The opposite of the null hypothesis. In this case, the alternate hypothesis is that there is a difference in luminosity and temperature between red, brown, and white dwarf stars.
  • F-value: A value calculated as (variation between the sample means) / (variation within the samples).
  • p-value: The probability of obtaining the observed results assuming that the null hypothesis is true.

It is important to note that ANOVA only needs to prove that one mean differs from the other two. The null hypothesis states that all three means are equal to each other. To reject the null hypothesis, it is not necessary for all three to be different from each other.
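
To make the ratio concrete, here is a rough, hand-rolled computation of the F-value for temperature. It is only an illustration of the formula above using the column names from earlier; the aov() call in the next section is what we actually rely on.

# Illustrative sketch: compute the F-value for temperature by hand
k <- n_distinct(dwarfs$Star.type)   # number of groups (3)
N <- nrow(dwarfs)                   # total number of observations (120)
grand_mean <- mean(dwarfs$Temperature..K.)

group_stats <- dwarfs %>%
    group_by(Star.type) %>%
    summarise(n = n(),
              group_mean = mean(Temperature..K.),
              ss_within = sum((Temperature..K. - group_mean)^2))

ss_between <- sum(group_stats$n * (group_stats$group_mean - grand_mean)^2)
ss_within  <- sum(group_stats$ss_within)

# F = (between-group variance) / (within-group variance)
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (N - k))
f_by_hand  # should match the statistic column reported by aov() below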

Temperature

To begin, we select the columns that we want: Temperature..K., the numeric response variable for the current test, and Star.type, the explanatory factor specifying the three dwarf star groups.

dwarfs_temp <- dwarfs %>%
               select(Temperature..K., Star.type)

This snippet performs the actual ANOVA test and prints the results in a tidy table.

anova_temp <- aov(Temperature..K.~Star.type, data = dwarfs_temp)%>%
              tidy()
print("Table 10: ANOVA results for Temperature")
print(anova_temp)

[Output: Table 10, the tidy ANOVA results for temperature]

Finally, we filter and select the values we want: the F statistic (statistic) and the p-value (p.value).

f_stat_temp <- anova_temp %>%
          filter(term == "Star.type") %>%
          select(statistic) %>%
          as.numeric()

anova_pval_temp <- anova_temp %>%
              filter(term == "Star.type") %>%
              select(p.value) %>%
              as.numeric()

print(f_stat_temp)
print(anova_pval_temp)
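
As a side note, dplyr's pull() extracts a single column as a plain vector, so the select() and as.numeric() pair above could be replaced with something like the following (an equivalent alternative, not what our notebook used):

# Equivalent extraction with pull(), which returns the value directly as a number
f_stat_temp <- anova_temp %>%
    filter(term == "Star.type") %>%
    pull(statistic)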

We have now completed the ANOVA test on the temperature feature and obtained an F-value of 188.91 and a p-value of 2.74e-37.

Luminosity

To perform the ANOVA test on the luminosity feature, we repeat the same steps as above, this time using Luminosity.L.Lo. instead of Temperature..K.. The following code block condenses all the steps in the previous section so you can get a better sense of the workflow.

dwarfs_lumi <- dwarfs %>%
               select(Luminosity.L.Lo., Star.type)

anova_lumi <- aov(Luminosity.L.Lo.~Star.type, data = dwarfs_lumi)%>%
              tidy()

f_stat_lumi <- anova_lumi %>%
          filter(term == "Star.type") %>%
          select(statistic) %>%
          as.numeric()

anova_pval_lumi <- anova_lumi%>%
              filter(term == "Star.type") %>%
              select(p.value) %>%
              as.numeric()

print(f_stat_lumi)
print(anova_pval_lumi)

Having completed the ANOVA test on the luminosity feature, we obtain an F-value of 5.09 and a p-value of 0.0076.

Results

Run this code to display the results obtained from both tests together...

table <- data.frame (Type  = c("F-Value", "p-Value"),
                     Temperature_Result = c(f_stat_temp, anova_pval_temp),
                     Luminosity_Result = c(f_stat_lumi, anova_pval_lumi))
table

And voilà - the final results of our hypothesis tests! All that remains now is interpreting these results.

[Table: F-values and p-values from the temperature and luminosity ANOVA tests]

Interpreting Results

Temperature Results

The F-value is 188.91 and the p-value is 2.74e-37. An F-value significantly greater than 1 means that the variation between the sample means is large relative to the variation within the samples. From a statistical standpoint, this indicates that at least one of the group means is indeed different from the others.

Luminosity Results

The F-value obtained for luminosity is 5.09, and the p-value is 0.0076. We get a much smaller F-value this time, but it is still clearly larger than 1. Even though the difference in luminosity across the groups is not as great as what we found for temperature, we can still confidently say that at least one mean value is different from the others.

Our Conclusion

The results of our ANOVA tests indicate that there exist differences between the three types of dwarfs in terms of temperature and luminosity. It is common in statistics to use a p-value of 0.05 as the cut-off point for statistical significance. Since both of our p-values are smaller than 0.05, we are able to reject the null hypothesis: if the null hypothesis were true, the probability of observing results like ours would be so slim that we reject it in favour of our alternate hypothesis.
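
As a tiny sanity check (not part of the original report), the comparison against that cut-off can be written directly in R:

# Both p-values fall below the conventional 0.05 significance level
alpha <- 0.05
anova_pval_temp < alpha  # TRUE (p = 2.74e-37)
anova_pval_lumi < alpha  # TRUE (p = 0.0076)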

We are now relatively confident that the three types of dwarfs do not all have the same average temperature and luminosity.

Wrapping Up

And that's it - we have now successfully completed a full hypothesis test and report using R.

The steps we took are definitely not fool-proof, but you can read up on our potential sources of error, the future directions for this project, and more statistics mumbo-jumbo in our full project on GitHub.

Thanks for joining me!