Project #2: Cluster Analysis | business management | Harvard University
For this project, we’re going to use cluster analysis to “tell a story” about our data. I’m asking you to divide the Oregonians in your sample into groups or clusters based on two quantitative variables. Your “story” will be an explanation of your data that highlights some interesting feature(s) or makes a point about the data.
Please note this project will likely take some trial and error. Please relax into it and have some fun with the process: think of it as an exploration. Trial and error is the spice of life!
You will begin by opening a small data set. If you don’t have a data set of your own you’d like to explore, I recommend taking a small random subset of the OregonPUMS_data. I suggest n=400 so as not to upset XLSTAT too much: some clustering algorithms grind to a halt with large data sets.
Note that “Weight,” just like last time in Project 1, is probably not what you think it is! I’m not forbidding “Weight,” but realize you’ll have to do a bunch of research about what “Weight” is in order to write about it well and get credit!
Step 1: Select your sample of n=400. Lucky for us, XLSTAT is quite good at taking a random sample. Check out: Simple Random Sampling in XLSTAT. Alternatively, there are several tutorials online for taking a random sample with regular Excel.
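If you are curious what a simple random sample looks like outside of XLSTAT, here is a minimal Python sketch. It is only an illustration, not part of the required workflow: the 5,000 row ids stand in for the full data set, and the function name and seed value are my own hypothetical choices.

```python
import random

def simple_random_sample(rows, n=400, seed=2024):
    """Draw n rows without replacement; a fixed seed makes the draw repeatable."""
    if n > len(rows):
        raise ValueError("sample size exceeds the number of rows")
    rng = random.Random(seed)
    return rng.sample(rows, n)

# Hypothetical stand-in for the full data set: 5,000 row ids.
population = list(range(5000))
sample = simple_random_sample(population, n=400)

# Every sampled row is distinct, because sampling is without replacement.
print(len(sample), len(set(sample)))
```

Fixing the seed means you can regenerate exactly the same 400 rows later, which helps if you need to redo a step.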
Step 2: Choose two QUANTITATIVE variables that you would like to work with. Copy and paste your two variables and their corresponding sampled data (there should be 400 rows of data, two columns) into a new sheet. I prefer to do this so that I am not overwhelmed by variables that I am not using. Next, remove any rows with missing observations. This will save time later when you go to plot your clusters.
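The same clean-up can be sketched in Python. This is an assumption-laden example: the column names AGE and INCOME and the four rows are made up for illustration, and I am treating blank cells and non-numeric text such as "NA" as missing.

```python
def to_number(text):
    """Convert a cell to float, or return None when it is blank or non-numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def keep_complete_pairs(rows, x_name, y_name):
    """Keep only the two chosen columns, dropping rows where either value is missing."""
    pairs = []
    for row in rows:
        x = to_number(row.get(x_name))
        y = to_number(row.get(y_name))
        if x is not None and y is not None:
            pairs.append((x, y))
    return pairs

# Hypothetical rows, shaped the way csv.DictReader would yield them (all strings).
rows = [
    {"AGE": "34", "INCOME": "52000", "SEX": "F"},
    {"AGE": "", "INCOME": "41000", "SEX": "M"},    # missing AGE: dropped
    {"AGE": "58", "INCOME": "NA", "SEX": "F"},     # missing INCOME: dropped
    {"AGE": "21", "INCOME": "18000", "SEX": "M"},
]
pairs = keep_complete_pairs(rows, "AGE", "INCOME")
print(pairs)  # [(34.0, 52000.0), (21.0, 18000.0)]
```

Note that after dropping incomplete rows you may end up with fewer than 400 observations, which is fine.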
For the following steps, be sure that you have installed the XLSTAT add-in. Click on the XLSTAT tab on the top of your Excel sheet.
Step 3: Use different options in the software to create 5 different “data stories”: if you’re overwhelmed about what to pick, you can use these options:
*Scatterplots will have to be created separately using the Results by Object output. Under the colors tab, use whatever colors you would like, but be sure they are bold and distinct. For example, it would be a bad idea to use white, or to use both red and red-orange.
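To make the idea of a clustering choice concrete, here is a small sketch of one common method, k-means, written in plain Python. I am assuming k-means purely for illustration; it is not necessarily one of the options you will pick in XLSTAT, and the eight data points, variable names, and k=2 are invented. The sketch standardizes both variables first, since otherwise the variable with the larger numbers (income, here) tends to dominate the distances.

```python
import random
import statistics

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 so neither variable dominates."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    statistics.mean([m[0] for m in members]),
                    statistics.mean([m[1] for m in members]),
                )
    return labels

# Hypothetical data: four younger, lower-income people and four older, higher-income people.
ages = [22, 25, 27, 24, 61, 64, 66, 59]
incomes = [18000, 21000, 25000, 20000, 72000, 80000, 76000, 69000]

points = list(zip(standardize(ages), standardize(incomes)))
labels = kmeans(points, k=2)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random start, so read the labels as group membership, not as a ranking.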
Step 4: Write up your project! Which clustering method out of the five did you prefer? Why?
For your final report, compare and contrast each of the five clustering methods. You may choose to use your XLSTAT output or use Tableau/other software to make a prettier graph. Tell your story using your preferred clustering method, and explain how the clustering supports that story. Who are these groups? What does this clustering tell us about the people in Oregon? How could a business or entrepreneur use their understanding of this clustering story to further their goals?
Rubric for Project (40 points)
Maximum 5 points total if your variables are not quantitative! You must have two quantitative variables!
15 points: at least 5 different scatter plot graphs, all using the same two variables (Step 2) but different clustering choices (Step 3). Data process and data product both discussed, particularly for Method 5. Clusters must vary, and at least one method shouldn’t be “just stripes,” i.e. both variables should matter.
10 points: your narration of the progression of your thinking (data process story).
5 points: Instructor’s subjective take on the product story. Was it gripping, interesting, well done?
5 points: graph conventions, labels, etc.
5 points: conventions: correct punctuation, sentences, etc.
Data Note: Be careful about the “Person’s Weight” variable. This does not mean “how much this person weighs”; it means “how much weight to assign this person’s answers.” If you’re curious (not required), you can read about statistical weighting here: http://www.applied-survey-methods.com/weight.html.
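Here is a tiny sketch of what “how much weight to assign this person’s answers” means in practice. The three ages and their weights are invented, and I am assuming the usual survey interpretation in which each weight is roughly the number of people in the population that the respondent represents.

```python
def weighted_mean(values, weights):
    """Average the values, counting each one in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical respondents: the first one stands in for twice as many people.
ages = [30, 50, 70]
weights = [100, 50, 50]

plain = sum(ages) / len(ages)
weighted = weighted_mean(ages, weights)
print(plain, weighted)  # 50.0 45.0
```

The weighted average is pulled toward the 30-year-old because that respondent represents more people than the other two.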
Requirements: around 3 pages | .doc file