People don't always say what they mean...
... And sometimes, even when they say something they apparently mean, really they’re trying to imply something else. The other day I put my mind to coming up with some good clues about what the patterns were. Or to look at it a different way, what is it that people want you to think? Now, clearly the pattern is context dependent; the sorts of things people say in one place aren't the same as those they'd say in another.
I was curious, but I didn’t want to found a major study. What follows is an account of how I went about satisfying my curiosity. I make no warranties as to its validity but would love to hear from you if you think it’s interesting or have any suggestions.
I chose a social networking site which had 100,000 people's profiles nicely written up. I grouped these by gender, because I needed to construct two groups in order to do a comparison. For the purposes of comparison, I make the assumption that this is a collectively exhaustive set. This is analytically convenient, but possibly politically incorrect. However, that's a risk with any grouping.
Because some had been abandoned, deleted or hidden, not all 100,000 profiles had viable data. I found only around 15,680 which identified as females. Around 23,490 identified as males. I ranked each word by total number of occurrences in the corpus, using the Python Natural Language Toolkit, which handily includes a Porter stemmer. I ignored words with fewer than 100 occurrences all up. I then sorted by gender-seeking bias, by dividing the number of occurrences in one group (say, female) by that in the other (say, male), and ranking. I also reconstructed the original words after they had passed through the stemmer.
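The ranking step above can be sketched roughly as follows. This is a minimal illustration rather than the script behind the results: it assumes the stemmed words for each group are already available as plain lists, and the names `bias_ranking` and `min_total`, along with the add-one smoothing (there only to avoid dividing by zero when a word never appears in one group), are illustrative assumptions.

```python
from collections import Counter


def bias_ranking(female_words, male_words, min_total=100):
    """Rank words by how strongly they lean towards the female group.

    Returns (word, ratio, female_count, male_count) tuples, sorted from
    the most female-biased word to the most male-biased one.
    """
    female = Counter(female_words)
    male = Counter(male_words)
    rows = []
    for word in set(female) | set(male):
        f, m = female[word], male[word]
        if f + m < min_total:
            continue  # ignore words that are rare across the whole corpus
        # Add-one smoothing is an assumption; it avoids division by zero.
        ratio = (f + 1) / (m + 1)
        rows.append((word, ratio, f, m))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Tiny made-up sample, with the threshold lowered to suit it.
    female_sample = ["man"] * 5 + ["guy"] * 2 + ["fun"] * 3
    male_sample = ["man"] * 2 + ["guy"] * 4 + ["fun"] * 3
    for word, ratio, f, m in bias_ranking(female_sample, male_sample, min_total=3):
        print(f"{word}\t{ratio:.2f}\t{f}\t{m}")
```

Because the two groups differ in size, it may also make sense to divide each count by its group's total word count before taking the ratio; the sketch works on raw counts, as the description above does.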
Results:
There are two groups - those males, and females. These are representative of the majority of the data, though the occasional profile indicates a lack of bias either way.
| Female Profile | Male Profile |
|---|---|
Conclusions…
- Males use spell check far more sparingly. I should mention that this is much clearer from the unstemmed lists.
- The word ‘discreet’ occurs surprisingly frequently in male profiles. There was almost no representation of these in the under-30 age group (only two in the whole set), with the bulk occurring in the 30-50 group. No prizes for guessing the context.
- Preference to pronouns - Perhaps obviously, in light of the previous point, intensive pronouns like ‘himself’ are used frequently in those females, ‘herself’ for those males. Interestingly though, ‘guy’ rates much more highly for those males (2:1) than females. ‘man’ rates much more highly in those females (2.5 : 1); you might suspect there was a little bit of substitution and nuance going on, assuming that the words are, at least a little, synonymous. Some answer is found in that ‘guys’ ranks quite well for those females also; it’s not clear cut.
- I’ve corrected some of the results from the Stemmer so they’re a little more readable.
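One way to turn stems back into readable words is to remember, for each stem, which original word produced it most often. The sketch below illustrates that idea and is not necessarily how the lists above were corrected. `toy_stem` is a deliberately crude stand-in so the example runs without NLTK, and `readable_forms` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def readable_forms(words, stem=toy_stem):
    """Map each stem to the original word that most often produced it."""
    forms = defaultdict(Counter)
    for word in words:
        word = word.lower()
        forms[stem(word)][word] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


if __name__ == "__main__":
    sample = ["walking", "walking", "walked", "Walks", "guys", "guy", "guys"]
    print(readable_forms(sample))
```

With NLTK installed, passing `stem=PorterStemmer().stem` (from `nltk.stem`) in place of the toy function should give the same kind of mapping for real Porter stems.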
So where next? It would be interesting to correlate something like drinking or smoking habits, against reported psychological type, for example. I would suppose that if this hasn't already been done by an epidemiologist somewhere, it would have certainly been done to death by a tobacco or liquor operation.
The Code

If you would like to try the Natural Language Toolkit for yourself, here's the code I used. It's not particularly neat.
    #!/usr/bin/python
    from nltk.probability import *
    from nltk.token import *
    from nltk.tokenizer import *
    from nltk.stemmer.porter import *
    from math import *

    stemmer = PorterStemmer()

    # Count stemmed, lower-cased tokens in the male profiles.
    male_file = open('Male.txt')
    freq_stemmed_male = FreqDist()
    regexp = r'\w+|[^\w\s]+'
    tokenizer = RegexpTokenizer(regexp)
    while 1:
        line = male_file.readline()
        if not line:
            break
        sample_male = Token(TEXT=line)
        tokenizer.tokenize(sample_male)
        for word in sample_male['SUBTOKENS']:
            stemmer.stem(word)
            freq_stemmed_male.inc(word['STEM'].lower())
    words_stemmed_male = freq_stemmed_male.samples()

    # Do the same for the female profiles.
    female_file = open('Female.txt')
    freq_stemmed_female = FreqDist()
    while 1:
        line = female_file.readline()
        if not line:
            break
        sample_female = Token(TEXT=line)
        tokenizer.tokenize(sample_female)
        for word in sample_female['SUBTOKENS']:
            stemmer.stem(word)
            freq_stemmed_female.inc(word['STEM'].lower())
    words_stemmed_female = freq_stemmed_female.samples()

    # Create the union of the two sets in case some words
    # were not mentioned in one.
    # Use a dictionary to do this quickly.
    words = {}
    for word in words_stemmed_female:
        words[word] = 1
    for word in words_stemmed_male:
        words[word] = 1
    wordlist = words.keys()

    # Print out the word scores, as well as the
    # actual frequencies, for all words.
    # Keep the sign on the score because it lets
    # the male stuff be at one end
    # and the female at the other
    for word in wordlist:
        total = freq_stemmed_male.count(word) + \
            freq_stemmed_female.count(word)
        print str(total) + "\t" + word + "\t" + \
            str(freq_stemmed_male.count(word)) + "\t" + \
            str(freq_stemmed_female.count(word))