People don't always say what they mean...

... And sometimes, even when they say something they apparently mean, they’re really trying to imply something else. The other day I put my mind to coming up with some good clues about what the patterns are. Or, to look at it a different way: what is it that people want you to think? Clearly the pattern is context-dependent; the sorts of things people say in one place aren't the same as those they'd say in another.

I was curious, but I didn’t want to launch a major study. What follows is an account of how I went about satisfying my curiosity. I make no warranties as to its validity but would love to hear from you if you think it’s interesting or have any suggestions.

I chose a social networking site which had 100,000 people's profiles nicely written up. I grouped these by gender, because I needed to construct two groups in order to do a comparison. For the purposes of comparison, I assume that the two groups are collectively exhaustive. This is analytically convenient, but possibly politically incorrect. However, that's a risk with any grouping.

Because some had been abandoned, deleted or hidden, not all 100,000 profiles had viable data. I found only around 15,680 which identified as female and around 23,490 which identified as male. I ranked each word by its total number of occurrences in the corpus, using the Python Natural Language Toolkit, which handily includes a Porter stemmer. I ignored words with fewer than 100 occurrences all up. I then looked for gender bias by dividing the number of occurrences in one group (say, female) by the number in the other (say, male), and ranking the results. I also reconstructed the original words after they had passed through the stemmer.
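
The counting-and-ranking step described above can be sketched in a few lines of present-day Python. This is a minimal sketch rather than the code used for this post: the name rank_by_bias, its parameters and the toy counts are all made up for illustration, and skipping words that are missing from one group is an assumption made purely to avoid dividing by zero.

```python
from collections import Counter


def rank_by_bias(counts_a, counts_b, min_total=100):
    """Rank words by how much more often they occur in group A than in group B."""
    ranked = []
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        if a + b < min_total:
            continue  # too rare across both groups to say anything useful
        if a == 0 or b == 0:
            continue  # assumption: skip words absent from one group (no division by zero)
        ranked.append((word, a / b))
    # Highest ratio first, i.e. the words most biased towards group A.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy counts, invented for the example.
female = Counter({"yoga": 80, "chess": 30, "music": 500})
male = Counter({"yoga": 20, "chess": 120, "music": 500})

ranking = rank_by_bias(female, male)
print(ranking)  # [('yoga', 4.0), ('music', 1.0), ('chess', 0.25)]
```

Swapping the two arguments gives the ranking for the other group. The division here uses raw counts, as in the description above; dividing each count by its group's total word count first would be one way to allow for the groups being different sizes.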

Results:
There are two groups: male profiles and female profiles. Between them they account for the majority of the data, though the occasional profile does not indicate either way.

Female Profiles
Word            How much more often than males?
sew             23.19321839
aerobics        10.01794512
himself         8.780700902
decorating      8.638515064
taller          8.523774962
netball         7.433230546
pilates         7.251477711
craft           6.166898294
cleo            6.127854936
bridget         5.842838428
trashy          5.715000141
patricia        5.515069443
interior        5.486567792
nurse           5.323893275
fabulous        4.956976306
cosmo           4.792777606
knit            4.587669572
cornwell        4.540312983
husband         4.346501757
gossip          4.210189514
masculine       4.180242127
girly           4.157795772
yoga            4.119552045
boon            4.083986552
he              3.95502485
forensic        3.95032881
champagne       3.880609387
belly           3.847722867
softball        3.742593827
horseride       3.725572935
brunch          3.715042768
murder          3.64113372
him             3.53703717
grandchildren   3.486998848
emergency       3.483535106
scary           3.407831685
diary           3.348943977
violent         3.304650871
ballet          3.298048172
puzzle          3.206435723
vivacious       3.102040141
mum             3.08652024
crossword       3.074106629

Male Profiles
Word            How much more often than females?
herself         15.90551142
mechanic        11.0888097
skateboard      8.080340026
allway          7.71885113
bla             7.525625608
fhm             6.96840727
electric        6.404530158
engine          6.012214011
moto            5.776734317
dj              5.357814304
snooker         5.197879549
clancy          5.06793256
electronica     5.009456415
technical       4.826602438
woodwork        4.586364308
clube           4.580631352
reckon          4.513948029
ralph           4.418197617
she             4.409582347
discreet        4.219455163
armi            4.17359152
pc              3.980481444
motorsport      3.941725325
geek            3.939445553
swime           3.864510447
chef            3.82415424
injoy           3.768462673
bloke           3.764159792
motorcycle      3.73522042
chess           3.66661233
ju              3.601387973
4×4             3.595200021
pilot           3.592106045
snowboard       3.542966428
wakeboard       3.541983635
construct       3.508568695
freind          3.424278757
snatch          3.407498815
sking           3.39538906
trade           3.303737001
windsurf        3.234236164
harley          3.220425373
lad             3.15585544
martial         3.056353175
aviat           3.0509293

Conclusions…

  • Males use spell check far more sparingly. I should mention that this is much clearer from the unstemmed lists.
  • The word ‘discreet’ occurs surprisingly frequently in male profiles. Almost none of these occurrences were in the under-30 age group (only two in the whole set); the bulk were in the 30-50 group. No prizes for guessing the context.
  • Pronoun preference: perhaps obviously in light of the previous point, intensive pronouns like ‘himself’ are used far more often in female profiles, and ‘herself’ in male profiles. Interestingly though, ‘guy’ rates much more highly in male profiles (2:1) than in female ones, whereas ‘man’ rates much more highly in female profiles (2.5:1). You might suspect a little substitution and nuance is going on, assuming the words are at least somewhat synonymous. Part of the answer is that ‘guys’ ranks quite well in female profiles too, so it’s not clear-cut.
  • I’ve corrected some of the results from the stemmer so they’re a little more readable.
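
One way to turn stems back into readable words, as mentioned in the last point, is to remember which original word produced each stem most often. The sketch below is only an illustration of that idea, not the procedure used for the lists above: toy_stem is a hypothetical stand-in for a real stemmer, and readable_forms and the sample tokens are likewise invented.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lower-case, drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def readable_forms(tokens, stem=toy_stem):
    """Map each stem to the original (lower-cased) word that produced it most often."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token.lower()] += 1
    return {s: seen.most_common(1)[0][0] for s, seen in forms.items()}


# Invented sample tokens.
tokens = ["motorcycles", "motorcycles", "motorcycle", "Chess", "chess"]
lookup = readable_forms(tokens)
print(lookup["ches"])  # chess
```

The toy stemmer mangles ‘chess’ into ‘ches’ in much the same way a real stemmer produces stems that are not words, and the lookup recovers the readable form.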

So where next? It would be interesting to correlate something like drinking or smoking habits against reported psychological type, for example. I would suppose that if this hasn't already been done by an epidemiologist somewhere, it has certainly been done to death by a tobacco or liquor operation.

The Code

If you would like to try the Natural Language Toolkit for yourself, here's the code I used. It's not particularly neat.

 
#!/usr/bin/python

from nltk.probability import * 
from nltk.token import * 
from nltk.tokenizer import *  
from nltk.stemmer.porter import * 
from math import * 


stemmer = PorterStemmer()
regexp = r'\w+|[^\w\s]+'
tokenizer = RegexpTokenizer(regexp)

# Count the stemmed, lower-cased words in the male profiles.
male_file = open('Male.txt')
freq_stemmed_male = FreqDist()
while 1:
    line = male_file.readline()
    if not line:
        break
    sample_male = Token(TEXT=line)
    tokenizer.tokenize(sample_male)
    for word in sample_male['SUBTOKENS']:
        stemmer.stem(word)
        freq_stemmed_male.inc(word['STEM'].lower())
words_stemmed_male = freq_stemmed_male.samples()

# Do the same for the female profiles.
female_file = open('Female.txt')
freq_stemmed_female = FreqDist()
while 1:
    line = female_file.readline()
    if not line:
        break
    sample_female = Token(TEXT=line)
    tokenizer.tokenize(sample_female)
    for word in sample_female['SUBTOKENS']:
        stemmer.stem(word)
        freq_stemmed_female.inc(word['STEM'].lower())
words_stemmed_female = freq_stemmed_female.samples()

# Create the union of the two sets in case some words
# were not mentioned in one.
# Use a dictionary to do this quickly.
words = {}
for word in words_stemmed_female:
    words[word] = 1
for word in words_stemmed_male:
    words[word] = 1
wordlist = words.keys()

# Print the total, the word and the actual frequencies
# in each group, tab-separated, for all words.
# (The signed score that puts the male stuff at one end
# and the female at the other is not computed in this listing.)
for word in wordlist:
    total = freq_stemmed_male.count(word) + \
            freq_stemmed_female.count(word)
    print str(total) + "\t" + word + "\t" + \
          str(freq_stemmed_male.count(word)) + "\t" + \
          str(freq_stemmed_female.count(word))
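
The script above is written for an old release of the toolkit and for Python 2. For anyone wanting to try the same idea today, here is a rough Python 3 sketch that needs only the standard library. It reuses the same token pattern, but everything else is an assumption: count_words, signed_score and the sample lines are invented for illustration, the default "stemmer" merely lower-cases, and the log-ratio with add-one smoothing is only a guess at the kind of signed score the comments in the script allude to. With a current NLTK install, something like nltk.stem.PorterStemmer().stem could be passed as the stem argument instead.

```python
import math
import re
from collections import Counter

# Same token pattern as the script above: runs of word characters, or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def count_words(lines, stem=str.lower):
    """Count tokens across an iterable of text lines, after passing each through stem."""
    counts = Counter()
    for line in lines:
        for token in TOKEN_RE.findall(line):
            counts[stem(token)] += 1
    return counts


def signed_score(male_count, female_count):
    """Log ratio: positive leans male, negative leans female, zero is balanced."""
    # Assumption: add 1 to each count so a word missing from one group
    # does not cause a division by zero.
    return math.log((male_count + 1) / (female_count + 1))


# Invented sample lines standing in for Male.txt and Female.txt.
male = count_words(["I reckon chess is ok", "Chess and snooker"])
female = count_words(["Yoga, then brunch", "yoga and chess"])

# Total, word, male count, female count and score, tab-separated.
for word in sorted(set(male) | set(female)):
    print(male[word] + female[word], word, male[word], female[word],
          round(signed_score(male[word], female[word]), 3), sep="\t")
```

Sorting the output on the last column puts the most male-leaning words at one end and the most female-leaning at the other.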