Computing phrase frequencies with phrasefreq

Description

phrasefreq generates an n-gram frequency matrix from text files.

Usage

phrasefreq n filename1.txt [filename2.txt filename3.txt ...]

Notes

phrasefreq works just like wordfreq, except that it requires an extra parameter n specifying the maximum number of words in each 'phrase'. Shorter word sequences are included hierarchically: n=2 produces frequencies for all word pairs (bigrams) and single words (unigrams), while n=3 computes the frequencies of all trigrams, bigrams, and single words.
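The hierarchical counting described above can be illustrated outside Stata. What follows is a minimal Python sketch, not phrasefreq's own implementation: the function name count_phrases and the plain whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter

def count_phrases(text, n):
    """Count every word sequence of length 1..n, joining words with '_'.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which may differ from what phrasefreq does.
    """
    words = text.split()
    counts = Counter()
    for size in range(1, n + 1):                    # unigrams up to n-grams
        for start in range(len(words) - size + 1):  # every window of that size
            counts["_".join(words[start:start + size])] += 1
    return counts

counts = count_phrases("the cat sat on the mat", 2)
print(counts["the"], counts["the_cat"], counts["sat_on"])  # prints: 2 1 1
```

With n=2 the six-word sample yields six unigram occurrences and five bigram occurrences, which mirrors the "bigrams and single words" behaviour described for n=2.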

The resulting dataset consists of a text variable 'phrase' containing the word n-grams, with the individual words of any multi-word n-gram joined together by the underscore '_' character. This is followed by the set of frequency variables, one for each text, with the names tfilename1, tfilename2, tfilename3, etc. Each frequency variable ranges from 0 to a maximum of the total number of words in its text file. Finally, phrasefreq adds a variable called ntuple, which is the number of words joined together in phrase.
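To make the layout concrete, the following Python sketch builds rows with the same columns: phrase, one frequency column per text, and ntuple. It is only an illustration under stated assumptions. The helper names count_phrases and build_dataset are hypothetical, the column names tfilename1 and tfilename2 are supplied by hand, and ntuple is derived by counting underscores, which assumes the words themselves contain no '_'.

```python
from collections import Counter

def count_phrases(words, n):
    """Count every word sequence of length 1..n, joined with '_'."""
    counts = Counter()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            counts["_".join(words[start:start + size])] += 1
    return counts

def build_dataset(texts, n):
    """texts maps a frequency-variable name (e.g. 'tfilename1') to its text."""
    per_text = {name: count_phrases(body.split(), n) for name, body in texts.items()}
    phrases = sorted(set().union(*per_text.values()))
    rows = []
    for phrase in phrases:
        row = {"phrase": phrase}
        for name, counts in per_text.items():
            row[name] = counts[phrase]         # a Counter gives 0 for absent phrases
        row["ntuple"] = phrase.count("_") + 1  # assumes no '_' inside words
        rows.append(row)
    return rows

rows = build_dataset({"tfilename1": "a b a", "tfilename2": "b c"}, 2)
for row in rows:
    print(row)
```

A phrase that occurs in only one text still gets a row, with 0 in the other text's frequency column.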

Although phrasefreq is designed to accept any value of n, it will start to break down for values greater than 5, since the maximum length of a string in Stata 7 is 80 characters.
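The arithmetic behind this limit is simple: a phrase of k words occupies the sum of the word lengths plus k-1 underscores. The Python sketch below checks that length against the 80-character limit stated above. The names MAX_STR_LEN, phrase_length, and fits are hypothetical, and the sketch makes no claim about what phrasefreq itself does when a phrase is too long.

```python
MAX_STR_LEN = 80  # Stata 7 string limit, as stated in the note above

def phrase_length(words):
    """Length of the words joined by '_': letters plus one '_' per gap."""
    return sum(len(w) for w in words) + len(words) - 1

def fits(words):
    """True if the joined phrase is within the 80-character limit."""
    return phrase_length(words) <= MAX_STR_LEN

six_long_words = ["x" * 13] * 6           # six made-up 13-letter words
print(phrase_length(six_long_words))      # 6*13 + 5 underscores
print(fits(six_long_words), fits(six_long_words[:5]))
```

Six 13-letter words already need 83 characters, while five need 69, so long words begin to overflow the limit around n=6.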

Currently phrasefreq is hierarchically inclusive: it includes all lower-order word sequences of length n-1, n-2, ..., 1 as well as the n-tuples themselves. If only word sequences of length n are required, the lower-order tuples can be dropped using the following command, with n replaced by the value that was passed to phrasefreq:

keep if ntuple==n

Stata Code

Examine the source.
