Count words in file

For sample input, we'll use an example file generated with Cupcake Ipsum, the sugar-coated Lorem Ipsum generator. This example shows how to count the words contained in a file using plain Java, Java 8, and Guava.

Setup

private static final String SOURCE = "com/levelup/java/io/word-occurrences-in-file.txt";

private URI sourceFileURI;

@Before
public void setUp () throws URISyntaxException {
    sourceFileURI = this.getClass().getClassLoader().getResource(SOURCE).toURI();
}

Straight up Java

This snippet shows how to count the words in a text file using Java 7 syntax. On older Java versions, you could read the file into an ArrayList using another library instead.

First we read the lines of the text file by calling Files.readAllLines, which returns them as a List. Next we create a HashMap that stores each word as the key and the number of times it was found as the value. Iterating over each line and splitting it on spaces, we check whether the word already exists in the map: if so, we increment its count; otherwise we put it in the map with an initial value of 1.

@Test
public void distinct_words_in_file_java() throws IOException {

    File file = new File(sourceFileURI);

    List<String> lines = java.nio.file.Files.readAllLines(
            Paths.get(file.toURI()), Charsets.UTF_8);

    Map<String, Integer> wordOccurrences = new HashMap<String, Integer>();

    // for each line in file
    for (String line : lines) {

        String[] words = line.split(" ");

        // for every word in line
        for (String word : words) {

            // strip periods so "cake." and "cake" count as the same word
            word = word.replace(".", "");

            if (!word.trim().isEmpty()) {
                if (wordOccurrences.containsKey(word)) {
                    int count = wordOccurrences.get(word);
                    wordOccurrences.put(word, count + 1);
                } else {
                    wordOccurrences.put(word, 1);
                }
            }
        }
    }

    logger.info(wordOccurrences);

    assertEquals(80, wordOccurrences.size());
}

Output

{Cake=1,
tart=5,
dragée=4,
...
rops=4,
brownie=2,
Dessert=1,
pastry=2,
claw=2,
sugar=4}

Java 8

Using Java 8 syntax we will count the unique words contained in a text file. Files, the static utility class introduced with the Java 7 NIO file API, gained a lines method in Java 8 that reads all lines from a file as a Stream. Calling Stream.flatMap then breaks each line into word elements. If we had a line made up of "she skipped while she was chewing bubble gum", it would be broken into ["she", "skipped", "while", "she", "was", "chewing", "bubble", "gum"]. Calling the Stream.distinct method removes the duplicates so each unique word remains once, and Stream.count returns how many there are.
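To make the flatMap and distinct steps concrete, here is a minimal sketch that runs the sample sentence through the same kind of pipeline in memory instead of reading a file. The class name DistinctWordsSketch and the helper distinctWords are illustrative names, not part of the original example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DistinctWordsSketch {

    // Hypothetical helper: splits each line on single spaces and keeps the
    // first occurrence of every word, in encounter order.
    static List<String> distinctWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> !word.isEmpty())
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = distinctWords(
                Stream.of("she skipped while she was chewing bubble gum"));
        System.out.println(words);        // [she, skipped, while, was, chewing, bubble, gum]
        System.out.println(words.size()); // 7
    }
}
```

The second "she" is dropped by distinct, so eight tokens become seven unique words.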

@Test
public void count_distinct_words_java8() throws IOException {

    File file = new File(sourceFileURI);

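    long uniqueWords = java.nio.file.Files
            .lines(Paths.get(file.toURI()), Charsets.UTF_8)
            // break each line into words
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // strip periods and skip empty tokens
            .map(word -> word.replace(".", ""))
            .filter(word -> !word.trim().isEmpty())
            .distinct()
            .count();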

    assertEquals(80, uniqueWords);
}
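The test above only reports how many distinct words there are. If per-word totals are wanted, comparable to the HashMap in the plain Java example, one possible approach is Collectors.groupingBy with Collectors.counting. This is a sketch under assumptions: the class name WordOccurrencesSketch, the helper wordOccurrences, and the sample lines are made up for illustration, and a TreeMap is chosen only so the printed order is predictable.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordOccurrencesSketch {

    // Hypothetical helper: counts how often each word appears, stripping
    // periods and skipping empty tokens like the plain Java loop does.
    static Map<String, Long> wordOccurrences(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> word.replace(".", ""))
                .filter(word -> !word.trim().isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from the article's input file.
        Map<String, Long> counts = wordOccurrences(
                Stream.of("tart cake tart.", "cake  tart"));
        System.out.println(counts); // prints {cake=2, tart=3}
    }
}
```

To apply this to the file, the stream returned by Files.lines could be passed to the helper in place of Stream.of.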

Google Guava

Wikipedia defines a multiset, in mathematics, as "a generalization of the notion of set in which members are allowed to appear more than once... In multisets, as in sets and in contrast to tuples, the order of elements is irrelevant: The multisets {a, a, b} and {a, b, a} are equal." A Guava Multiset is like a set but can contain duplicate elements, and it has useful summary methods such as count. Below we will use Guava's Splitter to split the file contents into words and load them into a HashMultiset.

@Test
public void count_distinct_words_in_file_guava () throws IOException {

    File file = new File(sourceFileURI);

    Multiset<String> wordOccurrences = HashMultiset.create(
      Splitter.on(CharMatcher.whitespace())
        .trimResults(CharMatcher.is('.'))
        .omitEmptyStrings()
        .split(Files.asCharSource(file, Charsets.UTF_8).read()));


    logger.info(wordOccurrences);

    assertEquals(80, wordOccurrences.elementSet().size());
}

Output

Cake, tart x 5,
dragée x 4,
Halvah, soufflé x 5,
Fruitcake x 2,
wafer x 2,
Sesame, Macaroon,
canes x 3,
...
brownie x 2,
Dessert,
pastry x 2,
claw x 2,
sugar x 4