In this post, we're going to walk through another common coding interview problem: finding the longest word in a paragraph. This is a really good question, because it's very easy to forget some important details, and the perhaps obvious solution isn't necessarily the best.

Walking Through the Problem

For this particular problem, we're going to imagine we've been given the following string of lorem ipsum text. For the sake of space, we'll assume that this string is being used as the value for a variable called base_string.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas luctus ipsum et facilisis dignissim. Duis mattis enim urna. Quisque orci nunc, vulputate id accumsan nec, imperdiet sit amet sem. Integer consequat nibh vel mattis elementum. Nam est elit, sodales vitae augue nec, consequat porttitor nibh. Aliquam ut risus vehicula, egestas ligula sed, egestas neque. Fusce hendrerit vel risus in molestie. Morbi molestie eleifend odio vel ullamcorper. Donec at odio libero. Quisque vulputate nisl nisi, ut convallis lorem vulputate et. Aenean pretium eu tellus a dapibus. Ut id sem vulputate, finibus erat quis, vestibulum enim. Donec commodo dui eros, non hendrerit orci faucibus eu. Integer at blandit ex. Duis posuere, leo non porta tincidunt, augue turpis posuere odio, sed faucibus erat elit vel turpis. Quisque vitae tristique leo.

Lorem ipsum is just a common dummy text used in typesetting and printing to simulate actual text content. It includes words of various lengths, as well as punctuation, so it's perfect for our purposes.

So how do we find the longest word?

Since our paragraph is currently a single string, we're probably going to have to find a way to break it up into individual words. This will let us to perform a direct comparison between different words, allowing us to find out whether a given word is longer than another.

There are a few ways we can perform the actual comparison.

We can create a variable called something like longest_word set to an empty string, and then iterate over all the words, comparing the length of the current longest_word with the word we're checking as part of this iteration. If the new word is longer, we can replace the value of  longest_word with the new word. Repeating this process for every word in the paragraph will leave us with the longest word in the paragraph bound to the longest_word variable.

Another option is that we use the max function, which takes in an iterable, and returns a single item from the collection. It will return the single largest member of the collection, but we'll have to provide some configuration to max so that it knows what that means given the context.

Both of these solutions potentially have a small issue in that they only return a single word. What if two words are of identical length? Do we provide one of these longest words as a solution, or do we provide a collection of all the longest words? If we have to provide all the longest words, then we're going to need to use a different method.

With that, let's move onto out first bit of code for this problem. There is a problem with these solutions, which I'll talk about in a moment.

Our First Solution

Python includes a handy method for strings called split. You can read about split in the official documentation.

split will essentially allow us to divide a string into a list of items using a delimiter string. Whenever this delimiter is encountered in the string, split will create a break point. Everything between these break points becomes an item in the resulting list.

We can therefore do something like this:

word_list = base_string.split(" ")
# ["Lorem", "ipsum", "dolor", "sit", "amet,", ... etc.]

Eagle eyed readers may have already noticed a problem. The comma after amet is included in the corresponding list item. This is the case for all words that have punctuation, and it poses a serious problem.

Our text actually has two longest words, but one of them is at the end of a sentence, and therefore has an appended full stop. With the punctuation included, this word incorrectly becomes the sole longest word in the paragraph.

The problem could be even worse if we have punctuation like quotation marks, since the inclusion of two punctuation characters might make a word longer than the real longest word, despite being potentially a character shorter in reality. This only gets worse as punctuation gets compounded together.

It's clear then that we have to do something about all this extra punctuation.

We can make use of another string method called strip to take of the problem. Once again, you can find details on how strip works in the official docs.

strip allows us to remove characters from either end of a string, and it will keep removing those characters until it encounters a character which does not match those it was instructed to remove. By default, it removes whitespace characters, but we can provide a collection of punctuation instead.

Using a list comprehension, we can therefore iterate over our word_list and correct the punctuation issue:

word_list = base_string.split(" ")
processed_words = [word.strip(".,?!\"\':;(){}[]") for word in word_list]

This is absolutely going to work for our example string, but it's also not a very good solution. For a start, having to use these escape characters is ugly, but more importantly, this group of punctuation is not complete, and it would very tedious to make it complete.

Right now we don't do anything to filter out tildes for example (~), or percentage signs, or hash symbols. There's also another issue we didn't consider at all thus far: numbers. Numbers could easily end up being larger than our longest words, giving us erroneous results.

Instead of removing things we don't wan't, perhaps we can go the other way, adding only the things we do want. Enter RegEx.

Using RegEx

RegEx or regular expressions are a means through which we can define patterns to look for in strings. RegEx is a pretty expansive topic, certainly too big to cover here, but there's a tutorial on using RegEx in Python available in the documentation.

In order to work with regular expressions in Python, we need to import the re module, so that's our first step:

import re

Next, we'll take our base_string variable, containing our lorem ipsum text, and provide as an argument to the findall function that we find in the re module. Along with base_string, we're also going to provide a pattern using the RegEx syntax like so:

import re

word_list = re.findall("[A-Za-z]+", base_string)

RegEx is notoriously cryptic, but our pattern here is relatively simple. It says that we're looking for any letter character in the basic Latin alphabet; we don't care about case; and we want patterns that include any number of these characters, as indicated by the + symbol.

Our resulting word_list variable will now contain a list full of letter only strings. While traversing our string, any time a character was encountered that wasn't part of the our defined pattern, findall considered that the end of a word.

Pretty neat stuff if you ask me.

Another Minor Problem

Unfortunately, it's too early to celebrate just yet. There's once again a problem with our current implementation: punctuation inside words.

Consider a situation like this:

John's dog hasn't eaten its food.

We might make a very good argument that "John's" and "hasn't" are single words, in which case, we have a problem. Right now if we run our code on this sentence we get the following:

import re

base_string = "John's dog hasn't eaten its food."
word_list = re.findall("[A-Za-z]+", base_string)

# ['John', 's', 'dog', 'hasn', 't', 'eaten', 'its', 'food']

This might be perfectly fine, but what happens if your interviewer asks you to treat "John's" as one word? In that case, "John's" is a 5 letter word with a piece of punctuation.

The best answer might be to combine our methods. We can split the string based on spaces, and then check each word for punctuation. This time, however, instead of splitting the string, we'll perform a replacement using another re function called sub.

sub takes a pattern, a replacement string for that pattern, and a string to search.

import re

base_string = "John's dog hasn't eaten its food."
word_list = base_string.split(" ")
processed_words = [re.sub("[^A-Za-z]+", "", word) for word in word_list]

# ['Johns', 'dog', 'hasnt', 'eaten', 'its', 'food']

Here our pattern is exactly the same as before, except we've added a special character, ^, which essentially says match anything which is not in this pattern. As our pattern includes only basic Latin letters, all punctuation is going to be matched by this pattern. Our replacement string is an empty string, which means any time we find a match, the matching characters will simply be removed.

Putting this in a list comprehension, we can perform the substitution for every word in the word_list, giving us a list of properly processed words.

With that, we can finally start counting the length of the words and find the longest word in the paragraph.

Finding the Longest Word

As I mentioned before, we have a couple of options for finding the longest word in our list of words.

Let's start with the string replacement method, using our original paragraph:

import re

word_list = base_string.split(" ")
processed_words = [re.sub("[^A-Za-z]+", "", word) for word in word_list]

longest_word = ""
for word in processed_words:
	if len(longest_word) < len(word):
		longest_word = word

print(longest_word)  # consectetur

In this implementation, we start with an empty string, and begin iterating over all the words in the processed_words. If the current word is longer than the current longest_word we replace the longest word. Otherwise, nothing happens.

Since we start with an empty string, the first word will become the longest string initially, but it will be dethroned as soon as a longer word comes alone.

Instead of this fairly manual method, we could make use of the max function.

import re

word_list = base_string.split(" ")
processed_words = [re.sub("[^A-Za-z]+", "", word) for word in word_list]

longest_word = max(processed_words, key=len)

print(longest_word)  # consectetur

In this method we have to pass in the processed_words as the first argument, but we also have to provide a key. By providing the len function as a key, max will use the length of the various words for determining which is of the greatest value.

Both of these methods only provide a single word, so how do we get all of the longest words if we have multiple words of equal length?

Finding a List of the Longest Words

In order to find all of the longest words, we can do a second pass over our processed_words now that we know what the longest word actually is. We can do this using a conditional list comprehension.

import re

word_list = base_string.split(" ")
processed_words = [re.sub("[^A-Za-z]+", "", word) for word in word_list]

max_word_length = len(max(processed_words, key=len))
longest_words = [word for word in processed_words if len(word) == max_word_length]

print(longest_words)  # ['consectetur', 'ullamcorper']

Wrapping Up

With that, we've successfully found the longest words in a paragraph. Of course this isn't the only way you could solve this problem, and the specific problem might call for minor variations as well. You may be asked to find the length of the longest word in the paragraph, but if you can do the version shown here, that question should be a breeze.

We'd love to see your own solutions to this problem, so get on repl.it and share your solutions with us on Twitter. If you liked the post, we'd also really appreciate it if you could share it with your techy friends.

We'll see you again Monday when we'll be releasing part two of our brief look at the itertools module!