Need a quick count? Use defaultdict!

I had read about defaultdict recently, and the other day I needed to get a count based on some XML data. What a joy and convenience to use!

The data is for the Pentaho marketplace, and I wanted to know, for each plugin or "market_entry", what type of plugin we are dealing with.

I had the file locally, but in the code below I show how to grab it from the web using urlretrieve. Next, I set up a defaultdict based on the int type to store my counts. Using ElementTree I could quickly parse the XML and "find" the text for the "type" tag in each entry. At this point it was a simple matter of incrementing the count, as shown in the line:

entry_types[entry_type] += 1
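
To see why that one line is enough, here is a minimal sketch comparing defaultdict(int) with a plain dict. The keys are just example strings, and the variable names (counts, plain) are mine, not part of the marketplace code.

```python
from collections import defaultdict

# defaultdict(int) calls int() for any missing key, and int() returns 0,
# so every new key starts its count at 0.
counts = defaultdict(int)
counts['Step'] += 1        # no KeyError, even though 'Step' was never set
counts['Step'] += 1
counts['Platform'] += 1

# A plain dict needs the missing-key case handled by hand.
plain = {}
try:
    plain['Step'] += 1
except KeyError:
    plain['Step'] = 1

print(dict(counts))
print(plain)
```

With a plain dict you have to write the "first time I see this key" branch yourself; defaultdict takes care of it for you.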

You may never need information about Pentaho plugins as I did, but this is a great tool to have in your toolbox. You can count cute Internet kittens with it (if you can wait long enough!).

In [1]:
import urllib.request
import xml.etree.ElementTree as ET
from collections import defaultdict

# Download the marketplace metadata xml file
urllib.request.urlretrieve('', 'marketplace.xml')

# Parse the data
et = ET.parse('marketplace.xml')

# Use defaultdict to count the types
entry_types = defaultdict(int)
for market_entry in et.iter('market_entry'):
    entry_type = market_entry.find('type').text
    entry_types[entry_type] += 1

# Print the results (for the unformatted version, just use "print(entry_types)")
print("Count: Type\n------ ---------")
for entry_type, count in entry_types.items():
    print("{0}: {1}".format(str(count).rjust(5), entry_type))

Count: Type
------ ---------
   71: Platform
    7: Database
   83: Step
    3: JobEntry
   14: Mixed
    1: Partitioner
    3: HadoopShim
    4: SpoonPlugin
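
If you want to try the pattern without downloading anything, here is a small sketch that parses an inline XML string instead. The sample document is made up for illustration: I am assuming a root element called "market" and entries with only "name" and "type" children, which may not match the real marketplace file. The helper name count_entry_types is also my own. It uses findtext with a default so that an entry without a "type" tag gets counted under a placeholder rather than raising an error.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data, loosely shaped like the file described in this post
SAMPLE_XML = """
<market>
  <market_entry><name>a</name><type>Step</type></market_entry>
  <market_entry><name>b</name><type>Platform</type></market_entry>
  <market_entry><name>c</name><type>Step</type></market_entry>
  <market_entry><name>d</name></market_entry>
</market>
""".strip()


def count_entry_types(xml_text):
    """Count market_entry elements by the text of their 'type' child."""
    root = ET.fromstring(xml_text)
    counts = defaultdict(int)
    for market_entry in root.iter('market_entry'):
        # findtext returns the default when the 'type' tag is missing
        entry_type = market_entry.findtext('type', default='(no type)')
        counts[entry_type] += 1
    return counts


counts = count_entry_types(SAMPLE_XML)

# Print the most common types first
for entry_type, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print("{0}: {1}".format(str(count).rjust(5), entry_type))
```

Sorting by count is optional, but it makes a longer list like the one above easier to scan.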