Thinking with Pandas

You can see and run all of my work for this blog post at Colaboratory.

Pandas is a Python library for manipulating data. Wrapping my head around it took a while: I’ve been using it for ~6 months and I’m still learning how to use it “right.” It uses all kinds of syntactic sugar to optimize working with vectors and matrices instead of scalars. This makes working with Pandas very different than working with “vanilla” Python.

For example, let’s say you wanted to get a bunch of random dice rolls for playing Dungeons and Dragons. D&D uses 20-sided dice, so in normal Python, you’d probably do something like:

import random
rolls = [random.randint(1, 20) for _ in range(10000)]

In Pandas, it would be something like:

rolls = pd.DataFrame(np.random.randint(1, 21, size=(10000, 1)), columns=['roll'])
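
If you want to run both versions side by side, here's a self-contained sketch with the imports they need. The names rolls_list and rolls_df are my own placeholders so the two don't overwrite each other, and I'm assuming random.randint for the list version. Watch the bounds: random.randint includes its upper bound, while np.random.randint excludes it.

```python
import random

import numpy as np
import pandas as pd

N_ROLLS = 10000  # same number of rolls as in the post

# Vanilla Python: random.randint includes both endpoints, so this is 1..20.
rolls_list = [random.randint(1, 20) for _ in range(N_ROLLS)]

# NumPy: np.random.randint excludes the upper bound, so 21 means 1..20.
rolls_df = pd.DataFrame(
    np.random.randint(1, 21, size=(N_ROLLS, 1)), columns=["roll"]
)

print(len(rolls_list), rolls_df.shape)
print(min(rolls_list), max(rolls_list))
print(rolls_df.roll.min(), rolls_df.roll.max())
```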

In D&D, rolling a 20 on the die is special and called a “critical hit.” It usually does good things for the player. If we iterate through the vanilla Python list of rolls, counting how many critical hits we have, it’s pretty fast:

%%timeit
count = 0
for roll in rolls:
  if roll == 20:
    count += 1
 
# Output:
1000 loops, best of 3: 267 µs per loop

If we do the same naive approach with Pandas, it’s, uh, slower:

%%timeit
count = 0
for roll in rolls.iterrows():
  count += (roll[1] == 20)
count
 
# Output:
1 loop, best of 3: 3.12 s per loop

That’s over 10,000x slower (267 µs -> 3.12 seconds). Like, they-must-have-put-a-sleep()-in-iterrows()-to-discourage-you-from-using-it slow.

So why use Pandas? Because it isn’t slow if you do it the “Pandas way”:

%%timeit
(rolls.roll == 20).sum()
 
# Output:
1000 loops, best of 3: 341 µs per loop

Nice! Only 1.3x slower than vanilla Python. Also, notice the syntactic sugar: you can pretend that the vector is a single number, comparing it to a scalar. If you look at (rolls.roll == 20), it’s a series of booleans:

(rolls.roll == 20).head()
0    False
1    False
2    False
3     True
4    False

When you take the sum() of that series, False is converted to 0 and True to 1, so the sum is the number of True elements.
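
Here's a tiny sketch of that conversion. The series is typed in by hand to match the five values printed above, and is_crit is just a name I picked for it.

```python
import pandas as pd

# Hand-typed copy of the five booleans shown above.
is_crit = pd.Series([False, False, False, True, False])

print(is_crit.astype(int).tolist())  # the 0/1 values that get added up: [0, 0, 0, 1, 0]
print(is_crit.sum())                 # number of True values: 1
print(is_crit.mean())                # fraction of True values: 0.2
```

The mean() trick is handy when you want the critical hit rate instead of the count.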

Modifying some elements of a vector

If you’re attacking and your roll (as calculated above) is at least the defender’s armor class (say, 14), then you roll for damage. Suppose you do 2d6 damage on a hit: if your attack roll is greater than or equal to 14, you do 2d6 damage; otherwise the blow glances off their armor and you do 0 damage.

With vanilla Python, this would look something like:

for roll in rolls:
  if roll >= 14: 
    damage = roll_damage()
  else: 
    damage = 0

However, with Pandas, we shouldn’t loop through the rows. Instead, we’ll use a mask. Like a theater mask, a Pandas mask is an opaque structure that you “punch holes” in wherever you want operations to happen. Then you apply the mask to your Pandas dataframe and apply the operation: only the entries where there are “holes” will get the operation applied.
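
To make the “punch holes” idea concrete, here's a minimal sketch on a five-row dataframe. The values, the names demo and demo_mask, and the +2 bonus are all made up for illustration.

```python
import pandas as pd

# Made-up rolls, just to show the idea.
demo = pd.DataFrame({"roll": [8, 8, 6, 17, 15]})

# The mask has a "hole" (True) wherever roll >= 14.
demo_mask = demo.roll >= 14

# Only the rows with holes get the operation: add a hypothetical +2 bonus.
demo.loc[demo_mask, "roll"] += 2

print(demo.roll.tolist())  # [8, 8, 6, 19, 17]
```

The first three rows are untouched because the mask is False there.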

First we create the mask with the criteria we want to update:

mask = rolls.roll >= 14
mask.head()
0    False
1    False
2    False
3     True
4     True
Name: roll, dtype: bool

Now we want to:

  1. Grab all the hits (wherever the mask is True).
  2. Generate random numbers for each hit (equivalent to rolling 2d6).
  3. Merge those hits back into the rolls dataframe.

First we’ll create a series that tracks all of the hits. We want to be able to merge that back into our original dataframe at the end, so we want to preserve the index of each hit from the original dataframe. Based on the mask above, this would be [3, 4, ...] and so on (wherever the mask is True). We get this with mask[mask], which is a series of all the True values with their associated index. Then we set every element of this series to a randomly generated 2d6 roll:

hits = pd.Series(index=mask[mask].index)
hits.loc[:] = np.random.randint(1, 7, size=(len(hits),)) + np.random.randint(1, 7, size=(len(hits),))
hits.head()
3     6
4     3
9     8
10    3
14    3
dtype: int64

Note that we’re generating a 1D array of random ints (size=(len(hits),)) for the damage instead of the 2D one above (size=(10000, 1)) because this is a series (1D), not a dataframe (2D).
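
If the difference between the two size arguments is hard to picture, this small sketch prints the resulting shapes. The length 5 and the names one_d and two_d are arbitrary choices for the example.

```python
import numpy as np

one_d = np.random.randint(1, 7, size=(5,))    # 1D: fits a series
two_d = np.random.randint(1, 7, size=(5, 1))  # 2D: fits a one-column dataframe

print(one_d.shape)  # (5,)
print(two_d.shape)  # (5, 1)
```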

Then we can combine that damage back into the rolls dataframe using our original mask:

rolls.loc[mask, 'damage'] = hits
rolls.loc[~mask, 'damage'] = 0
rolls.head()
   roll  damage
0     8       0
1     8       0
2     6       0
3    17       7
4    15       7
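
The reason preserving the index matters is that, as far as I can tell, the .loc assignment lines values up by index label rather than by position. Here's a sketch with made-up numbers where the hits are deliberately listed out of order (demo, demo_mask, and demo_hits are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({"roll": [8, 8, 6, 17, 15]})
demo_mask = demo.roll >= 14

# Hypothetical damage values, listed out of order (index 4 before index 3).
demo_hits = pd.Series([5, 9], index=[4, 3])

# Values are matched on index labels: row 3 gets 9 and row 4 gets 5.
demo.loc[demo_mask, "damage"] = demo_hits
demo.loc[~demo_mask, "damage"] = 0

print(demo)
```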

This lets you quickly and selectively update data.
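
One possible shortcut, offered as a sketch rather than the approach used above: roll 2d6 for every row and let np.where keep the damage only where the mask is True. It throws away the rolls generated for misses, but it skips the intermediate series. The names demo, demo_mask, and all_damage are placeholders of mine.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randint(1, 21, size=(10000, 1)), columns=["roll"])
demo_mask = demo.roll >= 14

# Roll 2d6 for every row, hit or miss.
all_damage = (
    np.random.randint(1, 7, size=len(demo))
    + np.random.randint(1, 7, size=len(demo))
)

# Keep the damage where the mask is True, use 0 everywhere else.
demo["damage"] = np.where(demo_mask, all_damage, 0)

print(demo.head())
```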

Also, I’m still learning! Let me know in the comments if there’s a better way to do any of this.

kristina chodorow's blog