Analytics for Knowledge Creation (A4KC) Part III: Introducing the pandas data analysis toolkit
In this blog post I’ll build on the previous A4KC posts (Part I and Part II), which focused on getting the data out of a database and into a usable format, and highlight some of the features of the pandas Python package. But first, I need to digress briefly to provide some context about the problem I’m working on.
I’m interested in understanding patterns of discourse, particularly discourse generated by groups of people who are trying to understand difficult concepts. In my experience, one of the best environments for doing so is Knowledge Forum. (Disclaimer: I was part of the research and development team responsible for Knowledge Forum.) In a nutshell, Knowledge Forum consists of “notes” (postings) that are organized into two-dimensional graphical representations called “views”. Notes are composed in an editor that is designed to encourage the sorts of behaviour that we think are conducive to knowledge building. These features include: (1) the use of a “problem statement” to define the top-level goal that the author is working towards, (2) the use of thinking scaffolds that can be used as sentence openers, tags, or post hoc markup of note contents, (3) the use of keywords, (4) the ability to explicitly reference other notes or text segments and automatically create a bibliography, and (5) the ability to create superordinate structures via a “rise-above” feature. Right now I’m interested in looking at the relationship between scaffold use and the nature of collaborative discourse. I have some ideas about what I might find in the data (for example, I suspect there’s a spatio-temporal relationship between notes marked as “I need to understand” and “New information”). Right now, though, I need to get an idea of how scaffolds are used in general. Specifically, I need to know who is using which scaffold and how frequently they’re doing so. I’m going to use a subset of 165 notes produced by a group of 10- and 11-year-old students.
If you haven’t already heard about pandas, get over to the pandas site and check it out. You should also pick up a copy of Python for Data Analysis by Wes McKinney.
In Part I, I provided a script that extracted some data. If we take the first 35 lines of that script (up to the json.loads(…) call) and put that in a script called loaddata.py, we can use IPython to run it and provide a useful environment to continue our data exploration:
moomin:~ cteplovs$ ipython
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: %run loaddata.py
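If you’re following along without Part I handy, the net effect of loaddata.py is to produce a dict named j parsed from the exported JSON. Here’s a minimal stand-in sketch; the miniature structure below is hypothetical (a single view containing a single note), invented purely to illustrate the shape that the real export has, with views and notes under 'contains^' and the fields we’ll use shortly:

```python
# Stand-in sketch of what loaddata.py yields: a dict j parsed from the
# exported JSON. The sample structure is a hypothetical miniature of
# the real export (views under 'contains^', notes under each view,
# with 'crea', 'modi', '^owns', and '^supports' fields).
import json

raw_json = '''
{"Result": [{"contains^": [
  {"titl": "Our View ",
   "contains^": [
     {"#": 101,
      "crea": "2012-01-05",
      "modi": "2012-01-06",
      "^owns": [{"fnam": "Andy"}],
      "^supports": [{"#": 201, "Object": 1, "text": "My theory"}]}
   ]}
]}]}
'''

j = json.loads(raw_json)
print(j['Result'][0]['contains^'][0]['titl'].strip())
```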
If you’re not familiar with IPython you can get more information from ipython.org. The next thing we need to do is import pandas (the convention is to import it as ‘pd’) and for convenience we’re also going to import pandas.DataFrame into our current namespace.
In [2]: from pandas import DataFrame
In [3]: import pandas as pd
Pretty painless! Next we need to set up a python list to hold the DataFrames we are going to create:
In [4]: scaffolds=[]
Next we’re going to walk through our JSON structure by extracting each view, then for each view we’re going to extract each note. For each note we’re going to pull out the scaffolds (called “supports”), the note’s ID, the creation and modification dates, the owners (i.e. the authors: notes can have multiple authors), and the view the note came from. We’re going to put that all into a pandas.DataFrame, and we’re going to append each dataframe to the list.
In [5]: for view in j['Result'][0]['contains^']:
   ....:     for note in view['contains^']:
   ....:         for owner in note.get('^owns'):
   ....:             scaffolds_df = DataFrame(note['^supports'])
   ....:             scaffolds_df['noteid']=note['#']
   ....:             scaffolds_df['created']=note['crea']
   ....:             scaffolds_df['modified']=note.get('modi')
   ....:             scaffolds_df['owner']=owner.get('fnam')
   ....:             scaffolds_df['view']=view['titl'].strip()
   ....:             scaffolds.append(scaffolds_df)
   ....:
And then we’re going to concatenate all those dataframes down into one:
In [6]: scaffolds = pd.concat(scaffolds,ignore_index=True)
In [7]: scaffolds
Out[7]:
Int64Index: 197 entries, 0 to 196
Data columns:
# 197 non-null values
Object 197 non-null values
created 197 non-null values
modified 197 non-null values
noteid 197 non-null values
owner 197 non-null values
text 197 non-null values
view 197 non-null values
dtypes: int64(4), object(4)
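If pd.concat is unfamiliar, here’s a toy example (unrelated to the Knowledge Forum data) showing what ignore_index=True buys us:

```python
import pandas as pd
from pandas import DataFrame

a = DataFrame({'x': [1, 2]})
b = DataFrame({'x': [3]})

# ignore_index=True renumbers the combined rows 0..n-1 instead of
# keeping each frame's original index (which would repeat 0).
combined = pd.concat([a, b], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```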
So across the 165 notes we have 197 scaffold uses. Let’s write that out to a CSV file for later use:
In [8]: scaffolds.to_csv('scaffolds.csv')
We can use the crosstab function in pandas to quickly look at the relationship between people and scaffolds:
In [9]: scaffolds_crosstab = pd.crosstab(scaffolds.owner,scaffolds.text,margins=True)
In [10]: scaffolds_crosstab
Out[10]:
text A better theory Evidence I need to understand My theory \
owner
Andy 1 1 0 3
Harvie 0 0 1 4
Jill 0 2 3 6
Kassandra 1 0 4 4
Katy 0 0 2 1
Kaye 0 0 0 5
Lyn 0 0 2 8
Marion 1 0 1 2
Maybelle 0 0 3 5
Merlin 0 0 1 2
Michelle 0 0 1 0
Moreen 0 1 2 3
Panda 0 2 0 4
Quinton 0 0 0 2
Ravenna 0 0 2 4
Sawyer 0 0 3 2
Stanford 0 0 1 0
Steven 0 0 1 2
Tayler 0 1 2 3
Todd 0 0 2 1
Viviette 0 2 2 4
Yasmine 1 1 0 0
Zoie 1 0 5 2
All 5 10 38 67
text New information Putting our knowledge together Resource All
owner
Andy 2 0 0 7
Harvie 0 0 0 5
Jill 4 0 1 16
Kassandra 5 0 0 14
Katy 1 0 0 4
Kaye 3 0 0 8
Lyn 4 0 0 14
Marion 0 1 0 5
Maybelle 6 0 0 14
Merlin 2 0 0 5
Michelle 0 0 0 1
Moreen 4 0 1 11
Panda 1 0 0 7
Quinton 1 0 0 3
Ravenna 3 0 0 9
Sawyer 3 0 0 8
Stanford 2 0 0 3
Steven 6 0 0 9
Tayler 4 1 0 11
Todd 1 0 0 4
Viviette 5 0 0 13
Yasmine 5 0 0 7
Zoie 10 1 0 19
All 72 3 2 197
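To see what margins=True is doing, here’s a toy crosstab on made-up data (the owners and scaffold labels below are just for illustration):

```python
import pandas as pd

# Three scaffold uses by two (made-up) authors.
df = pd.DataFrame({
    'owner': ['Andy', 'Andy', 'Jill'],
    'text':  ['My theory', 'Evidence', 'My theory'],
})

# margins=True appends an 'All' row and an 'All' column holding the
# row, column, and grand totals.
ct = pd.crosstab(df.owner, df.text, margins=True)
print(ct)
```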
And of course we’re going to write that out to a CSV file for use elsewhere:
In [11]: scaffolds_crosstab.to_csv('scaffolds_crosstab.csv')
So there it is: about 10 lines of pandas-enhanced Python code to accomplish what we used to do in several hundred lines of Perl code in the bad old days!
What have I learned from the output? “New information”, “My theory”, and “I need to understand” are the three most commonly used scaffolds. But that doesn’t really jump out at you from the crude table provided by the pandas output. We can do better!
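One quick way to surface that ranking directly, without scanning the crosstab’s 'All' row, is value_counts on the text column. A sketch, with toy data standing in for the real scaffolds frame:

```python
import pandas as pd

# Toy stand-in for the scaffolds DataFrame built earlier; only the
# 'text' column matters here.
scaffolds = pd.DataFrame({'text': ['New information'] * 3 +
                                  ['My theory'] * 2 +
                                  ['I need to understand']})

# value_counts returns the counts sorted in descending order, so the
# most-used scaffold comes first.
print(scaffolds['text'].value_counts())
```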
Up next: Visual representations to aid data exploration.