<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; André Lynum</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/alynum/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Voting patterns at the Norwegian parliament</title>
		<link>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/#comments</comments>
		<pubDate>Thu, 30 Jul 2015 11:42:37 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[PCA]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3858</guid>
		<description><![CDATA[A couple of weeks ago we saw the blog post visualizing the voting patterns in the Polish parliament. In anticipation of the upcoming election and in the interest of checking up on our elected representatives we thought we would do a similar analysis for the Norwegian parliament. First we will visualize a projection of the voting data [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignnone" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Stortinget%2C_Oslo%2C_Norway_%28cropped%29.jpg/640px-Stortinget%2C_Oslo%2C_Norway_%28cropped%29.jpg" alt="By Stortinget,_Oslo,_Norway.jpg: gcardinal from Norway (Stortinget,_Oslo,_Norway.jpg) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons" width="640" height="290" /></p>
<p style="color: #000000">A couple of weeks ago we saw the blog <a style="color: #337ab7" href="https://marcinciura.wordpress.com/2015/07/01/the-vector-space-of-the-polish-parliament-in-pictures/" target="_blank">post</a> visualizing the voting patterns in the Polish parliament. In anticipation of the upcoming election and in the interest of checking up on our elected representatives we thought we would do a similar analysis for the Norwegian parliament. First we will visualize a projection of the voting data of the Norwegian parliament into 3D, and then we will try to interpret the axes of the projection in terms of the issues voted on.</p>
<h2 id="The-data" style="color: #000000">The data</h2>
<p style="color: #000000">Stortinget, the Norwegian parliament, has a nice <a style="color: #337ab7" href="http://data.stortinget.no/" target="_blank">web service</a> for retrieving details about the process an issue passes through in the parliament. Proposals, committees, referendums and so on; there is a surprising amount of detail in there. With a bit of work we pulled out the individual votes for the parliamentary referendums we were able to access. On the less bright side, the data does not appear to be complete, and there is a range of documented caveats and restrictions regarding which votes are actually registered in the system. This reduces the usefulness of any detailed analysis based on this data, but we still think the aggregated picture one can create is both interesting and valid.</p>
<p style="color: #000000">The voting data comes in the form of for, against or abstained votes. As in the referenced blog post, we would like to visualize the similarity in voting patterns between the representatives. To do this we constructed a grid of referendums versus representatives for the current parliamentary period, with a single cell representing a vote encoded as 1 for &#8220;for&#8221;, -1 for &#8220;against&#8221; and 0 for &#8220;abstained&#8221;, which makes abstention a neutral midpoint in the analysis we are going to do. This gives a data point for each representative in the multidimensional &#8220;vote space&#8221; where we can apply clustering or similarity measures to find patterns in the votes.</p>
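<p>As a minimal sketch of this encoding (the data layout and helper below are hypothetical, not taken from the original analysis):</p>

```python
import numpy as np

# Hypothetical sketch: encode votes as a representatives x referendums matrix.
# Encoding from the text: "for" -> 1, "against" -> -1, "abstained" -> 0 (neutral).
VOTE_VALUES = {"for": 1, "against": -1, "abstained": 0}

def build_vote_matrix(votes, representatives, referendums):
    """votes: iterable of (representative, referendum, vote_string) tuples."""
    rep_idx = {r: i for i, r in enumerate(representatives)}
    ref_idx = {r: i for i, r in enumerate(referendums)}
    matrix = np.zeros((len(representatives), len(referendums)))
    for rep, ref, vote in votes:
        matrix[rep_idx[rep], ref_idx[ref]] = VOTE_VALUES[vote]
    return matrix

# Toy example: two representatives, two referendums
m = build_vote_matrix(
    [("A", "ref1", "for"), ("B", "ref1", "against"), ("A", "ref2", "abstained")],
    ["A", "B"], ["ref1", "ref2"])
```

<p>Note that unregistered votes also end up as 0 here, one of the caveats mentioned above.</p>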
<h2 id="Visualizing-the-vote-space" style="color: #000000">Visualizing the vote space</h2>
<p style="color: #000000">Since we have records for 58 referendums so far in the 2013-2017 period and 612 in the 2009-2013 period, we cannot directly visualize our representatives in the vote space. There are several techniques for projecting data like this into lower dimensional sub-spaces while still retaining the essential characteristics of the data. In this blog post we will project the voting data into 3 dimensions using Principal Component Analysis (PCA). There are several alternatives here, and we chose PCA for the following reasons.</p>
<ul style="color: #000000">
<li>Our data is categorical, but non-binary. It is also distributed symmetrically, with 0 representing a neutral value. One would not necessarily expect categorical data to be modeled well by PCA, but the symmetry and the consistent values involved, combined with the density of the data, suggest that PCA should behave well.</li>
<li>PCA is based on Singular Value Decomposition (SVD) of the covariance matrix and consequently has properties that are straightforward to interpret compared to methods such as Multidimensional Scaling (MDS), which seeks to preserve the individual distances between data points.</li>
<li>PCA projects along the axes with the most variance, which in our case should highlight the parts of the data where there is less agreement between parties or representatives. This disagreement is the sort of contrast that we seek to visualize.</li>
<li>PCA axes are linear combinations of the separate referendums which means we can create a comprehensible interpretation of the visualization in terms of political issues.</li>
</ul>
<p style="color: #000000">Other methods are generally harder to interpret with regard to their applicability and the resulting projections (like probabilistic methods such as ICA) or don&#8217;t highlight the contrast we want to visualize (like MDS which preserves the distance relationships between the representatives).</p>
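<p>A minimal sketch of the projection step, using plain NumPy SVD rather than our actual analysis code (the random vote matrix merely stands in for the real data):</p>

```python
import numpy as np

# Illustrative stand-in data: 20 representatives voting on 58 referendums,
# with votes encoded as 1 ("for"), -1 ("against") or 0 ("abstained").
rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(20, 58))

# PCA via SVD of the centered data matrix: project onto the 3 axes
# with the highest variance.
centered = votes - votes.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:3].T                      # one 3D point per representative
explained = (s[:3] ** 2).sum() / (s ** 2).sum()   # variance ratio retained by 3 axes
```

<p>The <code>explained</code> ratio is the explained variance measure discussed for the plots.</p>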
<h2 id="The-plots">The plots</h2>
<p>We&#8217;re just going to present the plots without any political commentary. The first plot is for the current parliamentary period while the second is for the previous period. In the first we have highlighted Miljøpartiet De Grønne, which has only a single representative in the current parliament. It is interesting to note the coherence of certain coalition partners in government and the location of the centrist parties. An important measure for the PCA projection is the explained variance ratio of the axes. This is the amount of variance that is preserved in the projected axes and is a measure of how much of the original information is retained in the projection. For our plots the total explained variance is approximately 62% and 57%, which means there is quite a bit of diversity in the voting patterns that is not shown, but enough is retained to consider the plots informative. The axes are ordered by explained variance, so the first axis explains over 40% of the variance while the remaining two explain around 10% each.</p>
<div id="attachment_3860" style="width: 610px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2013-2015.png"><img class="wp-image-3860" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2013-2015-1024x619.png" alt="no-parliament-votes-2013-2015" width="600" height="363" /></a><p class="wp-caption-text">Projection of voting patterns in the 2013-2015 parliamentary period.</p></div>
<div id="attachment_3859" style="width: 610px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2009-2013.png"><img class="wp-image-3859" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2009-2013.png" alt="no-parliament-votes-2009-2013" width="600" height="363" /></a><p class="wp-caption-text">Projection of voting patterns in the 2009-2013 parliamentary period.</p></div>
<h2 id="Interpreting-the-projection-axes">Interpreting the projection axes</h2>
<p>In and of themselves the colorful dots have a limited story to tell. If we could characterize what the axes in the graphs represent, we could draw more interesting and detailed inferences. What we will do here is mostly for illustrative purposes though, since it really requires someone knowledgeable in Norwegian parliamentary politics and procedure to build an analysis with proper grounding in actual political activity. Here we will just play around with the data. It is tempting to treat the positive and negative directions of the axes as affirmative/negative on an issue, but we would have to look at the text for each referendum to see what &#8220;for&#8221; and &#8220;against&#8221; actually mean with regard to the sentiment on each issue. In fact, the same issue tends to have both strong positive and negative components in a projected axis, which strongly suggests that there isn&#8217;t a single sentiment expressed by the axis direction.</p>
<p>Can we create a credible summary of the axes without closely studying each referendum and each issue? Each referendum is part of an issue, and it turns out each issue has a list of topics associated with it. So we can summarize how much each topic is represented in an axis by the weight of each vote on that issue in the given axis. Each axis is a linear combination of the votes, and we add up the absolute value of the weight of each vote that is part of an issue concerning the topic. To class up this blog a bit we decided to make a word cloud out of the results.</p>
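<p>The per-axis topic summary just described might be sketched like this (the data layout is hypothetical):</p>

```python
from collections import defaultdict

# Hypothetical sketch: sum the absolute PCA loading of every referendum
# tagged with a topic to get that topic's weight in the axis.
def topic_weights(component, referendum_topics):
    """component: PCA loading vector, one weight per referendum;
    referendum_topics: list of topic lists, aligned with the component."""
    weights = defaultdict(float)
    for loading, topics in zip(component, referendum_topics):
        for topic in topics:
            weights[topic] += abs(loading)
    return dict(weights)

# Toy example: two referendums; "health" picks up |0.5| + |-0.3|
w = topic_weights([0.5, -0.3], [["health"], ["health", "transport"]])
```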
<div id="attachment_3880" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-3.png"><img class="wp-image-3880 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-3-300x150.png" alt="parliament-wordcloud-3" width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 1 for the 2013-2015 plot.</p></div>
<div id="attachment_3879" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-2.png"><img class="size-medium wp-image-3879" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-2-300x150.png" alt="Word cloud for PCA component 2 for the 2013-2015 plot." width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 2 for the 2013-2015 plot.</p></div>
<div id="attachment_3878" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-1.png"><img class="size-medium wp-image-3878" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-1-300x150.png" alt="Word cloud for PCA component 3 for the 2013-2015 plot." width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 3 for the 2013-2015 plot.</p></div>
<h3 id="The-most-characteristic-topics" style="color: #000000">The most characteristic topics</h3>
<p style="color: #000000">Some topics account for most of the activity in parliament and tend to dominate the data if we look only at the volume of votes. For the present parliamentary period the data is rather sparse, so this isn&#8217;t as pronounced in the word clouds, but for the 2009-2013 period, where there is a lot more data, transportation/communication and health dominate all the axes. Can we see which topics are characteristic for an axis instead? To highlight topics that have a high overall weight in our projection in comparison to their overall presence in the referendums, we weigh each topic by its &#8220;Inverse Topic Frequency&#8221; &#8211; analogous to the Inverse Document Frequency (IDF) common in search relevance and information retrieval. This makes rarer but highly weighted topics stand out. This weighting gives us a clearer picture of how the axes differ from each other, even if it doesn&#8217;t necessarily show the ratio of influence of these topics on the projection itself.</p>
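<p>A sketch of such an &#8220;Inverse Topic Frequency&#8221; weighting under the IDF analogy (the exact scaling used for the word clouds is not spelled out, so the logarithmic form here is an assumption):</p>

```python
import math
from collections import Counter

# Assumed ITF weighting: scale each topic's axis weight by
# log(total referendums / referendums tagged with that topic),
# by analogy with IDF in information retrieval.
def itf_weights(axis_weights, referendum_topics):
    n = len(referendum_topics)
    counts = Counter(t for topics in referendum_topics for t in set(topics))
    return {topic: w * math.log(n / counts[topic])
            for topic, w in axis_weights.items()}

# A frequent topic ("health", 3 of 4 referendums) is damped relative to a
# rare one ("transport", 1 of 4), even though its raw axis weight is higher.
itf = itf_weights({"health": 0.8, "transport": 0.3},
                  [["health"], ["health"], ["health"], ["transport"]])
```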
<div id="attachment_3881" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-4.png"><img class="size-medium wp-image-3881" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-4-300x150.png" alt="Contrastive word cloud for PCA component 1 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 1 for the 2009-2013 plot.</p></div>
<div id="attachment_3882" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-5.png"><img class="size-medium wp-image-3882" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-5-300x150.png" alt="Contrastive word cloud for PCA component 2 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 2 for the 2009-2013 plot.</p></div>
<div id="attachment_3883" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-6.png"><img class="size-medium wp-image-3883" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-6-300x150.png" alt="Contrastive word cloud for PCA component 3 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 3 for the 2009-2013 plot.</p></div>
<h2 id="Wrapping-up">Wrapping up</h2>
<p>While the plots shown here are both interesting and entertaining, there are many possibilities available when creating these visualizations and interpreting them. Consequently, while they can be very helpful during analysis and point out interesting directions one might not have noticed otherwise, a purely quantitative analysis can fall prey to &#8220;researcher degrees of freedom&#8221; and steer conclusions towards predetermined biases, or worse. External validation and domain expertise can help ground models and inferences in reality. With the help of a domain expert we could shape the data into a format more amenable to analysis, for example by weighting issues by importance and normalizing the vote sentiment across referendums. This would provide us with a much firmer foundation for making inferences than the raw data by itself, as we have done here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Elasticsearch calculates significant terms</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comments</comments>
		<pubDate>Wed, 10 Jun 2015 11:02:28 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[lexical analysis]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[significant terms]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785</guid>
		<description><![CDATA[Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the commonly uncommon. This is unfortunate since developing informative [...]]]></description>
				<content:encoded><![CDATA[<div id="attachment_3823" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon-300x187.png" alt="The &quot;uncommonly common&quot;" width="300" height="187" class="size-medium wp-image-3823" /></a><p class="wp-caption-text">The magic of the &#8220;uncommonly common&#8221;.</p></div>
<p>Many of you who use Elasticsearch may have used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html" title="significant terms">significant terms aggregation</a> and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the commonly uncommon. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in &#8211; garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed especially for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.</p>
<p>The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, reveals the following scoring function:</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH+%3D+%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%26+p_%7Bfore%7D+-+p_%7Bback%7D+%3E+0+%5C%5C++0++%26+elsewhere++%5Cend%7Bmatrix%7D%5Cright.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' title='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' class='latex' />
<p>Here the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is the frequency of the term in the foreground (or query) document set, while <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is the term frequency in the background document set which by default is the whole index.</p>
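<p>In plain Python the score reads as follows (a sketch of the formula above, not Elasticsearch&#8217;s actual Java implementation):</p>

```python
# JLH score: p_fore and p_back are the term's relative frequencies in the
# foreground (query) and background (whole index) document sets.
def jlh(p_fore, p_back):
    if p_fore - p_back <= 0:
        # defined as zero unless the term is more frequent in the foreground
        return 0.0
    return (p_fore - p_back) * (p_fore / p_back)

jlh(0.1, 0.01)  # a term 10x more frequent in the foreground scores highly
jlh(0.01, 0.1)  # a term rarer in the foreground scores zero
```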
<p>Expanding the formula gives us the following which is quadratic in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>
<img src='http://s0.wp.com/latex.php?latex=++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' title='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' class='latex' />
<p>By keeping <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> fixed and keeping in mind that both it and <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> are positive, we get the following function plot. Note that <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is unnaturally large for illustration purposes.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed-300x206.png" alt="JLH-pb-fixed" width="300" height="206" class="alignnone size-medium wp-image-3792"></a></p>
<p>On the face of it this looks bad for a scoring function. The sign change is undesirable, but more troublesome is the fact that the function is not monotonically increasing.</p>
<p>The gradient of the function:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cnabla+JLH%28p_%7Bfore%7D%2C+p_%7Bback%7D%29+%3D+%5Cleft%28%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+-+1+%2C+-%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%5E2%7D%5Cright%29++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' title='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' class='latex' />
<p>Setting the gradient to zero, we see from the second coordinate that the JLH score has no stationary point: the second coordinate is always negative, and only vanishes as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> approaches zero, while the function itself is undefined when <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is zero. The first coordinate shows us where the function is not increasing.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D++-+1+%26+%3C+0+%5C%5C++p_%7Bfore%7D+%26+%3C+%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' title='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' class='latex' />
<p>Fortunately the decreasing part of the function lies in the region where <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D+-+p_%7Bback%7D+%3C+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore} - p_{back} &lt; 0' title='p_{fore} - p_{back} &lt; 0' class='latex' /> and the JLH score is explicitly defined as zero. By the symmetry of the quadratic around its vertex at <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2}p_{back}' title='\frac{1}{2}p_{back}' class='latex' />, where the first coordinate of the gradient vanishes, we also see that the entire region where the score is below zero lies in this area.</p>
<p>With this it seems sensible to simply drop the linear term of the JLH score and use only the quadratic part. This results in the same ranking, with a slightly less steep increase in score as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> increases.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH_%7Bmod%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' title='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' class='latex' />
<p>Looking at the level sets of the JLH score, there is a quadratic relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. Solving for a fixed level <img src='http://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we get:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D%5E2+-+p_%7Bfore%7D%5Ccdot+p_%7Bback%7D+-+k%5Ccdot+p_%7Bback%7D++%3D+0+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Cfrac%7Bp_%7Bback%7D%7D%7B2%7D+%5Cpm+%5Cfrac%7B%5Csqrt%7Bp_%7Bback%7D%5E2+%2B+4+%5Ccdot+k+%5Ccdot+p_%7Bback%7D%7D%7D%7B2%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{fore}\cdot p_{back} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{fore}\cdot p_{back} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' class='latex' />
<p>Here the negative root lies outside the area where the function is defined.<br />
This is far easier to see in the simplified formula.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Csqrt%7Bk+%5Ccdot+p_%7Bback%7D%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' class='latex' />
<p>An increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> must be offset by approximately a square root increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to retain the same score.</p>
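<p>A quick numeric check of this level-set relation for the simplified score:</p>

```python
import math

# Simplified score JLH_mod = p_fore^2 / p_back: doubling p_back requires only
# a sqrt(2) increase in p_fore to stay on the same level set k.
def jlh_mod(p_fore, p_back):
    return p_fore ** 2 / p_back

k = jlh_mod(0.04, 0.01)
same_k = jlh_mod(0.04 * math.sqrt(2), 0.02)  # same level after the offset
```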
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour-300x209.png" alt="JLH-contour" width="300" height="209" class="alignnone size-medium wp-image-3791"></a></p>
<p>As we see, the score increases sharply and quadratically as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> grows relative to <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. As <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> becomes small compared to <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, the growth in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> goes from linear to squared.</p>
<p>Finally a 3D plot of the score function.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d-300x203.png" alt="JLH-3d" width="300" height="203" class="alignnone size-medium wp-image-3790"></a></p>
<p>So what can we take away from all this? I think the main practical consideration is the squared relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, which means that once there is a significant difference between the two, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> will dominate the score ranking. The <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> factor primarily makes the score sensitive when it is small, and for reasonably similar <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> values, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would want a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.</p>
<p>The results and visualizations in this blog post are also available as an <a href="https://github.com/andrely/ipython-notebooks/blob/master/JLH%20score%20characteristics.ipynb" title="JLH score characteristics">IPython notebook</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using iPython notebooks and Pycharm together</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/#comments</comments>
		<pubDate>Mon, 11 May 2015 11:12:24 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[autoreload]]></category>
		<category><![CDATA[ipython notebook]]></category>
		<category><![CDATA[pycharm]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3626</guid>
		<description><![CDATA[IPython notebooks have become an indispensable tool for many Python developers. They are a reasonably good environment for interactive computing, can contain inline data visualisations and can be hosted remotely for sharing results or working together with other developers. In many academic environments and increasingly in industry IPython notebooks are used for data visualisation work [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-7.png" alt="ipython-blog-7" width="282" height="70" class="aligncenter size-large wp-image-3628" /><br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-6.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-6.png" alt="ipython-blog-6" width="600" class="aligncenter size-full wp-image-3627" /></a></p>
<p />
IPython notebooks have become an indispensable tool for many Python developers. They are a reasonably good environment for interactive computing, can contain inline data visualisations and can be hosted remotely for sharing results or working together with other developers. In many academic environments and increasingly in industry IPython notebooks are used for data visualisation work and exploratory programming, depending on the IPython interactive environment for fast prototyping of ideas.</p>
<p />
As nice an environment as IPython is, I often wish for the features of a full-fledged IDE. Here at Comperio we use PyCharm a lot, which has excellent code editing, semantic completion, a graphical debugger and efficient code navigation capabilities. In this blog post I’m going to show how you can simultaneously work on code in both the IDE and an IPython notebook or interactive shell while keeping the running notebook and IDE project in sync.</p>
<p />
Hey, PyCharm already has IPython notebook integration. What about that? Personally I find that the IPython notebook integration in the latest PyCharm (version 4.0.6) still isn’t adequate for serious work. You get the completion and code navigation from PyCharm, but editing and navigation are reduced to half a dozen buttons. Furthermore, some functionality such as debugging appears to be plainly non-functional. Regardless, there are other very nice IDEs for Python such as Wing or Eclipse, and the approach here will work equally well with them.</p>
<p />
This cunning recipe consists of two spicy ingredients. Both are neat tricks on their own, but together they form a smooth workflow bridging exploratory programming and more structured software engineering. We are going to:</p>
<p />
<ul>
<li>Install our code as an editable Pip package.</li>
<li>Use the IPython autoreload extension to dynamically reload code.</li>
</ul>
<p />
So let’s get cooking!</p>
<p />
<h2>Editable Pip packages</h2>
<p />
We are going to organise our code in a Python package and install it with Pip using the <code>-e</code> or <code>--editable</code> option. This installs the package in place, pointing to our project directory, so that we are always importing the code that we are editing. We could also accomplish this with some hacking on <code>sys.path</code> or <code>PYTHONPATH</code>, but having our code available as a package is a lot more seamless. It makes sense to use <code>virtualenv</code> (or EPD/Anaconda environments) to isolate your system Python from your development packages.</p>
<p />
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-1.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-1-300x190.png" alt="ipython-blog-1" width="300" height="190" class="aligncenter size-medium wp-image-3620" /></a></p>
<p />
First we create a Python project in PyCharm and add a source folder with a <code>setup.py</code> defining a basic Python package.</p>
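<p>For reference, a minimal <code>setup.py</code> along these lines could look like the following (the package name and version are hypothetical placeholders):</p>

```python
# setup.py -- a minimal package definition (hypothetical name and version)
from setuptools import setup, find_packages

setup(
    name="mypackage",
    version="0.1.0",
    packages=find_packages(),
)
```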
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-2.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-2-300x194.png" alt="ipython-blog-2" width="300" height="194" class="aligncenter size-medium wp-image-3621" /></a></p>
<p />
Then we create a stub file with the following code in our Python module, and a folder for our notebooks.</p>
<p />
<p></p><pre class="crayon-plain-tag">def get_page():
    print "Don't know how to do this yet."</pre><p> </p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-3.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-3-300x158.png" alt="ipython-blog-3" width="300" height="158" class="aligncenter size-medium wp-image-3622" /></a></p>
<p />
And we activate our <code>virtualenv</code>/Conda environment and run <code>pip install -e</code> with the path to our package (typically <code>pip install -e .</code> from the project root).</p>
<p />
<strong>Pip and Git: </strong>If you install your package with <code>-e</code> from a Git repository it may think that you want to install from Git even if you&#8217;re giving it a file path. This is usually not what you want when you&#8217;re developing since you would have to commit your code for the package to update itself. An ugly but practical way to avoid this behavior is to move the .git folder out of the way when installing the package.</p>
<p />
<h2>%autoreloading code</h2>
<p />
Now to the important part which is dynamically loading the IDE project into our IPython notebook. Let’s first fire up the notebook.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-4.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-4-300x71.png" alt="ipython-blog-4" width="300" height="71" class="aligncenter size-medium wp-image-3623" /></a></p>
<p />
Start iPython and create a notebook.</p>
<p />
You have probably used reload(module) to update the Python environment at runtime. This hardly ever works for more than five minutes and results in an inscrutable mess of old and new stuff in your modules and classes. There is, however, a bag of neat tricks taking care of at least the majority of the problems around reloading Python code or modules (see http://pyunit.sourceforge.net/notes/reloading.html), and the IPython developers have collected these into their autoreload extension. Let’s look at it in action.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-5.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-5-300x137.png" alt="ipython-blog-5" width="300" height="137" class="aligncenter size-medium wp-image-3624" /></a></p>
<p />
Here we set up the autoreload module and import our stub function in the first two cells. In the third we run our function. We then change the function definition in the IDE and save the file.</p>
<p></p><pre class="crayon-plain-tag">def get_page():
    print "Hey I'm updated."</pre><p> </p>
<p />
<p>And when we run the function again in cell four the updated code is run.</p>
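<p>As a rough illustration of the mechanism the autoreload extension automates, here is a minimal plain-Python sketch (using Python 3 syntax, a throwaway <code>stub</code> module and <code>importlib.reload</code>; in the notebook, the magics <code>%load_ext autoreload</code> and <code>%autoreload 2</code> take care of this for you before each cell runs):</p>

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway stub module, mirroring the get_page() stub above
# (returning instead of printing so the effect is easy to check).
tmp = tempfile.mkdtemp()
stub_path = os.path.join(tmp, "stub.py")
with open(stub_path, "w") as f:
    f.write('def get_page():\n    return "Don\'t know how to do this yet."\n')

sys.path.insert(0, tmp)
import stub
print(stub.get_page())

# Simulate editing the file in the IDE and saving it ...
with open(stub_path, "w") as f:
    f.write('def get_page():\n    return "Hey I\'m updated."\n')

# importlib.reload re-executes the module source in place; autoreload
# performs this step automatically for changed modules.
importlib.reload(stub)
print(stub.get_page())
```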
<p />
<h2>From notebook to project and back</h2>
<p />
Combining editable Pip packages and the autoreload module, we have a way to seamlessly load our project code in the notebook. When we are ready to move our exploratory programming back to the project, we can move our code over, import any new definitions and refine our implementation while using it in our notebook. In this way we can quickly move from noodling around in the notebook to developing and testing in the IDE, and back to the notebook to use our project code in further unstructured meanderings.</p>
<p />
In the next post we will demonstrate this in more detail.</p>
<p />
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Beer and searching at Elasticon</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/09/beer-searching-elasticon/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/09/beer-searching-elasticon/#comments</comments>
		<pubDate>Sun, 08 Mar 2015 22:58:16 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Chinatown]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[home grown hops and rare yeasts]]></category>
		<category><![CDATA[Liars Dice]]></category>
		<category><![CDATA[norwegian]]></category>
		<category><![CDATA[Nøgne Ø IPA]]></category>
		<category><![CDATA[Sierra Nevada pale ale]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3379</guid>
		<description><![CDATA[Christoffer was pacing angrily back and forth, Nøgne Ø IPA in his left hand, phone in the other. I was looking at the long list of cancellations, including our connecting flight to Arlanda on the way to SF. The Norwegian strike was hitting hard with nearly no planes flying in Europe. Arlanda airport is [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150307_005509-3.jpg"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150307_005509-3-300x219.jpg" alt="IMG_20150307_005509 (3)" width="300" height="219" class="alignnone size-medium wp-image-3380" /></a></p>
<p />
Christoffer was pacing angrily back and forth, Nøgne Ø IPA in his left hand, phone in the other. I was looking at the long list of cancellations, including our connecting flight to Arlanda on the way to SF. The Norwegian strike was hitting hard with nearly no planes flying in Europe.</p>
<p />
Arlanda airport is fairly described as the butthole of the world, filled up with angry Swedes and featuring the world’s slowest transfer security gate. We had booked the flight fully aware of the pain and the risks, and now the plane wasn’t even going to land there.</p>
<p />
&#8220;Listen up,&#8221; Christoffer was growling between gulps of strong IPA. &#8220;There’s no way we’ll be planted in your crap airport for over 12 hours, bullshit strike or not.&#8221;</p>
<p />
&#8220;Calm down&#8221; I looked up at the furious bearded giant. &#8220;You’re rattling my nerves.&#8221;</p>
<p />
&#8220;Besides, we still have the CEO’s credit card. Get some other plane and upgrade our tickets to business class while you’re at it. I’ll need some rest and proper legspace after all this crap.&#8221;</p>
<p />
&#8230;</p>
<p />
Out of breath after a mad dash to the airport service desk with Christoffer’s 60 pound America suitcase in tow, we were swaying dangerously with a Nøgne Ø beer heavy on our breath.</p>
<p />
&#8220;We need tickets for the next plane to Arlanda. It’s of the utmost importance that we reach Arlanda as quickly as possible!&#8221;</p>
<p />
The man at the counter tried his best to ignore us. Maybe we skipped the line, I don’t know.</p>
<p />
&#8220;Listen up, we are going to Elasticon in SF. The most degenerate collection of search professionals in the world. It&#8217;s a nasty assignment but somebody has to do it. We need to catch our flight from Arlanda and all this strike bullshit has left me with no patience!&#8221;</p>
<p />
The man at the desk was sensing an ugly scene. Two mean IT bums with their CEO’s card and shot nerves. He ignored the shouting and commotion behind us and got to work.</p>
<p />
&#8230;</p>
<p />
Feet up in bclass seats on our way to SF we finally got some rest, Christoffer scanning the bclass microbrewery selection with a critical eye. &#8220;I need an imperial stout! In case of turbulence you know, need the extra weight. Maybe two or three even.&#8221; &#8220;Suit yourself,&#8221; I said, nipping a light lager while Christoffer was waving and hollering at the flight attendants. &#8220;Turbulence can be heavy shit for sure.&#8221;</p>
<p />
&#8230;</p>
<p />
Coming down in SF we knew we had to tighten up for the nasty customs procedure awaiting us. Christoffer’s America suitcase was another concern, but we were banking on his large collection of home grown hops and rare yeasts not triggering any bomb sensors or something. The trick is to stomp your toe into something right before approaching the customs official. This will give you the steely stare and tight grimace needed to pass muster in front of the customs official asking why you’d ever come here and if you have the means to butt yourself out before you become too much of a nuisance. &#8220;It’s a conference,&#8221; I wheezed through gritted teeth. &#8220;But it might as well be a madhouse. These people are serious business.&#8221; The man behind the counter wasn’t really satisfied but didn’t press any further, and we hobbled on to pick up Christoffer’s gigantic trunk at arrivals.</p>
<p />
&#8230;</p>
<p />
Eleven hours of dry recycled air and flat beers can break any man, and stumbling out of Oakland airport we knew we were in particularly bad shape. After his half-dozen heavy stouts Christoffer had entered a nasty beer-induced coma and snored like an annoyed elephant for nine consecutive hours. Now he was of course wide awake and ready for action, while the rest of us hadn’t slept for 28 hours or so. Jetlag was about to do a double trick on us and I knew we had to take action.</p>
<p />
There are only two remedies available to a man facing the disorientation and confusion of a serious jetlag: either desperately stay awake long enough, or induce a serious comatose sleep by any means necessary.<br />
&#8220;If we hurry we can get into some dive before it’s too late. Load you up and then get you back to the hotel so I can get some proper rest,&#8221; I said. A plan I deeply regretted while trying to keep myself together in a dingy Liars Dice den in Chinatown, watching Christoffer enjoy a Sierra Nevada pale ale. The sharp rattling of dice cups among huge piles of dollar bills on the bar counter was jangling my nerves, leaving me with a sense that we had embarked on something quite different from what we had signed up for &#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/09/beer-searching-elasticon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comperio goes to Elasticon</title>
		<link>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/#comments</comments>
		<pubDate>Fri, 27 Feb 2015 14:24:38 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[elasticon elk elasticsearch kibana analytics]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3358</guid>
		<description><![CDATA[Elasticon, the first Elasticsearch user conference, is coming in a couple of weeks. Hosted in San Francisco, the agenda promises a lot of interesting use cases and in-depth information about Elasticsearch and the ELK (Elasticsearch, Logstash, Kibana) analytics stack. It is an 11 hour plane trip away, but here at Comperio we consider Elasticsearch one of [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/elasticon.png"><img class="alignnone size-medium wp-image-3359" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/elasticon-300x120.png" alt="elasticon" width="300" height="120" /></a></p>
<p />
<a href="http://elasticon.com">Elasticon</a>, the first Elasticsearch user conference, is coming up in a couple of weeks. Hosted in San Francisco, the conference promises a lot of interesting use cases and in-depth information about Elasticsearch and the ELK (Elasticsearch, Logstash, Kibana) analytics stack. It is an 11 hour plane trip away, but here at Comperio we consider Elasticsearch one of the most exciting developments in search today. Not only is it a scalable and flexible search platform, it is also at the forefront of combining data analytics, text mining and information retrieval in a single scalable and cohesive platform. So it doesn&#8217;t matter that Elasticon is nearly on the other side of the planet; we can&#8217;t really miss this opportunity to get on top of the latest developments surrounding Elasticsearch.</p>
<p />
<p>At Comperio we&#8217;re happy to see that search is becoming so much more than it used to be. Elasticsearch has proven to be a platform that not only does search well, but also integrates documents with data in a way that enables information-oriented applications. At the front of this wave of new applications is the ELK stack, which makes it easy to build a complete pipeline for analytics. Search analytics, system monitoring and web analytics are all areas where a realtime reporting platform can be built on top of ELK. The data analytics capabilities of Elasticsearch also allow us to build insight-driven applications for our customers, combining offline and realtime text analysis and data aggregation with web-based visualization built on D3.js. Especially the realtime queries have made it possible for users to drill down into and compare specific pieces of data in an exploratory manner. These new applications require not only a fast and scalable technology core, but also solid insight into search technology and an ability to modify it for specific requirements.</p>
<p />
At Elasticon it is no surprise that there is a huge focus on all aspects of ELK. We&#8217;re looking forward to seeing how other companies are adapting the ELK stack to their projects and getting new ideas for how we can help our customers bring out the value in their data with the ELK software. There are also some sessions on Shield, the new access security subsystem for Elasticsearch. This is a long-requested component which will surely make integration easier in many projects. We&#8217;re also looking forward to the sessions on Elasticsearch and ELK internals, as we are now looking into several extensions that we&#8217;d like to implement in Elasticsearch.</p>
<p />
Among the sessions we&#8217;re looking forward to are the ELK use cases, especially <em>&#8220;Tackling Security Logs with the ELK&#8221;</em> and <em>&#8220;The ELK &amp; The Eagle: Search &amp; Analytics for the US Government&#8221;</em>, as well as ELK internals sessions such as <em>&#8220;The Contributor&#8217;s Guide to the Kibana&#8221;</em> and <em>&#8220;Life of an Event in Logstash&#8221;</em>, in addition to <em>&#8220;Elasticsearch Architecture: Amusing Algorithms and Details on Data Structures&#8221;</em>. Building data-centric applications is also covered in <em>&#8220;Using Elasticsearch to Unlock an Analytical Goldmine&#8221;</em>, <em>&#8220;Navigating Through the World&#8217;s Encyclopedia&#8221;</em> and <em>&#8220;The ELK Stack for Time Series Data&#8221;</em>, which we hope can give us some fresh viewpoints on using Elasticsearch in our projects related to intelligence and data mining.</p>
<p />
Hope to see some of you in San Francisco :)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some CSV import tricks in Neo4j</title>
		<link>http://blog.comperiosearch.com/blog/2015/02/04/csv-import-tricks-neo4j/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/02/04/csv-import-tricks-neo4j/#comments</comments>
		<pubDate>Wed, 04 Feb 2015 12:46:57 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3242</guid>
		<description><![CDATA[The CSV file import facility in Neo4J is interesting in that it allows you to run Cypher queries iteratively over your dataset. This gives us a lot of flexibility and relieves us of the need for transforming our data to a Neo4J specific format. We can export tables with for example foreign keys to other [...]]]></description>
				<content:encoded><![CDATA[<p>The CSV file import facility in Neo4j is interesting in that it allows you to run Cypher queries iteratively over your dataset. This gives us a lot of flexibility and relieves us of the need to transform our data into a Neo4j-specific format. We can for example export tables with foreign keys to other tables and reconstruct our relationships during import. The key to doing this successfully is adding nodes and relationships on demand with the MERGE clause.</p>
<p />
There are some quite good resources on CSV import in Neo4j already. Instead of repeating them we&#8217;ll just drop the links here:</p>
<p />
<a href="http://neo4j.com/docs/stable/query-load-csv.html">http://neo4j.com/docs/stable/query-load-csv.html</a><br />
<a href="http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/">http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/</a><br />
<a href="http://blog.graphenedb.com/blog/2015/01/13/importing-data-into-neo4j-via-csv/">http://blog.graphenedb.com/blog/2015/01/13/importing-data-into-neo4j-via-csv/</a><br />
<a href="http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/">http://jexp.de/blog/2014/06/using-load-csv-to-import-git-history-into-neo4j/</a></p>
<p />
<p>To recap the main points:</p>
<p />
<ul>
<li>Set up constraints and indexes before importing. Without them the import queries will be very slow.</li>
<li>Use USING PERIODIC COMMIT to speed up the importing process.</li>
<li>Make sure your heap is big enough, especially if committing multiple queries. It is recommended that the entire dataset fits in the cache (see <a href="http://neo4j.com/docs/stable/configuration-caches.html#_file_buffer_cache">http://neo4j.com/docs/stable/configuration-caches.html#_file_buffer_cache</a>).</li>
<li>Values are read as strings and must be converted before insertion.</li>
<li>You can provide defaults with the COALESCE clause.</li>
<li>Use MERGE and MATCH to add data duplicated across tables or in denormalized tables.</li>
</ul>
<p />
<p>With these main points in mind we&#8217;ll look at some examples and tricks.</p>
<p />
<h3>MERGE instead of CREATE UNIQUE</h3>
<p>In general it seems better to use MERGE rather than CREATE UNIQUE. The latter clause tends to have some surprising corner cases, while MERGE on the whole behaves far more predictably. It is important to remember though that MERGE will create every node in a pattern if the whole pattern doesn&#8217;t match, as long as no constraints are violated. Often that is not what you want; instead you need to use MATCH to find the existing part of a pattern and call MERGE on just the part of the pattern you intend to create.<br />
In this example we don&#8217;t have uniqueness constraints on the user property, and matching the users inside the MERGE would create duplicate :USER nodes.</p>
<blockquote><p>USING PERIODIC COMMIT 1000<br />
LOAD CSV WITH HEADERS FROM &#8220;file://blabla.csv&#8221; AS csvLine<br />
MATCH (uf :USER { twitter_name: toInt(csvLine.follower_name) })<br />
MATCH (u :USER { twitter_name: toInt(csvLine.twitter_name) })<br />
MERGE uf -[:FOLLOWS]-&gt; u</p></blockquote>
<h3>Skipping incomplete rows</h3>
<p>This is rather straightforward with a WHERE keyword, but it is made more complicated than it needs to be because WHERE has to be attached to a clause. The most straightforward way to introduce such a clause is with the WITH statement, but as WITH separates parts of the query we need to parse the whole CSV line here and bind its values in the clause. Accessing the CSV line object won&#8217;t work within the WHERE clause.</p>
<blockquote><p>USING PERIODIC COMMIT 1000<br />
LOAD CSV WITH HEADERS FROM &#8220;file://blabla.csv&#8221; AS csvLine<br />
WITH toInt(csvLine.twitterid) as twitterid, toInt(csvLine.seat) as seat, csvLine.firstname as firstname,<br />
csvLine.lastname as lastname, csvLine.party as party, csvLine.region as region, csvLine.type as type<br />
WHERE twitterid IS NOT NULL<br />
MERGE (u :USER { seat: seat, firstname: firstname, lastname: lastname, party: party, region: region,<br />
type: type, twitterid: twitterid })</p></blockquote>
<h3>Accumulating relationship strength</h3>
<p>In many cases the rows in a CSV file describe a list of relationships where duplicate lines indicate a recurring, stronger relationship. Here we&#8217;d rather not create multiple links and then have to count the link cardinality to get the strength between two nodes, but instead accumulate the repeated relationships in a property on the relationship. This is easy using the ON MATCH and ON CREATE parts of a MERGE clause.</p>
<blockquote><p>USING PERIODIC COMMIT 1000<br />
LOAD CSV WITH HEADERS FROM &#8220;file://blabla.csv&#8221; AS csvLine<br />
MATCH (u :USER { twitterid: toInt(csvLine.twitterid) })<br />
MERGE (h :HASHTAG { text: csvLine.text})<br />
MERGE u -[t :TWEETED]-&gt; h<br />
ON MATCH SET t.count = t.count + 1 ON CREATE SET t.count = 1</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/02/04/csv-import-tricks-neo4j/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
