In the post below I looked at the elements of tweets about #AFL and #NRL, but the point of twitter is its connected nature. If tweets have an impact on people then they are retweeted. This allows us to look at the way tweets are shared within the communities that follow them. And, from that, we get a sense of how connected a community is by a topic.
To do this I used a program called Gephi which is a graphing program designed specifically for graphing networks. When I ran the data from the post below about #AFL and #NRL collected last weekend I found something really interesting. The #AFL tweets were retweeted 73 times. But the #NRL tweets were retweeted 149 times.
So, there are roughly twice as many retweet links between the tweeters about rugby as there are between the tweeters about AFL (149 versus 73), which suggests that the rugby community is better connected than the AFL community. The results are below:
#AFL twitter community
#NRL twitter community
Quite a few other interesting things about the football twitter communities are revealed by these graphs. It's pretty clear that the big media companies (especially Fox) have a big impact on the flow of information. On the right of the #AFL graph you can see @Foxsportsnow dominates that part of the network, as does @Foxnrllive at the bottom left of the #NRL graph. Another interesting thing about the structure of these networks is how poorly interconnected they are. The flow of information is highly directional, flowing out from the strong corporate media nodes to individual users. But these individual users seem not to retweet each other’s tweets to the same extent.
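For anyone who wants to count retweet links before loading them into a graphing program, here is a minimal sketch in Python. It is not the workflow used for this post. It assumes each tweet is a dictionary with hypothetical `user` and `text` fields, and that retweets use the classic "RT @name: ..." form. The sample tweets and account names are made up.

```python
import re
from collections import Counter

# Assumption: retweets start with "RT @name". Other retweet formats are ignored.
RT_PATTERN = re.compile(r"^RT @(\w+)")

def retweet_edges(tweets):
    """Return (original_author, retweeter) pairs, one per retweet."""
    edges = []
    for tweet in tweets:
        match = RT_PATTERN.match(tweet["text"])
        if match:
            edges.append((match.group(1).lower(), tweet["user"].lower()))
    return edges

def retweeted_counts(edges):
    """Count how many times each account was retweeted."""
    return Counter(source for source, _ in edges)

# Made-up sample data, for illustration only.
sample = [
    {"user": "fan_one", "text": "RT @MediaOutlet: Full time score #NRL"},
    {"user": "fan_two", "text": "RT @MediaOutlet: Full time score #NRL"},
    {"user": "fan_three", "text": "Great game tonight #NRL"},
    {"user": "fan_one", "text": "RT @fan_three: Great game tonight #NRL"},
]

edges = retweet_edges(sample)
counts = retweeted_counts(edges)
print(len(edges), "retweet links")
print(counts.most_common())
```

The edge list is directional (from the original author to the retweeter), so an account that dominates the counts plays the same role as the big media nodes in the graphs.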
Perhaps collecting tweets over a longer period of time would allow the person to person links in the network to be revealed…
So, my investigation of football social media in Australia continues. In the post below I showed that @abcgrandstand tweeted more about #AFL than #NRL, creating a bias in the reporting of football codes in Australia. However, it was suggested to me that ‘maybe Australians are just more interested in AFL than NRL’. But this sounded like an empirical question for which data could be collected and an answer obtained.
So, I used twitter’s streaming API to collect 300 tweets that included #AFL and 300 tweets that included #NRL on 16/8/2014 and the results are in:
In fact, more people were tweeting about #NRL than #AFL, showing that rugby is of more interest to Australians than AFL.
But there is another interesting element in the data. When I split the results up by the location of the tweeter it shows that #NRL was tweeted about more than #AFL in both Sydney and Melbourne. So, even in the home of the AFL, people are more interested in rugby.
I’ve thought a bit more about the bias revealed in the analysis below in which it was shown that the official ABC sport news twitter handle reports more stories about AFL than NRL. While this quantitative analysis has revealed something meaningful about the bias in sport reporting, it might be even more useful to look closely at the content of the tweets to see if the ABC’s tweets about the AFL are not only more frequent, but also more positive.
So, I’ve written this code which parses the tweets into words and then plots the frequency of the most used words (I restricted the plot to avoid the most commonly used words such as ‘and’, ‘the’, etc).
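The idea can be sketched in a few lines of Python. This is not the code used for the post: the stop-word list is a short made-up stand-in, the sample tweets are invented, and the plotting step is left out.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "a", "to", "of", "in", "for", "on", "at", "is"}

def word_frequencies(tweets, stop_words=STOP_WORDS):
    """Count words across tweets, skipping links and common stop words."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop URLs
        words = re.findall(r"[a-z']+", text)               # keep letter runs
        counts.update(w for w in words if w not in stop_words)
    return counts

# Made-up sample tweets, for illustration only.
tweets = [
    "Sydney win the derby in style",
    "Live: Sydney lead at half time",
    "The win puts Sydney on top",
]

counts = word_frequencies(tweets)
top_words = counts.most_common(2)
print(top_words)
```

The `most_common` list is what would be passed to a bar chart to produce plots like the ones in this post.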
This is the result for @abcgrandstand’s tweets about the AFL:
And this is the result for @abcgrandstand’s tweets about the NRL:
You can see that a lot is very similar. There are words that occur frequently in both lists that would crop up in the reporting of any sport event on the planet such as “sports”, “win”, “live” and “lead”.
But there are two interesting differences: The AFL-related tweets have “Melbourne” as a frequently occurring word and the NRL-related tweets have “Sydney” as a frequently occurring word. So, despite the efforts of both codes to cross state boundaries, there is still a strong association between the NRL and its following in Sydney and the AFL and its following in Melbourne.
The other interesting difference is that one of the frequently occurring words in the NRL tweets (but not the AFL tweets) is “action”. Which just goes to show, no matter what, rugby just has more action than AFL.
For a long time, various voices across the Australian political landscape (such as Senator James McGrath) have accused the Australian Broadcasting Corporation of bias. It turns out that the data is pretty clear: the ABC does not report on the Labor Party more favorably than the Liberal Party.
However, I suspect there is in fact bias at the ABC. Not in the reporting of politics, but in the reporting of sport. I believe the ABC is biased to run more stories about Australian Rules Football than about Rugby.
But, unlike Sen. McGrath and co. I’m not going to make idle claims…I’ll check the data.
To do this, I needed a data set. The ABC runs a great twitter handle (@ABCgrandstand) which tweets about all the new stories on the ABC.net.au sport pages. And, there is a great function available on the MATLAB File Exchange called twitty which interfaces with the twitter API and pulls out lots of stuff that you might want to run your analysis on. One of the functions the twitter API has is called user_timeline and it pulls out every tweet a specified user has posted (with some limits).
So, I used twitty and wrote this code to pull out all the tweets from @ABCgrandstand between 7 June and 18 July (there were 2107 tweets in this window). I then checked each tweet for the strings ‘AFL’ and ‘NRL’. And this was the result:
There is a bias! The ABC reports on more stories about the AFL than the NRL.
So, it's clear where the bias is in the ABC!
The next step will be to check the sentiment of the tweets about the AFL and NRL to see if the ABC reports more favorably about the AFL too.
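The counting step described above (checking each tweet for the strings ‘AFL’ and ‘NRL’) can be sketched as follows. This is a Python illustration rather than the MATLAB script used for the post. It assumes the tweets have already been downloaded into a list of strings, and the sample tweets are made up.

```python
def count_mentions(tweets, keywords=("AFL", "NRL")):
    """Count how many tweets contain each keyword (case-sensitive substring match)."""
    return {keyword: sum(keyword in text for text in tweets) for keyword in keywords}

# Made-up sample tweets, for illustration only.
tweets = [
    "AFL: late goal decides a thriller",
    "NRL: late try seals the win",
    "Cricket: rain delays play",
    "Live AFL scores from the weekend",
]

mentions = count_mentions(tweets)
print(mentions)
```

Because the match is case-sensitive, a tweet that only says “afl” in lower case would not be counted; whether that matters depends on how the account writes its tweets.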
I was talking to my friend Nathan (he of the whimsical p value Doge and the informative p value research) about teaching correlations to undergraduates. We thought it might help if the students could see the effects on size and significance of different data quickly – by clicking in data points themselves.
So, I wrote this code.
It opens a user interface figure with x and y axes. Students can then click in data points and watch a positive correlation grow greater in significance as more points are added but with little change to size because the points are all close to the least squares line:
Or see the effect of an outlier driving a correlation. Here there is no significant correlation between the first nine points but the addition of the tenth suddenly creates a significant correlation with a large magnitude.
This of course reminds students to visualise to check for (amongst other things) outliers in the data and not just report the numbers.
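The outlier effect can also be reproduced without a user interface. The sketch below is in Python and is not the code linked above. It uses made-up points: nine on a 3 by 3 grid (which gives a correlation of exactly zero) plus one extreme tenth point. Significance is checked against two-tailed critical t values at the 0.05 level copied from a standard t table, so only the two sample sizes used here are supported.

```python
import math

# Two-tailed critical t values at alpha = 0.05 from a standard t table,
# listed only for the degrees of freedom (n - 2) needed below.
T_CRITICAL_05 = {7: 2.365, 8: 2.306}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def is_significant(xs, ys):
    """Test r against zero using t = r * sqrt((n - 2) / (1 - r^2))."""
    n = len(xs)
    r = pearson_r(xs, ys)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return abs(t) > T_CRITICAL_05[n - 2]

# Nine made-up points on a grid: no linear relationship at all.
x_nine = [1, 2, 3, 1, 2, 3, 1, 2, 3]
y_nine = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# The same points plus a single outlier far from the rest.
x_ten = x_nine + [10]
y_ten = y_nine + [10]

r_nine = pearson_r(x_nine, y_nine)
r_ten = pearson_r(x_ten, y_ten)
print(round(r_nine, 3), is_significant(x_nine, y_nine))
print(round(r_ten, 3), is_significant(x_ten, y_ten))
```

With these points the correlation goes from zero to about 0.91 when the tenth point is added, purely because of that one point.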
So much data exists freely available on the net that we can now answer interesting questions by interrogating these data rather than recruiting participants and asking them. The advantages are numerous and several academics I respect including Prof. Dorothy Bishop have extolled the virtues of using data scraped from the internet.
So, I thought I’d give it a go. There is a great MATLAB function available from the FileExchange which downloads the content of URL tables from the internet. I pointed it at two great sites which house the results of more than a hundred years of sports matches played in Australia in Australian Rules Football and Rugby League.
I’ve had a suspicion for a few years that the home ground advantage in Australian sports has been declining. I imagine that part of what makes up a home ground advantage is the benefit of a crowd which is enthusiastically partial to the home team. However, in the 21st century, I suspect that there is more movement of people around the country so fans are often more likely to attend away games. Another part of the home ground advantage is, I suspect, the familiarity the home team has with the ground. But there are now several ‘shared’ grounds in use, particularly in the AFL.
So, I wrote this script to get the data, calculate the proportion of home side wins per season and graph it… (click on the figure to enlarge).
As you can see, it doesn’t look like there is a clear trend in either code over time. So, my suspicions are not confirmed. There is of course much more that could be done to analyse these data (points difference, attendance, the role of specific teams, etc).
But, it was fun today just to start scraping data from the internet. My next target will be a little less frivolous…
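For reference, the calculation of the proportion of home side wins per season can be sketched like this. It is a Python illustration, not the MATLAB script used for the post. It assumes the scraped results have already been reduced to (season, home score, away score) tuples, it counts draws as non-wins for the home side (which may differ from the original analysis), and the scores are made up.

```python
from collections import defaultdict

def home_win_proportion(matches):
    """Return {season: proportion of matches won by the home side}.

    matches: iterable of (season, home_score, away_score) tuples.
    Draws count as games played but not as home wins.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for season, home_score, away_score in matches:
        games[season] += 1
        if home_score > away_score:
            wins[season] += 1
    return {season: wins[season] / games[season] for season in sorted(games)}

# Made-up results, for illustration only.
matches = [
    (2000, 90, 80),
    (2000, 70, 75),
    (2001, 100, 60),
    (2001, 88, 88),
    (2001, 50, 40),
    (2001, 72, 65),
]

proportions = home_win_proportion(matches)
print(proportions)
```

Plotting the values of this dictionary against the seasons gives a line like the ones in the figure.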