How we used machine learning to cover the Australian election

During the last Australian election we ran an ambitious project that tracked campaign spending and political announcements by monitoring the Facebook pages of every major party politician and candidate.

The project, dubbed the “pork-o-meter” (after the term pork-barreling), was hugely successful in identifying distinct patterns of spending based on vote margin or incumbent party, with marginal electorates receiving billions of dollars more in campaign promises than other electorates.

All up, we processed 34,061 Facebook posts and 2,452 media releases, and published eight stories in addition to an interactive feature. We also used the same Facebook data to analyse photos posted during the campaign to break down the most common types of photo ops for each party, and how things have changed since the 2016 election.

We were able to uncover more than 1,600 election promises, amounting to tens of billions of dollars in potential spending. Our textual analysis later found almost 200 (112 in marginal seats) of the Coalition’s promises were explicitly conditional on their winning the election.
This means much of the targeted largesse might never have been widely known without our project.

Anthony Albanese speaking to the media at a Toll warehouse, annotated by an object recognition model. Photograph: Facebook

Teasing out several hundred election promises from hundreds of thousands of words is like finding a needle in a haystack, and would otherwise have been impossible for our small team in such a short time frame without applying machine learning.

Because machine learning is still something of a rarity on the reporting side of journalism (as far as I know this project is a first of its kind for the Australian media, with other ML uses mostly focused on content management systems and publishing), we thought it would be worthwhile to write a more in-depth article on the methods we used, and how we’d do things differently if we had the chance.

The problem (posts, lots of posts)

Commuter carparks. Sports rorts (version one and two). CCTV. Regional development grants.
Colour-coded spreadsheets and great big whiteboards.

The history of elections and government funding in Australia is littered with allegations and reports outlining how both major parties have directed public money towards particular areas, whether it’s to shore up marginals or reward seats held by their own members.

However, these reports often come well after the money has been promised or awarded, following audits or detailed reporting from journalists and others.

For the 2022 election we wanted to track spending and spending promises in real time, and keep track of how much money was going towards marginal seats, and how this compared with what each seat would receive if the funding was shared equally.

Facebook photo posted by Scott Morrison with annotations from Guardian Australia’s dog detection model, and a Facebook photo posted by MP Trevor Evans, showing Evans holding a giant cheque at the Channel 9 telethon. Photograph: Facebook

However, to do this we’d need to track every election announcement made by a politician, from lowly backbenchers promising $1,000 for a shed to billion-dollar pledges by party leaders.
From following leader announcements in 2016, we knew that announcements could appear in media releases, local media, and on Facebook.

We decided to focus on Facebook and media releases posted on personal and party websites.

To gather the data we used the Facebook API to collate politicians’ posts into a SQLite database, and wrote web scrapers in Python for over 200 websites to get text from media releases, which was then also stored in SQLite.

The biggest challenge was then how to pull out the posts that had funding announcements in them from the rest.

The solution (machine learning and manual labour)

The output we wanted was a final database of only election spending promises, classified into categories such as sport, community, crime and so on. Each promise would also be assigned to either a single electorate or a state or territory, depending on the location which would benefit most from the spending. This would allow the analysis by seat and party status we’d need for news stories.

Our initial approach was to classify two weeks’ worth of Facebook posts. This preliminary analysis showed some commonalities in posts and releases that contained election promises. These included references to money, mentions of specific grant programs, and some keywords. But simply selecting posts that contained these features would have missed a lot and had a very high false positive rate.

So we went with a mixed approach. We used pre-trained language models to extract keywords, geographic locations, grant program names, named entities (like the Prime Minister), and any references to money. We then manually classified 300 randomly selected posts as either containing election promises or not.
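As a rough illustration of one of the features mentioned above (references to money), here is a minimal sketch that finds dollar amounts in a post and turns them into plain numbers. It is not the project’s code: the regular expression, the `MULTIPLIERS` table and the `extract_amounts` function are all hypothetical, and a real pipeline would need to handle many more ways of writing an amount.

```python
import re

# Hypothetical pattern: a dollar sign, a number with optional commas or
# decimals, and an optional magnitude suffix ("k", "m", "million", "bn", ...).
MONEY_RE = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(billion|million|thousand|bn|b|m|k)?\b",
    re.IGNORECASE,
)

# Assumed mapping from suffix to multiplier; not taken from the article.
MULTIPLIERS = {
    "k": 1_000, "thousand": 1_000,
    "m": 1_000_000, "million": 1_000_000,
    "b": 1_000_000_000, "bn": 1_000_000_000, "billion": 1_000_000_000,
}


def extract_amounts(text):
    """Return every dollar amount mentioned in `text` as a whole number."""
    amounts = []
    for number, suffix in MONEY_RE.findall(text):
        value = float(number.replace(",", ""))
        multiplier = MULTIPLIERS.get(suffix.lower(), 1)
        amounts.append(round(value * multiplier))
    return amounts


if __name__ == "__main__":
    print(extract_amounts("$3m for the pool and $1,000 for a shed"))
```

A post with no matches returns an empty list, so the presence or absence of amounts can be used as one signal among several rather than as a filter on its own.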
We lemmatised each word (turned it into its dictionary form, removing tense, pluralisation and so on), and turned each text into a series of numbers (a word embedding, or vector).

The vectors were created using term frequency-inverse document frequency (tf-idf), which assigns values based on how common a word is in one text compared with the rest of the texts. This emphasises some of the differences between the texts and, together with cosine similarity (based on the angles between the vectors if plotted), allowed us to group posts and releases that were likely about the same topic.

Finally, we trained a logistic regression model using the posts we had already manually classified. A number of other machine learning techniques were tested, but logistic regression was consistently the most accurate for our binary classification task: election promise or not.

With the classifier trained and all the extraction scripts set up, we created a pipeline where all new posts had pertinent features extracted and then a prediction was made. Any post that had a combination of features and was predicted to contain an election promise was flagged for manual review. Any media release that was dissimilar (based on cosine similarity) from the Facebook posts was similarly processed and flagged for review. We repeatedly retrained our classifier throughout the election campaign as we got more and more confirmed data.

Once the classifier was up and running, our approach was:
Scrape Facebook posts and media releases
Run posts through our classifier and duplicate checker
Manually check posts flagged as announcements and remove duplicates, add other categories and details needed
Find any media releases that were dissimilar to the Facebook posts and process them
Manually double-check all the data before publishing
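The tf-idf and cosine similarity step in the workflow above can be sketched in plain Python. This is a minimal illustration, not the project’s code: the article does not say which library was used, the function names and the 0.5 threshold are assumptions, lemmatisation is skipped, and the logistic regression classifier is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$]+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse tf-idf vector (a dict of term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Smoothed idf, so terms found in every document keep some weight.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def releases_needing_review(facebook_posts, media_releases, threshold=0.5):
    """Return media releases that are not close to any Facebook post."""
    vectors = tfidf_vectors(facebook_posts + media_releases)
    post_vecs = vectors[:len(facebook_posts)]
    release_vecs = vectors[len(facebook_posts):]
    flagged = []
    for release, vec in zip(media_releases, release_vecs):
        best = max((cosine_similarity(vec, p) for p in post_vecs), default=0.0)
        if best < threshold:  # threshold is an illustrative guess
            flagged.append(release)
    return flagged
```

A release that closely repeats a Facebook post scores near 1 and is skipped, while one on a different topic scores near 0 and is kept for review. The right threshold would need tuning against manually checked examples.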
Things we learned

Despite the automation, this process was still time-consuming. However, we were able to run the project in a campaign week with two days of work from two journalists and an intern working three or four days (with additional time from news and political reporters on the actual stories). Without the automation and machine learning side of things, the same project would have required many more people to achieve the same result in the same time.

This was our first attempt at such a large machine learning and natural language processing project, and there’s quite a bit for us to take away, as well as improvements that could be made.

For starters, this project was almost entirely done using our work laptops, and so choices were made that best utilised the computing power we had available. During testing we played with more complicated methods to create word embeddings, such as Google’s BERT transformer. This would have allowed us to preserve more of the context within our corpus. However, these methods took so long on a laptop that we reverted to a simpler method of encoding. If we do a project like this again we’d likely be better off offloading the computational tasks to the cloud, meaning we could make use of more methods like deep learning and models like BERT.

There’s also a lot of experimenting left to do with the text preparation. We didn’t mess much with the words in the text during our preprocessing. Apart from lemmatising the words, we removed only the most common English filler words. However, removing a more extensive list of words devoid of meaning could reduce some of the noise in our data and make it easier to identify the promises.
We could also try training our model with text comprising n-grams, bigrams, or combinations that include parts of speech. All of this might provide more context for the machine learning model and improve accuracy.

We wrote a number of helper scripts meant to aid our manual review, such as by turning mentions of money into real numbers ($3m to 3,000,000). However, we only scratched the surface here. For instance, we didn’t delve much into language models and parts of speech to programmatically identify and remove re-announcements of election promises, which was all done manually. This might also have been achieved through its own machine learning model if we had trained one.

Bonus round: training an object recognition model to recognise novelty cheques and hardhats

While the methods above worked for the text of the Facebook posts, they couldn’t do much for the photos posted by politicians.

So, we found ourselves asking an important question. Could we use machine learning to spot photos of novelty cheques? Having another model in place to find big cheques and certificates in photos might pick up things we’d missed in the text, and also it was quite funny.

Giant cheques have made news in previous years, as in 2019 when the former Liberal candidate for Mayo, Georgina Downer, presented a grant to a bowling club despite this practice usually being the domain of the sitting MP. A novelty cheque again made headlines in 2020, when Senator Pauline Hanson announced a $23m grant for Rockhampton stadium.

With this in mind, we trained an object recognition model to spot giant cheques.
And from there it was a short step further to look at other common tropes of election campaign photo ops: hi-vis workwear and hardhats, cute dogs, and footballs.

We chose these as they were either already available in pre-trained models such as Coco, or had publicly available image datasets for model training.

For the object detection machine learning process we used the ImageAI Python library, which is based on TensorFlow.

ImageAI made it easy to get going without knowing too much about the underlying tech, but if we did it again I think we’d go directly to TensorFlow or PyTorch. When it came to figuring out problems with our models and model training there wasn’t much documentation for ImageAI, whereas TensorFlow and PyTorch are both widely used, with large communities of users.

Another option is an API-based approach, such as Google’s Vision AI, but cost was a factor for training models, which we’d need to do if we wanted to detect novelty cheques.

For each of the categories of hi-vis workwear, hardhats and novelty cheques we trained a custom YOLOv3 object detection model. Hi-vis was based on a publicly available dataset of 800 images, while hardhat detection was based on a publicly available dataset of 1,500 images. Dogs, sports balls and people were detected using a RetinaNet model pre-trained on the Coco dataset.

Anthony Albanese holding a small dog on the campaign trail, annotated by Guardian Australia’s dog detection model. Photograph: Facebook

For cheques, we collated and labelled 310 photos of giant cheques to train the cheque-detection model.
If we did this again we’d probably spend more time on the model training step, using larger datasets and experimenting with grayscale and other tweaks.

This time round, we trained the models on a PC (using the Windows Subsystem for Linux) with a decent graphics card and processor. While we also tried using Google Colab for this, running the models locally was extremely helpful for iterating and tweaking different settings.

Once we had the object detection models up and running, we collated Facebook photos for all Labor and Coalition MPs, candidates and senators for the 2019 and 2022 campaign periods, and then ran the object detection models over them. You can read more about the results here.

The biggest issue with our approach was that the rate of false positives was quite high, even with the larger datasets used. That said, the approach was still much, much better than anything we could have achieved by doing it manually.

Note: while we would usually share the code for projects we blog about, the machine learning parts of this project involve a lot of data we’re not able to share publicly for various reasons. When we get the time we might add a stripped-down version of the project to GitHub later, and update here. You can however access the final election promises dataset here.
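For readers wondering what the counting step after object detection might look like, here is a minimal sketch. It is not the project’s code. It assumes each detection is a dict with `name` and `percentage_probability` keys (the shape we understand ImageAI to return; treat that as an assumption), and the `tally_photo_ops` function, the party lookup and the 60% cut-off are hypothetical. Raising the cut-off is one simple way to trade missed detections for fewer false positives.

```python
from collections import Counter, defaultdict


def tally_photo_ops(detections_by_image, party_by_image, min_confidence=60.0):
    """Count, per party, how many images contain each detected object type.

    detections_by_image: image id -> list of detection dicts (assumed shape)
    party_by_image: image id -> party name (hypothetical lookup)
    """
    counts = defaultdict(Counter)
    for image_id, detections in detections_by_image.items():
        party = party_by_image[image_id]
        # Use a set so three dogs in one photo still count as one dog photo.
        labels = {
            d["name"]
            for d in detections
            if d["percentage_probability"] >= min_confidence
        }
        counts[party].update(labels)
    return counts


if __name__ == "__main__":
    detections = {
        "a.jpg": [
            {"name": "dog", "percentage_probability": 91.0},
            {"name": "hardhat", "percentage_probability": 40.0},
        ],
    }
    print(tally_photo_ops(detections, {"a.jpg": "Labor"}))
```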
