Locke Data Blog
Locke Data helps organisations get started with data science. Grow your skills with our blog posts.
- April 19, 2018
- April 18, 2018
- April 6, 2018
- March 29, 2018
- March 23, 2018
- March 15, 2018
- March 7, 2018
- March 6, 2018
- February 28, 2018
- February 21, 2018
- February 12, 2018
- January 31, 2018
- January 29, 2018
- December 29, 2017
- December 20, 2017
- December 18, 2017
- October 5, 2017
- June 17, 2017
- June 13, 2017
- June 13, 2017
- June 7, 2017
- May 26, 2017
- May 22, 2017
- May 19, 2017
- May 15, 2017
- May 12, 2017
- May 9, 2017
- May 8, 2017
- May 5, 2017
- May 3, 2017
- May 2, 2017
- May 2, 2017
- April 28, 2017
- April 24, 2017
- April 21, 2017
- April 19, 2017
- April 18, 2017
Building your booth presence is the fourth instalment of the Sponsoring Community Events series aimed at helping companies get to grips with sponsoring community events and getting the most out of them. This post covers some of the things that you should be thinking about when you are planning on having a booth at an event.
- April 14, 2017
- March 20, 2017
- March 6, 2017
I need your help.
Battle of the Beards is an annual tech event in Cardiff that’s previously been an evening affair but is now a day-long conference. We’re hosting half-hour talks on security, infrastructure, software craftsmanship, front-end development, and data visualisation. We’re starting out the day with bacon baps and it just gets better from there. Tickets cost £15 and there’s the option to add a donation to the charity we’re supporting with the event, the Campaign Against Living Miserably.
Right now ticket sales are really low and without your help, we’ll have to cancel the event.
I’m hoping you can help make this event succeed by doing one or both of two things:
- register if it’s of interest to you
- recommend it to others
Register: https://battleofthebeards.eventbrite.co.uk
Tweet: https://twitter.com/home?status=Hey%20%23tweeps%20-%20%23battleofthebeards%20is%20on%20March%2029th%20in%20Cardiff.%20You%20should%20check%20it%20out!%0Abattleofthebeards.eventbrite.co.uk
- March 1, 2017
Today in class, I taught some fundamentals of API consumption in R. As it was aligned to some Microsoft content, we first used HaveIBeenPwned.com’s API and then played with Microsoft Cognitive Services’ Text Analytics API. This brief post overviews what you need to get started, and how you can chain consecutive calls to these APIs in order to perform multi-lingual sentiment analysis.
UPDATE: See improved code in Using purrr with APIs – revamping my code
- February 27, 2017
A big part of why I’ve launched Locke Data is so that I can give back more to my communities. I want to give more time and more support to others. One of the first steps is doing some activities that give financial support to community groups without damaging my startup cashflow! Community R workshops that fund local user groups are the first activity I’ll be trialling.
Here’s what’s involved, and what you might want to consider if you’d like to be a part of this endeavour:
- February 22, 2017
- February 20, 2017
Time series data is an important area of analysis, especially if you do a lot of web analytics. To be able to analyse time series effectively, it helps to understand the interaction between general seasonality in activity and the underlying trend.
The interactions between trend and seasonality are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.
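As a rough illustration of the distinction (not the method from the full post), one simple heuristic is to split the series into seasonal cycles and check whether the size of the seasonal swing stays roughly constant (additive) or grows in proportion to the level (multiplicative). The Python sketch below assumes a known `period`, strictly positive values, and at least two complete cycles; the function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Spread relative to the mean; assumes the mean is non-zero."""
    return pstdev(values) / mean(values)


def classify_seasonality(series, period):
    """Label a series 'additive' or 'multiplicative' using a simple heuristic.

    Assumes a known seasonal period, positive values and at least two
    complete cycles. This is an illustrative rule of thumb only.
    """
    # Split into complete seasonal cycles, dropping any partial cycle at the end.
    cycles = [series[i:i + period]
              for i in range(0, len(series) - period + 1, period)]
    if len(cycles) < 2:
        raise ValueError("need at least two complete seasonal cycles")

    levels = [mean(cycle) for cycle in cycles]
    swings = [max(cycle) - min(cycle) for cycle in cycles]
    ratios = [swing / level for swing, level in zip(swings, levels)]

    # Additive: the swing itself is stable. Multiplicative: swing / level is stable.
    if coefficient_of_variation(swings) <= coefficient_of_variation(ratios):
        return "additive"
    return "multiplicative"


# Toy data: trend plus a fixed seasonal offset vs. trend times a seasonal factor.
additive = [10 + t + [0, 5, 0, -5][t % 4] for t in range(24)]
multiplicative = [(10 + t) * [1.0, 1.5, 1.0, 0.5][t % 4] for t in range(24)]

print(classify_seasonality(additive, period=4))
print(classify_seasonality(multiplicative, period=4))
```

Real data is noisier than these toy series, so a check like this is best treated as a first look before choosing a decomposition.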
- February 8, 2017
- January 13, 2017
The Cross Industry Standard Process for Data Mining (CRISP-DM) was a concept developed 20 years ago now. I’ve read about it in various data mining and related books and it’s come in very handy over the years. In this post, I’ll outline what the model is and why you should know about it, even if it has that terribly out of vogue phrase data mining in it! 😉
Data / R people. Do you know what the CRISP-DM model is?
— Steph Locke (@SteffLocke) January 8, 2017
- January 4, 2017
- December 1, 2016
- November 24, 2016
I’ve been writing (not enough) blog posts for a while now and have built up some neat stuff in the backlog, if I may say so myself. Alas, a lot of this doesn’t get seen because it’s not on the front page or in the top 5 blog posts. Sad that posts like my one on sixth normal form databases don’t get enough love, I’ve installed the WordPress plugin Revive Old Posts (ROP) to try countering this!
- November 14, 2016
- November 8, 2016
- October 30, 2016
- October 28, 2016
- October 27, 2016
- October 21, 2016
- October 13, 2016
- October 10, 2016
This post will give you a quick run-through of adding tSQLt to an existing database project destined for Azure SQL DB. This basically covers unit testing in SSDT and there is a lot of excellent info out there, so this focuses on getting you through the initial setup as quickly as possible. This post relies especially on the information Ed Elliott and Ken Ross have published, so do check them out for more info on this topic!
- October 7, 2016
- October 6, 2016
It’s PASS Board of Directors elections again! After a number of Twitter discussions last week about the applicability of PASS outside of the US and what I think PASS is good and bad at, I thought I would engage with the process instead of just being a complainy-pants. I attended all 6 town hall webinars and asked questions of all the candidates. I recommend you watch them before voting.
Find out more about the candidates, the PASS Board of Directors elections, and how to vote on the PASS website.
- September 27, 2016
- September 17, 2016
- September 15, 2016
- August 30, 2016
- August 4, 2016
- August 2, 2016
- August 1, 2016
- July 20, 2016
From code in answers on Stack Overflow to R packages or full programs, there’s a lot of code being written and given away. This post examines some of the reasons why the people writing all that code do it, why you should consider giving back with code, and how you can get started. Finally, I cap it all off with perspectives from some of my favourite coders!
There are many reasons why you should consider writing code and making it available for public consumption.
- If you’re writing something to achieve a task, odds are someone else would have to write the same code – why not help them out?
- You’re using a lot of open source software, whether you realise it or not. By open sourcing your code, you get to pay it forward
- To give others something to contribute to
- Unknown quantities are risky hires; put your code out there for the world to see and employers get to see what you can do
- Develop your skills for the next job, the one that requires you to be more skilled in something than you are now
- You get to interact with a lot of different people who you build credibility with, and hopefully friendships!
- Generally speaking, the more code you write, the better your coding skills so if you want to improve your skills this is an ideal way to do it
- For the sheer fun of doing cool stuff, especially if you don’t get to do cool stuff in the day job
- To do it “the way it should be done”
- July 12, 2016
- July 11, 2016
I’ve recently been trying to solve the challenge of extracting files from AWS and getting them into Azure in my desired format. I wanted a solution that kept everything in the cloud and completely avoided local tin. I wanted it to have built-in auditing and error handling. I wanted something whizzy and new, to be honest! One way in which I attempted to tackle the task was with Azure Automation. In this post, I’ll overview Automation and explore how it stacked up for what I was attempting to use it for.
Overall Task: Get compressed (.tar.gz) files from AWS S3 to Azure, decompress the files, concatenate the contents and put in a different container for analytics magic
As with most things, I dropped myself in at the deep end, so I had fairly minimal knowledge of PowerShell and the Azure modules; I fully expect more knowledgeable folks to wince at my stuff. General advice, “you should do it like this, then this…” suggestions, and resource recommendations are all very welcome – leave them in a comment!
Azure Automation is essentially a hosted PowerShell script execution service. It seems to be aimed primarily at managing Azure resources, particularly via Desired State Configurations.
It is, however, a general PowerShell powerhouse, with scheduling capabilities and a bunch of useful features for the safe storage of credentials etc. This makes it an excellent tool if you’re looking to do something with PowerShell on a regular basis and need to interact with Azure.
- July 7, 2016
- June 29, 2016
- June 23, 2016
I really liked the way Brent showed us his feedback received and since mimicry is the best form of flattery, I thought I’d go ahead and do it too!
I didn’t get any accepted abstracts, and I’m actually grateful. The recent stresses to do with the PASS dramas aside, I would have had to use 5 days holiday time, pay for flights and hotel, and then flown out a week later for MVP Summit. Now I can attend some other conferences and/or have a Christmas break! Woo hoo 😀
- June 23, 2016
- June 10, 2016
- June 9, 2016
- June 1, 2016
- May 27, 2016
- May 11, 2016
- April 20, 2016
UPDATE 2016-10-21 : You can now get the ODBC 13 driver for Linux with a much smoother install process than below. Get all the relevant information on the announcement from the Microsoft SQLNCli team blog.
Did you know you can now get SQL Server ODBC drivers for Ubuntu? Yes, no, maybe? It’s ok even if you haven’t since it’s pretty new! Anyway, this presents me with an ideal opportunity to standardise my SQL Server ODBC connections across the operating systems I use R on, i.e. Windows and Ubuntu. My first trial was to get it working on Travis-CI since that’s where all my training magic happens and if it can’t work on a clean build like Travis, then where can it work? Alas, the ODBC 13 driver doesn’t work on Ubuntu 14.04, so this set of instructions has been modified to provide code for Ubuntu 15.04 only.
It works, but it’s really hacky right now. Definitely looking forward to the next iterations of this driver.
- This will work for Ubuntu 15.04 but 14.04 has a different set of C compilers
- This is currently hacky, and Microsoft are on the case for improving it so this post could quickly become out of date.
- Be very careful installing the driver on an existing machine, because it overwrites unixODBC if already installed and may have compatibility issues with other driver managers you have installed.
- April 19, 2016
Continuing in the series of shiny module design patterns, this post covers how to pass all the inputs from one module to another.
Return input from within the server call. Store the callModule() result in a variable. Pass the variable into arguments for other modules. Access the variable like you would input. Steal the code and, as always, if you can improve it, do so!
- April 14, 2016
Following on from looking at the shiny modules design pattern of passing an input value to many modules, I’m now going to look at a more complex shiny module design pattern: passing an input from one module to another.
Return the input in a reactive expression from within the server call. Store the callModule() result in a variable. Pass the variable into arguments for other modules. Steal the code and, as always, if you can improve it, do so!
- April 12, 2016
We’re in the fantastic situation where lots of people are using Travis-CI to test their R packages, or to test and deploy their analytics, documentation, or anything really. Its popularity has been having a negative side effect recently though! GitHub rate limits API access to 5000 requests per hour, so sometimes there are more R-related jobs running on Travis per hour than this limit allows, causing builds to error, typically with a rate-limit error message.
This error will cause your build to fail, even if you didn’t do anything wrong. To solve it short-term you can wait a little while and restart your build.
That is a very short-termist solution and does not solve the problem for future you or other users of the service. The real solution to resolving this issue is to get off the default API access credentials and use your own.
The R integration in Travis makes good use of devtools. The devtools package looks for an environment variable called GITHUB_PAT that holds a personal access token (PAT) for using the GitHub API and, if it doesn’t find one, it uses a default token. When we get our own PAT and store it in Travis, devtools will pick up our token and use it, meaning you’ll only ever get rate limited if you do more than 5000 builds in an hour, which is an achievement I’d love to hear about.
- April 8, 2016
For the awesome Shiny Developers Conference back in January, I endeavoured to learn about shiny modules and overhaul an application using them in the space of two days. I succeeded and almost immediately switched onto other projects, thereby losing most of the hard-won knowledge! As I rediscover shiny modules and start putting them into more active use, I’ll be blogging about design patterns. This post takes you through the case of multiple modules receiving the same input value.
Stick overall config input objects at the app level and pass them in a reactive expression to callModule(). Pass the results in as an extra argument into subsequent modules. These are reactive so don’t forget the brackets. Steal the code and, as always, if you can improve it, do so!
- April 5, 2016
With my HIBPwned package, I consume the HaveIBeenPwned API and return a list object with an element for each email address. Each element holds a data.frame of breach data or a stub response with a single-column data.frame containing NA. Elements are named with the email addresses they relate to. I had a list of data.frames and I wanted a consolidated data.frame (well, I always want a data.table).
Enter data.table …
data.table has a very cool, and very fast, function named rbindlist(). This takes a list of data.frames and consolidates them into one data.table, which can, of course, be handled as a data.frame if you didn’t want to use data.table for anything else.
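To make the shape of that consolidation concrete, here is a small language-neutral sketch in Python rather than R. It is not the package's code: `bind_rows`, the `email` column name, and the sample responses are all hypothetical, and the behaviour is only comparable in spirit to calling rbindlist() with its `idcol` argument so that each row remembers which list element it came from.

```python
def bind_rows(named_tables, id_column="email"):
    """Stack a dict of name -> list-of-row-dicts into one flat list of rows."""
    combined = []
    for name, rows in named_tables.items():
        for row in rows:
            # Copy each row and record which list element it came from.
            combined.append({id_column: name, **row})
    return combined


# Hypothetical API responses keyed by email address.
responses = {
    "a@example.com": [{"breach": "Adobe"}, {"breach": "LinkedIn"}],
    "b@example.com": [{"breach": None}],  # stub row for "no breaches found"
}

all_rows = bind_rows(responses)
print(len(all_rows))
```

Keeping the stub rows means every email address still appears in the consolidated table, even when no breach data came back.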
- April 4, 2016
As part of my never-ending quest to deploy documentation better, I’ve made yet another tweak to my scripts that deploy R vignettes or Rmarkdown documents to the gh-pages branch of my GitHub repositories via Travis-CI.
The script from Robert Flight that’s provided the basis for most of this work does something specific to update the web facing branch of the repository. It would:
- Create a blank repository
- Add the requisite files to the repository
- Add and commit them to the repo
- Force the repo to overwrite the web-facing branch
This had the unfortunate consequence of losing the history of what was previously hosted on the branch and could not tell me what commit to my development branches was responsible for a version of the docs. It took a little bit of playing but the revised script now:
- Clones the gh-pages branch
- Adds the requisite files into the reports
- Adds and commits them to the repo
- Pushes the changes
Using an environment variable ($TRAVIS_COMMIT), the commit message is the commit ID for the latest commit in the build that occurs on Travis, making it very easy to see what changes triggered a documentation update.
- March 24, 2016
- March 23, 2016
- March 21, 2016
The answer in life to the inevitable question of “How can I do that in R?” should be “There’s a package for that”. So when I wanted to query HaveIBeenPwned.com (HIBP) to check whether a bunch of emails had been involved in data breaches and there wasn’t an R package for HIBP, it meant that the responsibility for making one landed on my shoulders. Now, you can see if your accounts are at risk with the R package for HaveIBeenPwned.com, HIBPwned.
The package is currently available on GitHub @ stephlocke/HIBPwned, but I intend to submit to CRAN after getting some feedback from y’all.
- March 15, 2016
- March 14, 2016
Recently I’ve had to get to grips with SSH tunnels. SSH tunnels are really useful for maintaining remote network integrity and working in a secure fashion. It is, however, a pain to open PuTTY and log in all the time, mainly because I couldn’t script it in R! It’s been a trial, but like most things it turned out to be pretty simple in the end so I thought I’d share it with you.
- March 9, 2016
- March 8, 2016
When I’m building stuff in R like packages, models, etc. I find myself wishing for realistic looking test data without having to resort to getting data off my production server. To that end I’ve been on the hunt for a way of generating decent test data. A few months back I stumbled upon the neat system Mockaroo which provides a GUI to build some data that suits your needs.
Mockaroo is a really impressive service with a wide spread of different data types. They also have simple ways of adding things like within-group differences to data so that you can mock realistic class differences. They use the freemium model so you can get a thousand rows per download, which is pretty sweet. The big BUT you can feel coming on is this – it’s a GUI! I don’t want to have to spend time hand-cranking a data extract.
Thankfully, they have an API for getting data too and it’s pretty simple to use, so I’ve started making a package for it.
I’ve started the package on GitHub and will be developing it over the next month or two. It’s up and working, but only in the most primitive way as I’d like to get some feedback from folks who might find this useful around how the interface for generating your desired data schema should work.
- March 7, 2016
- March 3, 2016
- March 1, 2016
- January 13, 2016
I love my Dell XPS13. It’s fast, sleek and gorgeous. It does however have one little problem: the icon and text size. The text was always too big for the buttons and boxes and the icons were so small you could hardly see them. This made it hard to use my machine without an external screen (which doesn’t have that issue and should have been my first clue!)
- January 6, 2016
- December 31, 2015
- December 23, 2015
- December 7, 2015
- November 11, 2015
Boris Hristov, Data Platform MVP, design whizz, and all-round great guy, recently launched 356labs. Boris wrote a great Presentation Design course for PluralSight; you can sign up for a trial of PluralSight and watch the course if you’d like to find out more.
Being an avid reader of design stuff I did find I knew some of the things on the course, but the context and application were very helpful. Off the back of his course, I went on to produce my most visually impressive presentation slide deck to date – Agile BI.
I took a look over his site and asked a few questions since I was really curious. Here are the responses!
- November 7, 2015
A while back, I wrote about how I was waiting to be able to release optiRum to CRAN. Well, data.table 1.9.6 (a key dependency for new functionality) has been released and I’ve finally had some quiet time, so optiRum 1.37.1 is now accepted and trickling through the CRAN publishing process.
- November 4, 2015
UPDATE: Proposal now being developed after fantastic community support. Check out satRdays on GitHub and contribute your opinions!
I had a contact from a very nice chap in Dallas a month ago about whether in the R world we do anything like SQLSaturdays.
The great thing about the SQLSaturdays he said was not that they’re free (well it helps!) but that they’re on his time. Developing his skills was something he couldn’t get signed off by his boss so he wanted to be able to do it by himself.
In answer to the question of whether there are local(ish) weekend conferences happening regularly for R, my answer was “not really” and it’s a shame because the R community is fantastic. I started thinking about why we don’t have them and what would be needed to change that.
Free / cheap regional small-medium conferences are a must for growing user knowledge and speakers in R.
- October 28, 2015
- October 24, 2015
Yesterday, another Women in Technology conference got forwarded around and looking at the agenda, I snapped. I asked to not see any more goddamn WiT conferences.
I’m really fed up with women talking about being in tech. I don’t perceive any value in attending a conference dedicated to that. I want to see more women talking about doing tech.
- October 16, 2015
My lightning talk was titled DataOps – it’s a thing (honest) and focused on what is essentially DevOps ported out of the developer sphere and into the data professional sphere.
- July 16, 2015
- June 24, 2015
Back in April, for SQL Saturday Exeter I ran my first ever full day of training. Next month sees me taking my second tilt at it.
To sign up for my R training day, July 24th, in Manchester you can go to the pre-con homepage.
If I may say so myself, it’s a steal at £99 but then they all are! For instance, Andrew Fryer’s training day covers the Machine Learning use of R via Azure, so if you’re already wrangling numbers like a pro in R, understanding how you can apply it to snazzy webservices is a great way to go.
- June 17, 2015
I’ve been producing presentations via R using rmarkdown and outputting to either ioslides or slidify. That was excellent, because I could provide a CSS that customised the look and feel (relatively) easily*.
However, when I wanted to produce a PDF version, I couldn’t make ones that look as good as the pure LaTeX versions I could make on overleaf.com. So I started RTFMing when I wanted to replicate the look and feel from my presentation, The LaTeX Show.
I didn’t want to spend a huge amount of time on it, so this little story of hack and slash may feel a bit dirty to you!
- June 5, 2015
Following on from my post about the principles behind using travis-ci to commit to gh-pages, I wanted to follow up with how I tackled my “intermediate” use case.
Posts in this series
- Automated documentation hosting on github via Travis-CI
- Auto-deploying documentation: multiple R vignettes
- Auto-deploying documentation: FASTER!
- Auto-deploying documentation: Rtraining
- Auto-deploying documentation: better change tracking of artefacts
In my original post I show how I pushed the tfsR vignette to gh-pages, which involved copying it and renaming it to index.html.
Unfortunately, this wouldn’t work if I had multiple vignettes that I wanted to be accessible online.
- An index.html file
- A way of extracting any number of html files from the vignette folder
- June 4, 2015
- June 3, 2015
optiRum, the R package I built and support for Optimum on CRAN, has gained some extra functions recently. Some of it uses currently experimental data.table functionality so I’m eagerly awaiting the release to CRAN to deliver optiRum.
In the interim, I thought I’d give some brief overviews of existing functionality contained in the package.
- June 1, 2015
In this post, I’m going to cover how you can use continuous integration and source control to build and host documentation (or any other static HTML) for free, and in a way that updates every time your code changes. I’ll cover the generic capability, and then how I apply this to my simplest package, tfsR. In a later post (once I’ve cracked the best method to do it) I’ll cover my more complex use case of multiple documents and a dynamically constructed index page.
NB: This was kicked off by a post from Robert Flight about applying the technique to R package vignettes. It’s a very useful post but it was quite specific to his situation and I wanted to understand the principles behind it before I started extending it to my more complex cases.
Posts in this series
- Automated documentation hosting on github via Travis-CI
- Auto-deploying documentation: multiple R vignettes
- Auto-deploying documentation: FASTER!
- Auto-deploying documentation: Rtraining
- Auto-deploying documentation: better change tracking of artefacts
- Must haves:
- A Linux machine (so you can test your bash script that Travis-CI will run)
- R (for following the specific instructions)
- Get an OAuth token from GitHub
- Add the OAuth token to Travis
- Add a *.sh file that gets your HTML (depending on circumstance, you may also need to generate it) and pushes to the gh-pages branch
- Include your .sh file in the after_success part of your Travis file
- Commit & push!
- May 25, 2015
- April 26, 2015
- April 20, 2015
With excellent guidance and tooling on making R packages, it’s becoming really easy to make a package to hold your R functionality. This has a host of benefits, not least source control (via GitHub) and unit testing (via the testthat package). Once you have a package and unit tests, a great way of making sure that as you change things you don’t break them is to perform continuous integration.
What this means is that every time you make a change, your package is built and thoroughly checked for any issues. If issues are found the “build’s broke” and you have to fix it ASAP.
NB – it doesn’t have to be your only remote server
- April 17, 2015
- April 16, 2015
optiRum, the R package I built and maintain for Optimum on CRAN, has gained some extra functions recently. Some of it uses currently experimental data.table functionality so I’m eagerly awaiting the release to CRAN to deliver optiRum.
In the interim, I thought I’d give some brief overviews of existing functionality contained in the package.
I do a lot of regression models and one of the common tools for assessing a regression’s ability to accurately model an event is to produce a Gini chart and a Gini coefficient. The higher the Gini coefficient, the more your model is able to discriminate probability accurately.
I simplify the process of producing Gini charts (giniChart) and coefficients (giniCoef) so that I get a chart in one simple step.
Under the hood this uses the AUC package to get the coefficient, scales to format it and ggplot2 to produce the chart. Using ggplot leads to a better looking chart that can also be tweaked to suit your needs since a ggplot object is returned by the function.
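For readers who want to see the arithmetic behind a Gini coefficient derived from AUC, the standard relationship is Gini = 2 × AUC − 1. The sketch below is a plain-Python illustration of that relationship, not optiRum's implementation: the function names are hypothetical, outcomes are assumed to be coded 0/1, and AUC is computed by comparing every positive case with every negative case.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs where the positive case
    has the higher score; ties count as half. Assumes outcomes are 0 or 1
    and that both classes are present.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def gini_coefficient(scores, outcomes):
    """Rescale AUC so that 0 means no discrimination and 1 means perfect."""
    return 2 * auc(scores, outcomes) - 1


# Hypothetical predicted probabilities and observed outcomes.
predicted = [0.9, 0.4, 0.6, 0.1]
observed = [1, 1, 0, 0]
print(gini_coefficient(predicted, observed))
```

The pairwise loop is quadratic in the number of cases, which is fine for an illustration but slow on large datasets.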
- April 15, 2015
- April 13, 2015
- April 8, 2015
- April 7, 2015
- April 6, 2015
I’m working on building a snazzy shiny app that a) drops the inputs/parameter values into blob storage and b) uses Stream Analytics to query the values and present back what people are saying at the moment. This’ll be a fab tool for my pre-con next month if I can get it working in time!
Getting it working does, however, mean utilising the Azure Blob Storage API in R, which I confess is much harder than expected, especially after the ease of using the Visual Studio Online API for tfsR. To that end, I thought I’d write up some of my findings before I do a bigger write-up that illustrates how to do everything (in R).
I’m working my way through an intro to Azure storage on the (hopefully reasonable) expectation that more knowledge will make it easier to work with. There’s additionally the online reference, although I found the VSO REST API documentation easier to understand and get started with.
- March 23, 2015
- March 20, 2015
The unholy abomination of trying to use TFS as my central repository for my R code over the past year has been tough and you may or may not be looking at the screen as if I’m a crazy fool for even trying. Of course, now I have good news, because I’ve broken the back of the main issue I had with TFS. The crucial link was being able to programmatically create Git repositories within a single project for small projects.
Using the API, I’ve been able to write an R package with functions that now save me at least 15 minutes of time and effort each time I want a new project. So I can happily holler “IT’S ALIVE!!”
- March 14, 2015
For some people it might sound silly, but a frequent reason why people don’t sign up or don’t make it to their local user group is to do with social anxiety. I totally understand this – a room full of people you don’t know can be a daunting experience. I still get nervous when attending a new user group for the first time and I run three user groups, and speak at user groups and conferences all round the country!
This post takes you through the worries, and explains how I’ve approached some of the issues. Hopefully, this’ll help you get more people into your local user group and learning, whether it’s because you have the tools to help yourself, or understand and can help others.
- February 27, 2015
- February 24, 2015
Entering into the world of SQL Server around the same time as the 2008 release has meant that until the past couple of years, change in the Microsoft BI world only happened in dribs and drabs for me. SQL Server and its BI components were stable server products and the focus was on getting data and optimising “central reporting”. Recently though things have started to massively change due to Azure and Office 365.
No longer part of Server & Tools where products were considered in silos, SQL Server and BI are now part of the Cloud Platform. It’s now a means of delivering the Cloud-first vision that Microsoft have aligned themselves to.
- February 22, 2015
- February 18, 2015
- February 13, 2015
- February 11, 2015
- February 9, 2015
Last year I built a pretty sweet web service in R as part of the day job. However, not being well-versed in stuff like object-oriented programming, I did not do the best job of making the flow of my program particularly clear or robust. It wouldn’t take multiple inputs properly and I found it to be tough to test. In spare moments, I took to cogitating how to improve things.
I tried simply refactoring some of the functions but found my structure too cumbersome to allow much change. I tried starting afresh with an S4 system but was soon in a death spiral of class proliferation and no experience in how to stop it. After dabbling with different methods, I was getting pretty frustrated – I want my code to be better and more maintainable!
Now I’m looking at magrittr.
This means you can more succinctly pass an input through various transformation steps (in contrast to my initial method) with a lot less code. The ability to add conditional functions or even new functions on the fly (aka lambda functions) with a similarly low code burden gives the added benefit of helping with branching logic.
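The idea of passing a value through successive transformation steps is not specific to R. As a loose analogy only (magrittr's `%>%` operator is R; this is a Python sketch built around a hypothetical `pipe` helper), the pattern, including steps defined on the fly as lambdas, looks something like this:

```python
from functools import reduce


def pipe(value, *steps):
    """Pass value through each function in steps, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


result = pipe(
    [3, -1, 4, -1, 5],
    lambda xs: [x for x in xs if x > 0],  # keep only the positive values
    lambda xs: [x * 2 for x in xs],       # double each remaining value
    sum,                                  # collapse to a single total
)
print(result)
```

Each step receives the previous step's output, so the code reads in the order the data flows rather than as nested function calls.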
- February 7, 2015
Oz and I, being the lazy so and so’s that we are, share a profile and use it across all our devices. Our username is “Steph & Oz” which means the user folder that Windows has for us is C:\Users\Steph & Oz. Having spaces and special characters is generally not recommended, and gives interesting issues when using R, primarily at initialization and when trying to do package installations.
By default, R will try to make the user’s personal folder the directory which it works under, i.e. limiting its impact on the computer overall, but its Unix/Linux roots mean that it doesn’t like you doing whacky things like ampersands in folder names.
The result with ours is to cause this error on load:
Error installing package: Error: ERROR: no packages specified
‘Oz’ is not recognized as an internal or external command,
operable program or batch file.
- February 6, 2015
Hot off the back of his win in the Tribal Awards, Paul is offering to mentor 3 men & 3 women for two months. To be in with a chance of getting mentored by Paul, you simply need to apply by writing a blog post about why you should be considered for mentoring and posting the link by the 15th Feb 2015.
I think it’s an awesome offer that you should take up if possible (i.e. you’re reading before the deadline) and whilst I’m busy trying to convince you I’m going to insert my application too. Hopefully, seeing my application will help you form your own.
What is the value of being mentored?
Mentoring gives you the opportunity to have someone who can assist you in the way a senior techy can when you face a technical challenge. They can give valuable advice about hidden perils, shortcuts, and point out code smells.
That advice is valuable, but to get it you need to properly formulate your issue or challenge faced. Like posting on Stack Overflow, putting thought and preparation into the question gives you a deeper understanding before you even talk to your mentor.
It’s worth noting that you can’t be vague. “I want to be the best” or “I want to know everything” is never going to happen. Mentoring is not a panacea for your entire career – especially with short duration mentoring like Paul’s. To get the value, you need to settle on a specific issue or challenge that you want to tackle.
- January 24, 2015
As I covered in my post on SQLSaturday Exeter, I’m going to be doing a full day of R training on April 24th that takes you from cabin boy to first mate in a day. You can’t be captain because I’m Captain… until you go back to your own ship… then you can be captain.
Attend my day of training about R if you’d like to learn R, best practices, and how to manage it.
It’s £150 (early bird) and can be booked at SQLSaturday Exeter’s website.
- January 21, 2015
A brief history
As I’ve been using R for the past couple of years and spent the first months struggling with it, I wanted to give the presentation that I would have wanted to see at the beginning. Not one about random bagging and a bunch of other stats, but one about the best ways to do the fundamentals:
- connecting to my database
- performing data manipulations, summaries and updates
- charting my data
- producing reports
A few packages cover these awesomely and are much better than base R so whilst I was tackling a massive stats project, the things which took the time and stress were things I could have avoided with ease!
So my intro to R takes people through the things I wish I’d been taken through, thus making those first few months of R pleasant, happy times!
- January 20, 2015
- January 19, 2015
Woohoo! The kind and crazy folks at SQLSaturday Exeter accepted my submitted training day for their roster. Before I wax lyrical on the virtues of being locked in a room with me all day, I thought I’d better cover the fundamentals of the event itself!
First, the awesome video…
- January 18, 2015
I wanted to outline my approach to presentation design, or development as I prefer to call it.
Why do I consider it development? Well, it’s a product that can be manually done & delivered but with the potential to scale to thousands of users, I’d rather the product be easy to maintain & deploy, deliver real value to the users, and keep up with cutting edge developments in the subject. Also, I call it development because now with the use of rmarkdown, I do actually code my presentations.
General presentation design
I’ve read and studied a lot about presentations, some of the biggest influences being:
- Dr. Andrew Abela and the Extreme Presentation Method
- Buck Woody and his fantastic presentation style
- Brent Ozar and his excellent materials for presentations
- Solid fundamentals in presentation training courses (things like INTRO: Intro, Need, Title, Range, Objective)
When I first come up with the idea for a presentation, I write the abstract for it. In the abstract I set out the tone, material covered, and outline who should attend. This abstract is my requirements doc for later me – it tells me whether I’m selling, educating, or entertaining and what I’m doing it about.
In my opinion, you should always write the abstract first: not only can you write more abstracts than you can presentations, but it also distills the idea down and helps you think of your audience first.
- January 14, 2015
Last night was the first Cardiff R User Group event. There were 6 people registered out of 24 CaRdiffians. In the end we had 8 people show up – so a whopping third of our current membership base.
As we sat around the booth eating chips and drinking beer, we covered our experiences learning R to date, the trials and tribulations of our jobs and why you shouldn’t drop a barbell on your nose. We had great discussions and most of us came away with new R functionality to look at!
We decided to initially go with the three session formats I’d proposed and see how things go:
- January 13, 2015
I spend a lot of time in Photoshop for someone in BI. Between cleaning up images, building logos for my latest project, or producing material for user groups, I probably use it at least once a week. Through it all, I usually need to produce variants, in different file formats and sizes. So it can quickly become a dozen uses of the Save As… or Save for Web functions.
I hate manual work, so you can see why it was frustrating in the extreme. Then I realised how silly I was being by not having already googled for it!
It took a while because my keyword searches weren’t the terms Photoshop uses but I found the Secret Sauce. And if you’re the sort of person who’d type “photoshop image macro” – here’s how you do it!
- January 12, 2015
As part of my ongoing series about presenting at community events and conferences, I wanted to cover my personal thought process when it comes to prioritising which events I’d like to speak at for my goal Throw 1, Speak 1.
There are a massive number of awesome SQL Server and other technology events happening out there. I even throw a SQL Server lunchtime session once a week for the user group! Then of course there are all those conferences in the UK and abroad that are worth attending. So how do I even start picking out where I’d like to talk, and how do I go about getting selected for them?
- January 11, 2015
It’s a bit sad but I enjoy dissecting what sessions are submitted to conferences I’m involved in or speak at. Instead of doing it primarily by eye, I’ve started dabbling in web scraping in R to do it. Initially, I used RCurl and my latest snippet uses rvest.
The first snippet, written for SQLBits, uses RCurl but it’s cumbersome, plus for SQLSaturday Exeter there is SSL to contend with. Using rvest makes it really easy and it was an excellent excuse to get around to using magrittr, Stefan Milton Bache and Hadley Wickham’s pipe code paradigm for R.
Blogger tip: I also wanted the opportunity to see how Gists imported into WordPress – you just copy and paste the URL in (into the code, no URL markup) and WordPress automatically pulls in the Gist. For more info on this see WordPress’ article on Gist.
- January 10, 2015
- January 9, 2015
These days any hobby of mine ends up with a user group if there isn’t one already.
The amount of value I derive from being able to hear experts in their fields talk, whether they’re on stage or in the audience, is phenomenal. Also, it’s a really great way to meet like-minded people.
So with the benefits in mind, 2 years of R under my belt, and a new starter in work, the time seemed ripe for an R user group.
- January 8, 2015
Following up from my last post on maintaining my session abstracts, I wanted to cover how I’m doing my scheduling this year for speaking at events. Perhaps more important than the tech are the intention and the planning process, so I’ll be covering these factors in more detail than the tech.
I make use of Google services quite a bit, and their calendar system is a great help. So this year I’ve added a calendar that has all my (and hopefully Oz’s) speaking engagements.
Throw 1, Speak 1
The goal this year is to throw one user group event and speak at one event each month.
- January 2, 2015
Last year I spoke at 10 different events (I think) and was very lucky to be nominated in the Tribal Awards for my Intro to R session. I did just a couple of different session titles and I don’t think I managed the whole process very well.
To be an easier speaker to deal with, I’m trying to be more organised so that the selection process of myself & topics is easier whilst also ensuring I don’t develop too many presentations at the last minute.
Having dealt with awesome serial speakers Tobiasz Koprowski and Denny Cherry from the organiser end, I noticed they did a few things which made them much easier to deal with, particularly given the breadth of topics they can cover!
- December 29, 2014
This year we’ll be continuing to maintain evening events on Tuesday nights and lunch time events on Thursdays.
So far we have the following events and speakers scheduled for the evening events:
- Jan 27th – 2 hour intro to replication by David Williams
- Mar 31st – Battle of the Beards! Tobiasz Koprowski vs Terry McCann vs Rob Sewell
- May 26th – Index Fragmentation: Internals, Analysis, and Solutions by Paul Randal, and Steve Powell
- Jul 28th – Alex Whittles on winning Fantasy F1 using PowerPivot
I’ve got slots in there for full hour sessions as well as lightning talks of up to half an hour, so whether you’re an existing speaker or want to improve your knowledge, please get in touch and book yourself in.
- November 23, 2014
If you need to join multiple datasets inside SSRS, perhaps because of different sources, grains of detail, etc., then you often need to aggregate over both datasets.
In SSRS, you can easily perform aggregations over another dataset but it can be tough to do this based on a grouping factor in your main dataset.
A key example of this might be Sales and Purchases – you want to show both of these by month but they come from two different data sources.
You could build two tables that appear to be just one table but this can be really clunky. Instead, you want just one table with the month, the total sales, and the total purchases in.
Although there’s no tidy way of doing this built in, you have the power to add your own functions to SSRS using the Code window of the report’s properties. Provided here is a block of VB script that can be added to your SSRS report to allow you to do those tricky aggregations as if they were just another built in function.
I call it AggLookup.
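The idea behind this kind of lookup can be sketched outside SSRS. The Python below is not the VB code from the post: it is a minimal illustration, assuming two small in-memory datasets, of aggregating a second dataset by the grouping value of the first. The function name `agg_lookup`, the field names, and the sample figures are all hypothetical.

```python
def agg_lookup(key, rows, key_field, value_field, agg=sum):
    """Aggregate value_field over the rows whose key_field equals key.

    Returns 0 when no rows match, so a month with no purchases still
    shows up in the combined table.
    """
    matches = [row[value_field] for row in rows if row[key_field] == key]
    return agg(matches) if matches else 0


# Hypothetical datasets that would come from two different sources.
sales = [
    {"month": "2014-10", "amount": 100},
    {"month": "2014-10", "amount": 50},
    {"month": "2014-11", "amount": 70},
]
purchases = [
    {"month": "2014-10", "amount": 30},
    {"month": "2014-11", "amount": 20},
    {"month": "2014-11", "amount": 25},
    {"month": "2014-12", "amount": 10},
]

# One row per month: (month, total sales, total purchases).
months = sorted({r["month"] for r in sales} | {r["month"] for r in purchases})
report = [
    (m,
     agg_lookup(m, sales, "month", "amount"),
     agg_lookup(m, purchases, "month", "amount"))
    for m in months
]

for row in report:
    print(row)
```

Passing a different `agg` (for example `max` or `len`) changes the aggregation without touching the lookup itself, which is the "just another function" feel the post is after.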
- November 10, 2014
- November 6, 2014
- June 28, 2014
- September 25, 2013
- September 15, 2013
What is R?
R is a statistical language for doing all sorts of analytics based on many different types of data and it’s also an open source platform that allows people to extend the base functionality. More details are available from the horse’s mouth.
How can I give it a go?
Download R and RStudio, an awesome development environment for R. There is also an excellent online R learning site. I do not recommend sticking with just R – we’re used to a lot more convenience and good development bits and bobs like IntelliSense, and RStudio really delivers.
- September 13, 2013
Further to the last post introducing my trials and tribulations, and a hectic week or two, we’ve made excellent progress on the Relay. I’ve enlisted Mark (@tsqltidy), the chair for the Relay, and others to assist with the twittering and other activities, which has really helped me reduce my workload substantially.
All ten venues are going ahead:
| Venue | Date |
|---|---|
| Reading | Monday 11th Nov 2013 |
| Southampton | Tuesday 12th Nov 2013 |
| Cardiff | Wednesday 13th Nov 2013 |
| Birmingham | Thursday 14th Nov 2013 |
| Hertfordshire | Friday 15th Nov 2013 |
| Newcastle | Monday 25th Nov 2013 |
| Manchester | Tuesday 26th Nov 2013 |
| Norwich | Wednesday 27th Nov 2013 |
| Bristol | Thursday 28th Nov 2013 |
| London | Friday 29th Nov 2013 |
So what’s been done so far?
What have I been doing to try to make this a successful marketing channel:
- September 7, 2013
- September 1, 2013
After organising SQLRelay for June 24th in Cardiff, as part of the national series of 8 events, we’re gearing up for November with the aim of being able to capitalise on the growing knowledge of SQL Server 2014 CTP and pushing the Relay into a less busy part of the UK community schedule. The difficulty is that where we had more than 6 months to prep for the previous Relay, this time round we had less than 5. What this means for me is that not only do I want to run a bigger and better Cardiff event, but I have also (being a glutton for punishment) taken on spearheading the marketing efforts for the whole shebang.
Details will be released next week on the launch, but given my lack of knowledge about anything social media this has already been a major undertaking for me, and I thought it might be of value for me, future me, and my dear readers to compile information and learnings as I go along so that it’s easier to implement in future for other marketing endeavours. It also provides an area for discussion.
- August 27, 2013
Why do I use dynamic named ranges?
Where I work, most reports are exposed via a web front-end and Excel can create an external connection and retrieve the information. This is much safer than using direct database connections in workbooks. A problem with web queries, though, is that they cannot be converted to Tables, which would make referencing columns and the dataset as a whole easier. As a result, dynamic named ranges are a necessity for producing spreadsheets that are easy to develop and manage, since the volumes in the raw data can change over time.
How I save myself time
A raw data table with 20 columns will take a long time to create the named ranges for, given that I want:
- A dynamic range covering the headers too for pivot tables
- A dynamic range without headers for vlookups
- A dynamic range for each column without headers
I use a macro, assigned to a nice button on my ribbon, to generate all the relevant ranges.
What are the special considerations?
Structure – raw data tables should ALWAYS be set up in a specific way: with the Primary Key on the left hand side and always filled in, and with no empty rows or columns.
Special characters – range names can’t contain special characters. The VBA uses the RegEx functionality to strip these out.
Numbers – range names can’t start with a number or look like a cell reference (Q1, for example), so numbers need handling too. We can’t just strip out the numbers like we would special characters because they might be important like Grade1, Grade2 and Grade3 and collapsing them all to the name Grade would be a problem. Instead, the macro converts all numbers to the corresponding letter in the alphabet.
How much will the data grow? By default I set the macro to use 10 times the number of records present when I run the macro – if it’s already bigger than 25k rows, the number will need to be reduced, and if I don’t think 10 times the number will be adequate, I’ll increase the number.
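The naming rules described above can be sketched in a few lines. This is not the VBA macro from the post, only a Python illustration of the same logic under stated assumptions: the digit mapping (1 → A through 9 → I, with 0 → J) is a guess at what "corresponding letter" means, and the function names `clean_range_name` and `range_rows` are hypothetical.

```python
import re

# Assumed mapping: 1 -> A, 2 -> B, ... 9 -> I. The treatment of 0 (-> J)
# is an assumption; the post does not say how zero is handled.
DIGIT_TO_LETTER = {str(d): chr(ord("A") + d - 1) for d in range(1, 10)}
DIGIT_TO_LETTER["0"] = "J"


def clean_range_name(header):
    """Turn a column header into a name with letters and underscores only."""
    # Convert digits first so Grade1 and Grade2 stay distinct...
    with_letters = "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in header)
    # ...then strip spaces, punctuation and other special characters.
    return re.sub(r"[^A-Za-z_]", "", with_letters)


def range_rows(current_rows, growth_factor=10):
    """Number of rows the range should allow for, given today's row count."""
    return current_rows * growth_factor


for header in ["Grade1", "Grade2", "Cost (£)", "Q3 Sales %"]:
    print(header, "->", clean_range_name(header))
print(range_rows(2500))
```

Converting digits before stripping matters: done the other way round, the regular expression would remove the digits and Grade1, Grade2 and Grade3 would all collapse to Grade.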
- August 20, 2013
- August 16, 2013
- July 27, 2013
- June 27, 2013
- May 29, 2013
We need to report on a system that is form based. Whenever there is a new form, there is a new table, and whenever there is a new or amended* field on the form, there is a new column in the table. Maintaining the imports of this data into a staging environment would require a lot of code and time to build manually from scratch.
What is required is something that goes through the two schemas for all relevant objects and updates our staging area’s schema accordingly.
Points for consideration:
- Due to the level of change in the source system, all loads are dynamically generated SQL
- Loads run from a data dictionary table, which needs to be updated when we update the schema
- Loads occur daily
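As a rough illustration of the comparison step, the sketch below diffs two schema descriptions and generates the SQL needed to bring staging up to date. It is written in Python for brevity and makes several assumptions: each schema is held as a dictionary of table name to column name to data type (in SQL Server this could be read from INFORMATION_SCHEMA.COLUMNS), the staging schema is literally called `staging`, and the table and column names are made up. Updating the data dictionary table is not shown.

```python
def schema_changes(source, staging, staging_schema="staging"):
    """Return the DDL statements that add missing tables and columns.

    source and staging map table name -> {column name: data type}.
    Only additions are handled; dropped or retyped columns are ignored.
    """
    statements = []
    for table in sorted(source):
        columns = source[table]
        if table not in staging:
            # New form in the source system: create the whole table.
            cols = ", ".join(f"[{c}] {t}" for c, t in columns.items())
            statements.append(
                f"CREATE TABLE [{staging_schema}].[{table}] ({cols});"
            )
            continue
        for column, data_type in columns.items():
            if column not in staging[table]:
                # New field on an existing form: add the column.
                statements.append(
                    f"ALTER TABLE [{staging_schema}].[{table}] "
                    f"ADD [{column}] {data_type};"
                )
    return statements


# Hypothetical snapshot of the two schemas.
source = {
    "FormA": {"Id": "INT", "Name": "NVARCHAR(100)", "Email": "NVARCHAR(255)"},
    "FormB": {"Id": "INT", "Score": "INT"},
}
staging = {
    "FormA": {"Id": "INT", "Name": "NVARCHAR(100)"},
}

ddl = schema_changes(source, staging)
for statement in ddl:
    print(statement)
```

Because the output is just a list of statements, the same daily job that runs the dynamically generated loads could run these first and then refresh the data dictionary.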
- May 25, 2013
- May 24, 2013
- May 19, 2013
- May 12, 2013
For my first ever blog post (be gentle with me!) I wanted to talk about an issue I have with Excel’s check box object, and my way of resolving it. It’s not perfect, and I’d love to hear of any other versions or ideas you may have. So here’s how I create check boxes in Excel without using Excel check boxes.
The Problem with Check Box Objects
They look good and they work well, there’s no denying they do what they’re supposed to, but they also annoy the heck out of me!
- May 11, 2013
This blog was configured super rapidly with GoDaddy and Azure, instead of my previous implementation on EC2. I’ve forgone the multi-site installation, with attendant subdomains, and gone for a straight WordPress Website (one of the Azure features).
I already had an Azure account I’d gone through the billing setup for – but that was really simple anyway, so getting the blog up and running consisted of: