Baseball on a Stick: Ordinal Out of Range Errors (Parsing Errors, Fun with Unicode)

If you use Baseball on a Stick for your PITCHf/x database purposes, you may have run into an annoying ordinal out of range error. If you dive into it, you’ll notice that the instigator is the ‘cc’ field, which contains the description of the stupid “nasty factor.”

This field has unicode characters, and despite my best efforts to convert the DB and set character sets to take utf-8 in MySQL and python, I kept getting the errors. I finally said screw it, I’m just purging that field. I never look at it. And I’ll save a few bytes, too.

Anyway, edit the bbos\src\bbos\gameday\parser\inningParser.py file with these modifications:

        atbatID = tag.parentNode.getAttribute('num')
        pitch['gameAtBatID'] = atbatID
        pitchTime = tag.getAttribute('sv_id')
        ccTag = tag.getAttribute('cc')

        # Blank out the 'cc' (nasty factor description) field; its unicode text triggers the error.
        if ccTag:
            pitch['cc'] = ""

        if pitchTime:
            pitch['sv_id'] = self.__getPitchTime__(pitchTime)

The ccTag portions are my additions to just strip that field. The rest is there for context.
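If you want to see the idea in isolation, here is a minimal standalone sketch in Python 3. It is not the actual BBOS parser: the parse_pitch helper and the sample XML are made up for illustration, and it assumes the pitch elements are read with xml.dom.minidom, which is what the getAttribute calls above suggest.

```python
# Standalone sketch, not the real BBOS code: parse_pitch and SAMPLE are hypothetical.
from xml.dom import minidom

SAMPLE = (
    '<atbat num="7">'
    '<pitch sv_id="130408_190512" cc="Nasty \u2013 located down and away"/>'
    '</atbat>'
)

def parse_pitch(tag):
    """Copy a pitch element's attributes into a dict, blanking the 'cc' field."""
    pitch = dict(tag.attributes.items())
    pitch['gameAtBatID'] = tag.parentNode.getAttribute('num')
    if tag.getAttribute('cc'):
        # Drop the free-text description so no non-ASCII text reaches the database.
        pitch['cc'] = ""
    return pitch

doc = minidom.parseString(SAMPLE.encode('utf-8'))
pitch = parse_pitch(doc.getElementsByTagName('pitch')[0])
print(pitch)
```

The key stays in the dict with an empty value, so an insert that expects a 'cc' column should still line up.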

Enjoy.

Facebook’s HipHop Engine, When to Use it, and Getting it to Work with CodeIgniter

As some of you may know, I dabble in machine learning (ML) and artificial neural net (ANN) coding to solve problems, often relating to baseball. (The Hardball Times: Getting out of the injury zone, part one) What follows is a brief confession of sorts: I still write a lot of my models and stuff in PHP. While I’ve written a bunch of scripts in Python to do everything from archiving MLB.tv games to downloading and custom-parsing PITCHf/x data, plus other data analysis / data mining tasks, PHP continues to be my go-to language. I also lean on CodeIgniter pretty heavily as a lightweight framework to handle MVC/MVA-type requests.

Why PHP and CodeIgniter

My first web-specific programming language was PHP. Prior to that, I had learned BASIC in grade school, Delphi/Pascal/VB in high school, and C/C++ in college. So a C-style language came naturally to me, and there was very little in the way of server configuration and setup, which I’m notoriously horrible at. Furthermore, I don’t consider myself a software developer, so I don’t really care what the pitchfork-wielding masses on Hacker News and Slashdot think about the language. It works.

As for CodeIgniter, I got started with it on a contract job: I was new to writing PHP, had trouble figuring out how to use CakePHP, and switched to CodeIgniter as a random guess. Years and a few submitted patches later, I’m still on the CI bandwagon and loving it. It works well, stays out of my way, and despite a few weird bugs, it gets the job done.

Isn’t PHP slow?

Yes. Dreadfully, actually. However, so is Ruby, and so is Python. So whatever.

And the slowness of any interpreted language has almost nothing to do with why your application has horrible latency. It is way more likely to be your coding style that’s the problem, followed close behind by the database layer and poor server configuration. Except when it isn’t – which is where I found myself a few months ago after writing large ANN/ML apps and big simulators in PHP, where the database layer wasn’t the issue. (My code could certainly stand to be a lot better, but that’s a blog article for another day. Or never, actually.)

When you’re calling millions of iterations of complex code, it might not matter that it takes a few hours to run overnight, but when you’d like it to run a bit faster in production, you start looking for answers. That was the case with two of my latest apps, so I started peeking around Facebook’s HipHop.

What is HipHop?

This is the often-out-of-date wiki on HipHop over at GitHub. It used to convert PHP into C++, which is awesome for people who cheat at software development (like me).

Here’s a quick summary of how it currently works:

  • HipHop Virtual Machine (hhvm) takes cold PHP code and turns it into bytecode/p-code and runs it in a VM
  • Which isn’t that exciting since PHP already does that
  • But hhvm ships with a JIT which turns that bytecode into machine code and executes that
  • man that’s awesome

hhvm can be called to run the optimizer (hhvm-analyze, well, kinda) ahead of time to eliminate the first cold cache reads. hhvm caches code very similar to how APC does. (Here’s a good time to show PHP vs. PHP+APC vs. PHP+HPHPc results – remember HPHPc has been deprecated and hhvm is much faster.)

Basically it makes PHP code run way faster.

Installing it and getting started

I used Ubuntu 12.04 (Precise Pangolin). As such, it’s super easy to install using APT – just follow these instructions.

After that, test your chops by downloading, installing, and configuring WordPress with hhvm.

But remember, I need it to work with CodeIgniter.

Getting it to work with CodeIgniter

The problem is that hhvm uses a weird web server configuration setup that is kinda like apache, but not really. So getting rewriting to work like it does in apache is not exactly straightforward.

Fortunately, I stumbled upon someone’s nginx-as-proxy methodology that was nearly complete. Here’s what you need to do to get it working:

Install nginx

Fortunately this is pretty simple on Ubuntu 12.04. Just run this command:

sudo apt-get install nginx

Unlike apache, however, nginx doesn’t automatically start when the server boots. So be sure to start/stop/restart it manually.

sudo service nginx start|stop|restart

Configure nginx

You need to set up VirtualHosts to get your server to properly route requests. As such, you should copy the basic vhost file:

sudo cp /etc/nginx/sites-available/default /etc/nginx/sites-available/mysitenamehere.com

Then edit the file (suck it vi lovers):

sudo nano /etc/nginx/sites-available/mysitenamehere.com

In the file, your server config should look something like this:

server {
	server_name mysitenamehere.com;
	root /var/www/CI_DIR_HERE;
	index index.php index.html index.htm;
	proxy_redirect off;
	# Copy request_uri to variable $myuri before processing
	set  $myuri  $request_uri;
	location / {
		# Check if a file exists, or route it to index.php.
		try_files $uri $uri/ /index.php;
	}
	# Send *.php to Hiphop hhvm
	location ~ \.php$ {
		proxy_set_header MY_SCRIPT $myuri;
		proxy_pass http://127.0.0.1:9000;
	}
}

Here’s the kicker: Only the directory CI_DIR_HERE is going to work properly with requests. It should be pointed at the root CI directory where your index.php file resides.

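This sets up nginx to route PHP files to port 9000, which is where hhvm is going to take over. On Ubuntu the new file only takes effect once it is linked into /etc/nginx/sites-enabled/ (sudo ln -s /etc/nginx/sites-available/mysitenamehere.com /etc/nginx/sites-enabled/) and nginx is restarted.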
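To make the routing easier to reason about, here is a rough Python model of what that server block does with a request. It is a simplification, not nginx itself: the route function, the file list, and the handler labels are all made up for illustration, and it ignores details like the index directive.

```python
import re

# Pattern for the *.php location block.
PHP_LOCATION = re.compile(r"\.php$")

def route(request_uri, files, dirs):
    """Rough model of the server block: who handles the request, and as what URI."""
    uri = request_uri.split("?", 1)[0]
    # $myuri is copied from $request_uri up front, so it keeps the original path.
    my_script = request_uri
    if PHP_LOCATION.search(uri):
        # Regex locations take priority over the "/" prefix location.
        return {"handler": "hhvm", "uri": uri, "MY_SCRIPT": my_script}
    if uri in files:
        return {"handler": "nginx static file", "uri": uri}
    as_dir = uri if uri.endswith("/") else uri + "/"
    if as_dir in dirs:
        return {"handler": "nginx directory index", "uri": as_dir}
    # try_files fallback: internal redirect to /index.php, which then matches
    # the PHP location and is proxied to hhvm.
    return {"handler": "hhvm", "uri": "/index.php", "MY_SCRIPT": my_script}

files = {"/index.php", "/assets/site.css"}   # hypothetical contents of CI_DIR_HERE
dirs = {"/assets/"}

print(route("/players/view/42", files, dirs))
print(route("/assets/site.css", files, dirs))
```

The point of the MY_SCRIPT header shows up in the first call: the request ends up at /index.php, but the original path (/players/view/42) still travels along so CodeIgniter can pick the controller.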

Configure hhvm

Now we need to configure hhvm to work in tandem with nginx.

sudo nano /etc/hhvm.hdf

Here’s what my config looks like:

Server {
  Port = 9000
  SourceRoot = /var/www/PATH_TO_CI
}

AdminServer {
  Port = 8080
  Password = lolRyanHowardsContract
}

Eval {
  Jit = true
}

Log {
  Level = Error
  UseLogFile = true
  File = /var/log/hhvm/error.log
  Access {
    * {
      File = /var/log/hhvm/access.log
      Format = %h %l %u %t "%r" %>s %b
    }
  }
}

VirtualHost {
  * {
    Pattern = .*
  }
}

This allows hhvm to listen on port 9000 for web requests and 8080 for administration purposes (curl localhost:8080 to see options). Again, it points at /var/www/PATH_TO_CI as the CI root (what can I say, I like the apache-style two-level directory).

Configure CodeIgniter

This one’s easy, fortunately. Just open /application/config/config.php and change these lines:

$config['index_page'] = '';
$config['uri_protocol'] = 'HTTP_MY_SCRIPT';
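Why HTTP_MY_SCRIPT and not just MY_SCRIPT? Under the CGI convention that PHP follows, request headers show up as server variables with an HTTP_ prefix, upper-cased, with dashes turned into underscores. Assuming hhvm does the same (which is what the setting above relies on), the mapping looks like this small Python sketch; server_var_name is a made-up helper name.

```python
def server_var_name(header_name):
    """Map an HTTP request header name to its CGI-style server variable name."""
    return "HTTP_" + header_name.upper().replace("-", "_")

print(server_var_name("MY_SCRIPT"))    # HTTP_MY_SCRIPT
print(server_var_name("User-Agent"))   # HTTP_USER_AGENT
```

So the MY_SCRIPT header that nginx sets is what CodeIgniter reads back as HTTP_MY_SCRIPT.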

Set up a firewall/iptables/UFW

I hate server config, I suck at it, and I am not one to give advice on it. So read this DigitalOcean guide about it.

My setup in a nutshell:

  • i5-2500k based desktop with a lot of RAM and some SSDs
  • VMWare Workstation running my Ubuntu 12.04 distro with 4 GB RAM and 2 cores assigned to it
  • Bridged networking on the VM
  • dd-wrt compatible router; I DMZ’d the IP address of the VM
  • Set up UFW, fail2ban, iptables in what I’m sure is not the best configuration
  • Using no-ip as my dynDNS provider (free if you don’t mind a stupid domain)

It seems to work fine for me. Hopefully that helps everyone out as well.

Lastly, what hhvm/hiphop is good for

It’s awesome if you are stubborn like me and refuse to write in compiled languages for CPU-intensive work. It’s also really nice if you want to marry that laziness/stubbornness with the ability to use PHP’s good frameworks and rapid application development methods. This is an area where PHP is way, way ahead of Ruby and Python. It’s not disputable, really, though there are plenty of language evangelists who will pimp out Rails (not easy to configure for non-web devs, sorry) or Django (don’t get me started).

It might be good for a WordPress install that gets a ton of traffic. I don’t really know, my sites get a few thousand hits per day and I haven’t converted them (though I played around with using Redis as an object cache for WP, which was kinda neat though I ultimately abandoned it and just stuck with my APC-based code cache methods).

It’s definitely not necessary for CRUD-style apps where the bottleneck is not the CPU. To be specific: unless you are blocked at the CPU level, hiphop/hhvm isn’t for you. And even if you are, learning to parallelize your code should be the first thing you do (if it can be done; often you will have methods that require backwards knowledge). Contrary to popular belief, parallel execution can be done in PHP using pcntl_fork() (strictly speaking that forks processes rather than threads) or even specialized classes for it.
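For a feel of the fork-and-collect pattern, here is a sketch in Python, using os.fork as the analog of PHP’s pcntl_fork(). It is Unix-only, the simulate function is just a stand-in for real CPU-heavy work, and both function names are made up for illustration.

```python
import os

def simulate(seed, iterations):
    """Stand-in for CPU-heavy work: a tiny linear congruential generator."""
    x = seed
    total = 0
    for _ in range(iterations):
        x = (1103515245 * x + 12345) % 2**31
        total += x % 100
    return total

def run_forked(seeds, iterations):
    """Fork one child per seed and collect each result through a pipe."""
    children = []
    for seed in seeds:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child process: compute, write the result to the pipe, then exit.
            status = 1
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "w") as out:
                    out.write(str(simulate(seed, iterations)))
                status = 0
            finally:
                os._exit(status)
        # Parent process: keep the read end and move on to the next child.
        os.close(write_fd)
        children.append((pid, read_fd))
    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as src:
            results.append(int(src.read()))
        os.waitpid(pid, 0)
    return results

print(run_forked([1, 2, 3], 1000))
```

This only pays off when the chunks of work are independent, which is the "if it can be done" caveat above.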

But heck, it’s kinda fun to use. So don’t let me tell you what you should or shouldn’t do with it. It’s come a long way since it was HyperPHP/HPHPc, and it’s been fun keeping up with it. Give it a run.

Fun With Past Predictions, Featuring Dave Cameron (#6org)

Dustin Ackley is really bad.

But I’m getting ahead of myself. Let’s start with the source of this post:

Ackley - Oh Really?

Hmm…

Dave didn’t respond to my last comment, though a lot of FanGraphs readers chimed in to let me know how stupid I was.

Good read, gentlemen.

Delusions of Grandeur: The Dustin Ackley Story

Enough snark. Let’s do a little research, shall we? I tweeted Dustin Ackley’s minor league equivalencies (MLEs) for his minor league performance last night:

If you combine that with the 2011 MLB (Seattle) line of .273/.348/.417 over 376 PA, a reasonable 2012 projection using a standard weighting system (something like 5/3/2 or 5/2/1 depending on the components) would give you a park-adjusted value of .257/.338/.396. Even if you considered his collegiate hitting statistics, you would have to seriously worry about two things: he absolutely crushed out-of-conference pitchers but was merely very good against ACC competition, and his negative split against left-handed pitching was larger than most left-handed batters display.

Regardless, even I would agree that .257/.338/.396 is not a reasonable expectation for 2012 Dustin Ackley and would adjust upwards to some degree.

Of course, in 2012 over 668 PA, he hit .226/.294/.328, which was embarrassing on all counts. The wild card here is the Seattle Mariners player development group, who I believe are ruining position players. It doesn’t take a genius to look back at the prospects they’ve had that have vastly underperformed – and not just guys who didn’t cut the mustard, but total flops from guys who were “can’t miss” dudes like Montero, Smoak, Ackley, Saunders, etc.

You can hear Wedge talking about how he hates strikeouts and how more contact has to be made. Dustin Ackley did just that in 2012 – he improved his contact percentages and struck out less. What happened? His walk rate went to shit, his LD% dropped like a rock, he hit a ton of weak ground balls, and his isolated power vanished.

Jack Z

Great job, #6org player development – Ackley did exactly what you told him to do. Did you think he would strike out less, make more contact, and somehow retain his power?

But if we look back with the power of hindsight, what can we really conclude about Ackley? His entire career in professional baseball has been lackluster outside of a small sample in 2011 – and the underlying numbers were worrisome to those who were willing to think critically. He struck out way too much (21%) for a guy who isn’t a big-time power hitter (.144 ISO, only .185 in AAA that year), and his strikeout-to-walk ratio violently flipped when he hit the big leagues.

Absurd Expectations

Dustin Ackley simply hasn’t been a very good professional baseball player, and likely needed far more time in Tacoma (something I said to many people back in 2011) to make the necessary adjustments before coming up for a full-time job. Batted ball data suggested he wasn’t tearing the cover off the ball in AA-AAA before his call-up – he was hitting a slightly below-average number of line drives and an abnormally high number of ground balls against PCL pitchers.

The hype in the Mariners blogosphere around Dustin Ackley massively distorted the most likely outcomes for him. While he’s been even worse than what a reasonable projection would have him at, there was no denying that his minor league statistics showed a real problem and that the adjustment from college to professional baseball hadn’t been successfully made.

The $400,000 Mistake – Eric Wedge

Christopher Long had a nice tweet the other day on the relative worth of an MLB analyst in terms of in-game management gaffes:

Well, alright. I will do the very basic back-of-the-envelope math. But before I do that, I want you to read this first:

We’ll come back to this.

The Situation

In the bottom of the 9th inning, the score was tied 1-1 and the Seattle Mariners had runners on 1st and 2nd. Jason Bay was running at 2nd base for Kendrys Morales (not exactly the biggest speed upgrade, I might add) and Mike Morse was at 1st base after walking.

Phil Coke came in to pitch against Raul Ibanez. Here are two things you need to know about those players:

  • Coke is really good against lefties and laughably bad against righties.
  • Ibanez is good against righties and awful against lefties.

Coke is left-handed, if you didn’t know.

Earlier in the game, Wedge absolutely incinerated his bench, leaving only the right-handed hitting backup catcher Kelly Shoppach on the bench. Unwilling to bring him in to hit in this high-leverage situation (and have him possibly play outfield), he left Ibanez in to hit.

Wedge also allowed Ibanez to swing away against Coke.

Did I mention that Coke also utterly humiliated Ibanez in the playoffs last year? Because he did. And that Ibanez is not exactly the fastest runner in the world? Because he’s not.

Anyway, Raul Ibanez grounds into a double play, Justin Smoak strikes out, and they leave the winning run at third base. They go on to lose in some ridiculous fashion that involves Smoak being thrown out at home, but I don’t care about that.

Just how bad was the decision to not have Ibanez bunt? Let’s make the following assumptions:

  • If Ibanez successfully gets the bunt fair, runners will advance to 2nd/3rd and he will be thrown out at 1st
  • Two bad outcomes are equally likely: Ibanez grounds into a double play if he swings away, or Ibanez fouls out on the bunt (or bunts into an out without advancing the runners)

Let’s just ignore the second point as well as the fact that Ibanez hasn’t successfully sacrifice bunted since 2003 – he’s a professional hitter and one would hope he can lay down a bunt a reasonable amount of the time when the situation calls for it. Especially when the expected outcome of him swinging away against Coke is really bad.

Using the 1969-1992 run expectancy matrix, a team with runners at 1B/2B and 0 outs in an average hitter/pitcher confrontation scores 1+ runs about 63.2% of the time. Assuming a successful sacrifice bunt, a team with runners at 2B/3B and 1 out in an average confrontation scores 1+ runs about 67.8% of the time. So in an average situation, bunting there nets a 4.6 percentage point increase in run/win expectancy. (Run expectancy = win expectancy here since it’s the bottom of the ninth in a tie game.)
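For the league-average case, the arithmetic is just a subtraction. Here it is as a tiny Python sketch using the two figures quoted above; the constant and function names are made up.

```python
# Chance (in percent) of scoring at least one run, from the figures quoted above.
P_FIRST_SECOND_NO_OUTS = 63.2   # runners on 1B/2B, 0 outs
P_SECOND_THIRD_ONE_OUT = 67.8   # runners on 2B/3B, 1 out

def bunt_gain(before, after):
    """Change in scoring chance, in percentage points, from a successful sacrifice."""
    return round(after - before, 1)

print(bunt_gain(P_FIRST_SECOND_NO_OUTS, P_SECOND_THIRD_ONE_OUT))  # 4.6
```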

But Raul Ibanez vs. Phil Coke is not an average matchup. It’s one where Coke is massively, massively favored. Maybe the chance that 1+ runs score in this situation with a swing away approach is really something like 60% – and in reality, it’s probably way worse. But let’s be conservative and assume it’s 60%, giving up a few points because Ibanez hasn’t shown himself to be a very good bunter in the past. Justin Smoak would then come to the plate. Smoak is better against RHP, but the difference is not big. However, Coke is way worse against RHB than the average LHP, so Smoak probably becomes something like an average hitter in this situation. Let’s call it a 68% chance of scoring the run in the 2B/3B with 1 out situation.

So, by not bunting Ibanez, Wedge was giving up 8% in win expectancy.

Wedge

8% Win Expectancy Sacrificed

What’s that mean, financially? We’ll make another assumption about the value of wins – that they are worth $5MM on the open market. That’s probably a bit low, but it’s a fine guess.

Simply put, that means that Eric Wedge lit $400,000 on fire by not bunting Raul Ibanez.

Now that’s not exactly true, since the Mariners aren’t a playoff team and therefore the utility of wins isn’t exactly equal to their cost in such a situation (the Mariners adding a bunch of 1B/DH types who can’t put them to 88+ expected wins is a stupid move that’s worth a different blog post), but it gives you a good idea of just how brilliant that move was.
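The back-of-the-envelope number itself works out like this. This is a Python sketch using only the assumptions stated above (60% swinging away, 68% after a successful bunt, $5MM per win); the function and constant names are made up.

```python
WIN_VALUE_DOLLARS = 5_000_000   # assumed market price of one win

def decision_cost(swing_pct, bunt_pct, win_value=WIN_VALUE_DOLLARS):
    """Dollar value of the win expectancy given up by swinging away."""
    lost_points = bunt_pct - swing_pct          # whole percentage points
    return lost_points * win_value // 100

print(decision_cost(swing_pct=60, bunt_pct=68))  # 400000
```

Plug in the league-average gap of roughly 5 points instead of 8 and it still comes to about a quarter of a million dollars.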

Dear all MLB teams: I’ll work for far less than $400,000 per game. And if I get your pitchers throwing harder, that’s worth a few million bucks. Just saying.

Covering Up Ubaldo’s Velocity Loss

Paul Hoynes (beat writer for the Cleveland Plain Dealer) said this about the game on April 8th:

Sounds reasonable.

Here’s what the linked article on cleveland.com said:

Turns out there was a problem with the PITCHf/x program at Progressive Field on Monday that affected the velocity readings that appeared on the ballpark signboards. The readings affected every pitcher who appeared in the game, not just Jimenez.

Reportedly, the ballpark scoreboard was 2.6 mph to 2.8 mph lower than what the velocity pitchers were actually throwing at according to radar guns behind the plate.

This has happened before in other parks, so it’s a totally plausible theory. The only problem is that Hoynes is absolutely wrong.

Hiroki Kuroda pitched against Ubaldo Jimenez. Here are his Brooks Baseball PITCHf/x average velocities:

April 8th (the slow gun day): 90-91 MPH (link)
April 3rd (vs. BOS): 90-91 MPH (link)

So you’re telling me Kuroda was throwing 92-93 instead? Let’s look at some other pitchers from that game.

Matt Albers, April 8th: 92 MPH
Matt Albers, April 6th: 93.5 MPH (+1.5 MPH)

Albers threw a bit harder on April 6th, but he was also pitching on one day’s rest in the April 8th game.

Joba Chamberlain, April 8th: 93.8 MPH
Joba Chamberlain, April 6th: 92.5 MPH (-1.3 MPH)

Joba threw harder on the “slow gun” day. So apparently Joba is throwing something like 96-97 MPH again? Definitely something to tell the Yankees’ brass; I’m sure they’d be very happy.
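Here is the same sanity check as a short Python sketch: take the April 8th readings quoted above, add the claimed 2.6 to 2.8 MPH back, and see what the pitchers would “really” have been throwing. The numbers are the ones in this post; the names in the code are made up.

```python
# Velocities (MPH) quoted above: (April 8th reading, April 6th reading).
READINGS = {
    "Albers": (92.0, 93.5),
    "Chamberlain": (93.8, 92.5),
}
CLAIMED_OFFSET = (2.6, 2.8)   # how slow the April 8th readings were said to be

def implied_velocity(april_8_reading, offset=CLAIMED_OFFSET):
    """Velocity range a pitcher 'really' threw if the slow-gun claim were true."""
    low, high = offset
    return (round(april_8_reading + low, 1), round(april_8_reading + high, 1))

for name, (april_8, april_6) in READINGS.items():
    low, high = implied_velocity(april_8)
    print(f"{name}: implied {low}-{high} MPH on April 8th vs {april_6} MPH on April 6th")
```

For the slow-gun story to hold, both relievers would need to have gained velocity in two days: about 1 MPH for Albers and about 4 MPH for Joba.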

Let’s recap exactly what Ubaldo’s velocity loss was:

Ubaldo Jimenez, April 8th: 11 fastballs at 90.3 MPH, 51 “changeups” at 84.8 MPH.
Ubaldo Jimenez, April 3rd: 28 fastballs at 93.1 MPH, 23 changeups at 83.7 MPH.

So you’re telling me Ubaldo threw 51 changeups? Doesn’t seem likely. His h-break/v-break for his changeup on April 8th wasn’t even close to the same pitch on April 3rd. Compare for yourself using the links above – on April 8th it was -5.34/+8.06, on April 3rd it was -6.36/+5.17.

You know what a pitch with -5.34/+8.06 break is most similar to in Ubaldo’s pitch selection? A four-seam fastball.

And you want to see the first pitch of the game? What does Carlos Santana clearly signal for? And what does his grip look like?

Ubaldo 84 MPH Fastball

Someone is covering up Ubaldo’s struggles. I don’t know if it’s Hoynes or the Indians or some other source, but the idea that the PITCHf/x system was responsible for problems is total bullshit. Covering up these kinds of problems doesn’t solve anything.

Ubaldo needs help. I’ve written about his mechanics at length in the following articles:

Lying about the situation isn’t going to help anyone. It’s time to face facts.