If you use Baseball on a Stick for your PITCHf/x database purposes, you may have run into an annoying ordinal out of range error. If you dive into it, you’ll notice that the instigator is the ‘cc’ field, which contains the description of the stupid “nasty factor.”
This field has unicode characters, and despite my best efforts to convert the DB and set character sets to take utf-8 in MySQL and python, I kept getting the errors. I finally said screw it, I’m just purging that field. I never look at it. And I’ll save a few bytes, too.
Anyway, edit the bbos\src\bbos\gameday\parser\inningParser.py file with these modifications:
atbatID = tag.parentNode.getAttribute('num')
pitch['gameAtBatID'] = atbatID
pitchTime = tag.getAttribute('sv_id')
ccTag = tag.getAttribute('cc')
pitch['cc'] = ""
pitch['sv_id'] = self.__getPitchTime__(pitchTime)
The ccTag portions are my additions to just strip that field. The rest is there for context.
I recently wrote a mass geolocation script in PHP since Google seems to be taking down all the websites that offer this vital service. Right now it:
- Checks a table for records without lat/long coordinates
- Sends those records’ addresses to Google’s geolocation API
- Updates your records with the lat/long coordinates
It checks for the “#” sign and handles it since Google has problems with translating “2020 Anywhere Place #200″ as a real address, but there are probably other errors I haven’t adequately checked for.
At the moment it updates records that are missing lat/long coordinates and writes zeroes into those fields if it can’t geolocate the address correctly. As a result, if you run the script again, it will assume those records were updated correctly and will pass over them, since they are not null. This is a known issue and is one (of many) areas of improvement for the script.
I didn’t want to spend a ton of hours on the script to make it a full-fledged awesome service, since Google doesn’t seem to like that. It’s just a basic script that handles common problems and works well enough.
You can get it over at my GitHub repo page for GeoCode. Enjoy.
I’ve launched yet another half-complete web application relating to baseball sabermetrics! ML Splits is a database of minor league baseball players (batters only for now) that shows their “splits” (performance against LHP and RHP) as well as park effects and major league equivalencies. The data is taken from Jeff Sackmann’s old minorleaguesplits.com site where he made the CSVs available for import as open source.
I leaned on jQuery for front-end display purposes, as I’m getting more and more comfortable with using it for front-facing web applications. Mostly just dynamic div tagging and toggle() to keep the screen clear of distractions and make it easy to see what stats are really important. Wrote the entire thing in PHP 5.3.x, MySQL 5, CodeIgniter 2.0, and jQuery.
EDIT: Pitchers are up as of April 4th.
I’ve started to use Springloops RC2 for my version control efforts, which is really awesome. They support SVN and git (no Mercurial yet), have a great interface with easy deployments to multiple servers so you can split them up into production/staging/development, a solid ticketing system, and a good code browser. Also, it’s completely free!
HeidiSQL has been my Windows MySQL GUI of choice ever since I switched to it from Toad. It’s generally very good, it’s FOSS, and it handles most operations fairly well – except for CSV importing. Site5 is my current webhost, and I’m very happy with them 99.9% of the time, but they caused me a bit of pain recently. I have a lot of CSVs to import that are about 10-15 MB in size each, and I tried importing them through the very handy phpmyadmin tool. However, phpmyadmin has some memory leaks and issues with importing larger CSVs, so it ended up crashing due to memory problems on the larger files (even though it theoretically can handle up to 105MB CSVs). I sent Site5 a ticket to have them import the CSVs manually (they said they would), but they responded with “there’s no table structure in your DB so we can’t do it.” Well, uh, that’s what phpmyadmin does and why I wanted to use it over the 7 tables I need to make with 20 or so CSV files. They refused again without providing me an alternative, so I had do it myself.
I fired up HeidiSQL, manually created the tables, and imported the CSV. And… it didn’t work. I ignored the first row (column headers), but it still responded with “invalid data.” The fields were correctly cast for the rest of the data, so I didn’t know what was up. A bit of Googling tells me that even if you ignore the first row, HeidiSQL still checks it against the data types in your table. This is idiotic and annoying for any number of reasons, all of which I leave up to you to figure out.
At any rate, deleting the first row and ignoring 0 rows ended up working just fine. Now to do this repetitive task over and over again…
If you’re trying to build a PITCHf/x database like I am, you are probably heavily leaning on Mike Fast’s work. It’s a great primer, however, that’s exactly how it should be viewed: A primer. There are numerous things that have changed since he initially wrote the page, and it’s not a simple copy/paste job to download all the data from MLBAM. Among the issues are:
- Inadequate handling of timeouts from gd.mlb.com and gd2.mlb.com
- Hardcoding the IP is not a valid suggestion and causes lots of problems (suggested to get around DNS resolution issues)
- Parser script does not know how to figure out which games have been used without manually querying each at bat (expensive and unnecessary)
- Database structure is not future-proof
I could write for hours on the first three (plus other bugs), but I didn’t document them well enough and it’s simply not that interesting to me to write about script bugs that people should learn how to fix for themselves. If used verbatim, Mike’s database structure will fail around July 2010 when inputting new pitches and at-bats into the database. You’ll get errors like the following:
Yes, this is my Windows box.
I’ve inserted a bunch of print statements to help me debug the code (ah, printf-style debugging) and saw that it was reporting duplicate or unknown key entries. Recalling the little I know about MySQL and the numbers involved (513425 and later 1900131), I was pretty sure that this is typical behavior when a memory space is overloaded. In this case, Mike uses MEDIUMINT(8) to describe the primary key of ab_id in the code.
If you have phpmyadmin (and you should), you can fix this problem rather easily by editing the structure of the table to change ab_id to INT(10) in both the pitches and atbats tables. Additionally, you need to change pitch_id in pitches to an INT(10). This will allow for larger numbers to be stored in those rows.
phpmyadmin - Where to edit your PITCHf/x database (click for larger)
If you don’t use phpmyadmin, the MySQL code to execute at the command line is something like:
ALTER TABLE pitches MODIFY ab_id INT(10);
Good luck with your PITCHf/x database building!