Down the Rabbit Hole with UTF-8, YAML, and RSpec

I hope this post saves someone from a similar afternoon to the one I just spent, puzzling over what appeared to be an impossible test result.

TL;DR: string.encode!(string.encoding, …) does nothing, even if string isn’t valid for string.encoding. To really force an encoding, hop through BINARY. Regex matches on unsanitized binary data are a common cause of explosions, but in our RSpec tests the failure never surfaced (possibly because RSpec matchers patch =~), so an RSpec test may not catch encoding failures.

Parts of Tddium, like many other Rails applications, use ActiveRecord object serialization for variable-field data that doesn’t need to be represented in a relational schema.  For example, a model like this:

[sourcecode lang="ruby"]
class Service < ActiveRecord::Base
  serialize :raw_messages, Array
end
[/sourcecode]

Internally, ActiveRecord serializes to YAML, which has a lot of nice properties. For a variety of reasons, in some of our apps, we’re still using Syck to process YAML. The source of the message data that gets written to raw_messages doesn’t always reliably produce UTF-8, so, being paranoid, we sanitize in both the producer and the consumer. In the producer, we use Ruby’s encode! method with options to replace invalid and undefined conversions, and in the consumer we drop messages that contain invalid characters. We’d always had some messages dropped, but we recently needed a dropped message for debugging, and we found that the producer’s sanitization wasn’t as zealous as we thought.

Some Background on Transcoding in Ruby

(For reference, all of this was done with ruby-1.9.2-p290, and rspec 2.6.0.)

Here’s an example of a string that claims to be UTF-8, but isn’t actually.  Ruby properly detects that it’s invalid.

[sourcecode lang="ruby"]
> Encoding.default_internal
=> #<Encoding:UTF-8>
> "a\xFCe".encoding
=> #<Encoding:UTF-8>
> "a\xFCe".valid_encoding?
=> false
[/sourcecode]

Calling encode! doesn’t appear to do anything, even when it’s told to replace characters that can’t be translated:

[sourcecode lang="ruby"]
> s = "a\xFCe"
=> "a\xFCe"
> s.encode!
=> "a\xFCe"
> s.valid_encoding?
=> false
> s.encode!("UTF-8", :invalid=>:replace, :undef=>:replace)
=> "a\xFCe"
> s.valid_encoding?
=> false
[/sourcecode]

In this version of Ruby, the implementation of encode! is actually a no-op if the source and destination encodings are the same, whatever replacement options are passed (in transcode.c, look for str_encode_bang, and trace into str_transcode0 around line 2614).
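
The difference between relabeling a string and transcoding it can be checked outside irb. What follows is a minimal sketch, not code from our apps. It assumes a UTF-8 source file, and it deliberately leaves out the :invalid and :undef options because, as far as I know, later Ruby versions treat the same-encoding case differently when those are given.

```ruby
# Assumes this file is saved as UTF-8, so the literal below is labeled UTF-8.
s = "a\xFCe".dup                  # 0xFC alone is not a valid UTF-8 sequence

label_before = s.encoding         # the label the string carries, not a guarantee
valid_before = s.valid_encoding?  # checks the bytes against that label

# Same source and destination encoding, no replacement options:
# nothing is transcoded, so the bytes come back unchanged.
same = s.encode("UTF-8")
bytes_unchanged = (same.bytes == s.bytes)

# force_encoding only swaps the label; it never touches the bytes.
binary = s.dup.force_encoding("BINARY")
relabel_only = (binary.bytes == s.bytes)
valid_as_binary = binary.valid_encoding?  # any byte sequence is valid BINARY

puts "label=#{label_before} valid=#{valid_before}"
puts "encode left bytes alone: #{bytes_unchanged}"
puts "force_encoding left bytes alone: #{relabel_only}, valid as BINARY: #{valid_as_binary}"
```

The point of the sketch is that neither call repairs anything: the invalid byte is still there afterwards, only the label may have changed.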

So, of course, blindly converting a string to YAML after encode! has left its invalid bytes in place doesn’t work:

[sourcecode lang="ruby"]
> YAML::ENGINE.yamler
=> "syck"
> { s => 1 }.to_yaml
ArgumentError: invalid byte sequence in UTF-8
    from …lib/ruby/1.9.1/syck/rubytypes.rb:148:in `is_complex_yaml?'
    from …lib/ruby/1.9.1/syck/rubytypes.rb:170:in `to_yaml'
[/sourcecode]

Here’s the source for that exception:

[sourcecode lang="ruby"]
# rubytypes.rb (excerpt)
class String
  yaml_as "tag:ruby.yaml.org,2002:string"
  yaml_as "tag:yaml.org,2002:binary"
  yaml_as "tag:yaml.org,2002:str"
  def is_complex_yaml?
    # the =~ at the end of this line is what raises on an invalid byte sequence
    to_yaml_style or not to_yaml_properties.empty? or self =~ /\n.+/
  end
  # …
end
[/sourcecode]
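
The regex match is the part that blows up: matching has to decode the string, so it raises as soon as it meets bytes that are invalid for the string's encoding. Here's a minimal sketch of that behavior. The helper name multiline? is made up for illustration and is not part of Syck.

```ruby
s = "a\xFCe"   # labeled UTF-8 (assuming a UTF-8 source file), but 0xFC is not valid UTF-8

# The same pattern is_complex_yaml? uses, applied to the invalid string.
begin
  s =~ /\n.+/
  outcome = "matched without error"
rescue ArgumentError => e
  outcome = "ArgumentError: #{e.message}"
end
puts outcome

# Hypothetical guard: only run the regex once the bytes are known to be valid.
def multiline?(str)
  str.valid_encoding? && !(str =~ /\n.+/).nil?
end

puts multiline?(s)        # the regex is never attempted on the invalid string
puts multiline?("a\nb")
```

Checking valid_encoding? first turns the exception into an ordinary false, which is roughly what our consumer does when it drops a message.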

But… We Had a Test For That?!?

What caused confusion is that the RSpec test we had written for the sanitizer didn’t blow up.

An (itself sanitized) example:

[sourcecode lang="ruby"]
# spec/enc_spec.rb
require "spec_helper"

describe "Encoding" do
  it "should produce YAML" do
    d = { "ab\xFCe" => 1 }
    expect { sanitize(d).to_yaml }.not_to raise_error
  end
end
[/sourcecode]

Run this test, it passes, and we conclude that the sanitize method works.

[sourcecode lang="shell"]
$ rspec spec/enc_spec.rb
.

Finished in 0.25556 seconds
1 example, 0 failures
[/sourcecode]

OK, now I’m irritated. Let’s try a different test, this time of to_yaml itself:

[sourcecode lang="ruby"]
# spec/enc_spec2.rb
require "spec_helper"

describe "Encoding" do
  it "should explode!" do
    d = { "ab\xFCe" => 1 }
    expect { d.to_yaml }.to raise_error
  end
end
[/sourcecode]

It fails. Oops.

[sourcecode lang="shell"]
$ rspec spec/enc_spec2.rb
F

Failures:

  1) Encoding should explode!
     Failure/Error: expect { d.to_yaml }.to raise_error
       expected Exception but nothing was raised
     # ./spec/enc_spec.rb:6:in `block (2 levels) in <top (required)>'

Finished in 0.17097 seconds
1 example, 1 failure
[/sourcecode]

I don’t know why this happens. I suspect that RSpec redefines =~ to wrap it with a nice error message, and that this somehow keeps it from exploding the way Ruby’s native =~ operator does. Any ideas welcome.
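
Whatever the cause, one way to sidestep it is to stop relying on to_yaml raising and check the encoding directly. This is a sketch under that assumption. The name all_strings_valid? is invented here; it is not an RSpec matcher or something from our code.

```ruby
# Hypothetical helper: walks hashes and arrays and reports whether every
# string it finds is valid for the encoding it is labeled with.
def all_strings_valid?(obj)
  case obj
  when String then obj.valid_encoding?
  when Hash   then obj.all? { |key, value| all_strings_valid?(key) && all_strings_valid?(value) }
  when Array  then obj.all? { |item| all_strings_valid?(item) }
  else true   # numbers, nil, symbols and so on are not checked
  end
end

dirty = { "ab\xFCe" => 1 }
clean = { "abc" => ["x", { "y" => "z" }] }

puts all_strings_valid?(dirty)
puts all_strings_valid?(clean)
```

In a spec, the expectation would then be on this helper's return value for the sanitizer's output, rather than on raise_error, so the result doesn't depend on how =~ behaves inside the test run.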

A Solution

What we ended up implementing was a sanitizer that looks like:

[sourcecode lang="ruby"]
def sanitize(s)
  # note: force_encoding relabels s in place
  if s.force_encoding("UTF-8").valid_encoding?
    # just hand it back if it's already valid UTF-8
    return s
  else
    # otherwise, drop the source encoding and start from scratch;
    # :undef covers bytes that have no UTF-8 equivalent, which coming
    # from BINARY means every non-ASCII byte
    return s.force_encoding("BINARY").encode("UTF-8",
                                             :invalid => :replace,
                                             :undef => :replace)
  end
end
[/sourcecode]
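
For a quick check of the BINARY hop outside the app, here's a hedged sketch. sanitize_copy is a hypothetical variant of the method above that works on a duplicate, so the caller's string isn't relabeled; it isn't what we deployed.

```ruby
# Hypothetical non-mutating variant of sanitize.
def sanitize_copy(s)
  utf8 = s.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?
  # Relabel as BINARY, then transcode: every non-ASCII byte is "undefined"
  # in that conversion and is swapped for the replacement character.
  utf8.force_encoding("BINARY").encode("UTF-8", :invalid => :replace, :undef => :replace)
end

raw   = "a\xFCe"              # invalid UTF-8 (assuming a UTF-8 source file)
clean = sanitize_copy(raw)

puts clean.encoding
puts clean.valid_encoding?
puts clean.bytes.inspect
puts raw.valid_encoding?      # the caller's string is left as it was
```

The invalid byte should come back as the Unicode replacement character, U+FFFD. On Ruby 2.1 and later, String#scrub does a similar job without the hop, but it wasn't available on the 1.9.2 we were running.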
