Sunday, February 1, 2015

Merge branches with different file encodings in git

The problem

We changed the default encoding of all our source files from the system default (cp1252) to UTF-8. Recently we ran into a problem when merging changes from the old support branch, which still uses cp1252, back into master. The merge went through without any merge conflict, but the file was corrupt afterwards.

Let's create a minimal example project consisting of only one file, test.properties, encoded in cp1252, with the following content:
ae=ä
ue=ü
oe=ö
a=a
b=b
c=c
d=d
e=e
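and commit this file. Now we create a support branch to maintain the stable release. Development continues on master. Because we want everything to be UTF-8 now, we convert the file to UTF-8 (iconv writes to stdout, so the result goes to a temporary file that then replaces the original):
iconv -f cp1252 -t utf-8 test.properties > test.properties.utf8 && mv test.properties.utf8 test.properties
and commit the changed file.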

Meanwhile, on the support branch, we need to add a new key to the properties file and now the file looks like this, still in cp1252:
ae=ä
ue=ü
oe=ö
a=a
b=b
c=c
d=d
e=e
sz=ß
and commit the change.

Back on master, we want to merge the changes made on the support branch:
git merge support
This seems to work: git merges everything without an error. But now the file looks like this
ae=ä
ue=ü
oe=ö
a=a
b=b
c=c
d=d
e=e
sz=ß
when opened with cp1252. Or like this
ae=ä
ue=ü
oe=ö
a=a
b=b
c=c
d=d
e=e
sz=�
when opened with UTF-8.

As you can see, the new line from the support branch, which was in cp1252, got merged in as is. This renders the resulting file unusable, because part of the file is now in UTF-8 and part in cp1252. And remember: git did not warn us, and the merge seemed to work well. Let's roll back:
git reset --hard HEAD~
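The whole scenario can be replayed in a throwaway repository. The following is only a sketch under a few assumptions that are not part of the original setup: the repository lives in a temporary directory, the user name and e-mail are placeholders, and the cp1252 bytes are written with printf octal escapes so that the script does not depend on the encoding of the terminal. The final check uses iconv as a UTF-8 validator.

```shell
#!/bin/bash
# Sketch: reproduce the silently corrupting merge in a temporary repository.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # name the initial branch master
git config user.name "Example"               # placeholder identity
git config user.email "example@example.invalid"
git config core.autocrlf false

# Initial version in cp1252: \344 = ä, \374 = ü, \366 = ö
printf 'ae=\344\nue=\374\noe=\366\na=a\nb=b\nc=c\nd=d\ne=e\n' > test.properties
git add test.properties
git commit -q -m "initial version in cp1252"
git branch support

# master: convert the file to UTF-8
iconv -f cp1252 -t utf-8 test.properties > test.properties.tmp
mv test.properties.tmp test.properties
git commit -q -a -m "convert to UTF-8"

# support: add a new key, still in cp1252 (\337 = ß)
git checkout -q support
printf 'sz=\337\n' >> test.properties
git commit -q -a -m "add sz"

# master: merge without any renormalization
git checkout -q master
git merge -q --no-edit support

# A file that mixes UTF-8 and cp1252 bytes is not valid UTF-8
if iconv -f utf-8 -t utf-8 test.properties > /dev/null 2>&1; then
    merge_result="valid UTF-8"
else
    merge_result="mixed encodings"
fi
echo "$merge_result"
```

The merge itself exits successfully, so only the explicit validation at the end reveals the problem.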

The solution

You can tell git to convert files on check-in and check-out. This can be used, for example, to have files with Windows line endings (crlf) in the workspace, but with Linux line endings (lf) in the repository. This can be done either globally by setting core.autocrlf to true, or by creating a .gitattributes file in the root of the repository. (core.safecrlf does not convert anything itself; it only makes git warn about or reject line-ending conversions that are not reversible.)

Let's use the .gitattributes file. Our .gitattributes file on the master branch will look like this:
*.properties text filter=convert2utf8
This will set line endings of all properties files to lf in the repository and to native (crlf when using Windows, lf when using Linux) in the workspace. In addition, it configures a filter named convert2utf8. This filter must be defined on each developer machine:
git config --global filter.convert2utf8.clean convert2utf8
git config --global filter.convert2utf8.smudge cat
with the following script at ~/bin/convert2utf8 (and ~/bin added to my PATH):
#!/bin/bash

# create tempfile and write content of stdin into this file
tempfile=$(mktemp)
cat > "$tempfile"

# detect the encoding of the content, e.g. iso-8859-1 or utf-8
sourceencoding=$(file -b --mime-encoding "$tempfile")

# convert the file to utf-8 and write the result to stdout
iconv -f "$sourceencoding" -t utf-8 "$tempfile"

# remember exit value
exitvalue=$?

# remove the tempfile again
rm "$tempfile"

# return exit value of iconv, not exit value of rm
exit $exitvalue
What does this do? Every time I stage or commit changes to *.properties files in this project, git pipes the file content through the convert2utf8 script. This ensures that I do not end up with a file encoded in cp1252 in the repository.
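To confirm that the line in .gitattributes really applies to a given path, git can be asked directly with git check-attr. The sketch below uses a temporary repository as a stand-in for the real project; the path test.properties does not even need to exist for the lookup to work.

```shell
#!/bin/bash
# Sketch: ask git which attributes it would apply to a path.
set -e

repo=$(mktemp -d)     # temporary stand-in for the real repository
cd "$repo"
git init -q
printf '*.properties text filter=convert2utf8\n' > .gitattributes

# Query the two attributes set by the line above
attrs=$(git check-attr text filter -- test.properties)
echo "$attrs"
```

If the filter attribute is reported as unspecified instead of convert2utf8, the pattern in .gitattributes does not match the path, and the script will never run for that file.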

But what about our problem of merges rendering our file unusable?
git merge support -Xrenormalize
produces the following file:
ae=ä
ue=ü
oe=ö
a=a
b=b
c=c
d=d
e=e
sz=ß
Bingo!

What happened here?

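With -Xrenormalize, git performs a virtual check-out and check-in of all three versions of the file that take part in the three-way merge (the common ancestor and the tips of both branches). This means convert2utf8 is applied to each of them before git tries to merge the changes. Although the changes on the support branch were made in cp1252, git merges in a version converted to UTF-8.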
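The effect of -Xrenormalize can be checked end to end in a throwaway repository. This is a sketch with several assumptions that differ from the setup above: the filter is configured per repository instead of globally, the script lives in a temporary directory and is started through bash, and the clean filter is a simplified stand-in that does not call file. It passes content through when it is already valid UTF-8 and otherwise treats it as cp1252.

```shell
#!/bin/bash
# Sketch: merge a cp1252 branch into a UTF-8 branch with -Xrenormalize.
set -e

bindir=$(mktemp -d)
repo=$(mktemp -d)

# Stand-in clean filter (not the script from the article):
# valid UTF-8 passes through, anything else is converted from cp1252.
cat > "$bindir/convert2utf8" <<'EOF'
#!/bin/bash
tempfile=$(mktemp)
cat > "$tempfile"
if iconv -f utf-8 -t utf-8 "$tempfile" > /dev/null 2>&1; then
    cat "$tempfile"
else
    iconv -f cp1252 -t utf-8 "$tempfile"
fi
exitvalue=$?
rm "$tempfile"
exit $exitvalue
EOF

cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"               # placeholder identity
git config user.email "example@example.invalid"
git config core.autocrlf false
git config filter.convert2utf8.clean "bash $bindir/convert2utf8"
git config filter.convert2utf8.smudge cat

# Initial version in cp1252 (\344 = ä, \374 = ü, \366 = ö), then branch off
printf 'ae=\344\nue=\374\noe=\366\na=a\nb=b\nc=c\nd=d\ne=e\n' > test.properties
git add test.properties
git commit -q -m "initial version in cp1252"
git branch support

# master: convert to UTF-8 and enable the filter via .gitattributes
iconv -f cp1252 -t utf-8 test.properties > test.properties.tmp
mv test.properties.tmp test.properties
printf '*.properties text filter=convert2utf8\n' > .gitattributes
git add .gitattributes test.properties
git commit -q -m "convert to UTF-8 and add filter"

# support: add a new key, still in cp1252 (\337 = ß)
git checkout -q support
printf 'sz=\337\n' >> test.properties
git commit -q -a -m "add sz"

# master: merge with renormalization
git checkout -q master
git merge -q --no-edit -Xrenormalize support

# The merged file should now be valid UTF-8 throughout
iconv -f utf-8 -t utf-8 test.properties > /dev/null
last_line=$(tail -n 1 test.properties)
echo "$last_line"
```

The support branch itself is left untouched and still contains cp1252; only the versions fed into the merge are converted.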

I can even instruct git to always renormalize:
git config --global merge.renormalize true
From now on, all merges from the old branches into the new branch will work correctly. This also prevents merge conflicts when the old branch committed files using crlf and all files are now committed with lf only. The same mechanism would even let you reformat all your source code according to a new code formatting rule, as long as there is a command line tool that formats a file.