[RFC] Unicode/UTF-8 support for XFS
Ben Myers
2014-09-11 20:37:35 UTC
Hi,

I'm posting this RFC on Olaf's behalf, as he is busy with other projects.

First is a series of kernel patches, then a series of patches for
xfsprogs, and then a test.

Note that I have removed the unicode database files prior to posting due
to their large size. There are instructions on how to download them in
the relevant commit headers.

Thanks,
Ben

Here are some notes of introduction from Olaf:

-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS

So we had a customer request proper unicode support...


Design notes.

XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This raises the question of what criteria a byte string
must meet to be UTF-8. We settled on the following:
- Valid unicode code points are 0..0x10FFFF, except that
- The surrogates 0xD800..0xDFFF are not valid code points, and
- Valid UTF-8 must be a shortest encoding of a valid unicode code point.

In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).
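
As an illustration of the criteria above, here is a minimal userspace sketch
of a validity check. It is not taken from the patch series, which validates
through trie lookups instead. The names utf8_seq_len(), sketch_utf8n_valid()
and sketch_utf8_valid() are made up for this example; the utf8n/utf8 split
only mirrors the length-limited versus NUL-terminated distinction.

```c
#include <stddef.h>
#include <string.h>

/*
 * Length in bytes of the valid UTF-8 sequence starting at s, or 0 if the
 * next bytes are not a valid sequence. At most len bytes are examined.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t len)
{
	/* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
	static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	unsigned long cp;
	size_t need, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;			/* single byte (ASCII) */
	if ((s[0] & 0xE0) == 0xC0) {
		need = 2; cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		need = 3; cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		need = 4; cp = s[0] & 0x07;
	} else {
		return 0;			/* stray continuation or 0xF8..0xFF */
	}
	if (len < need)
		return 0;			/* truncated sequence */
	for (i = 1; i < need; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;		/* not a continuation byte */
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min_cp[need])
		return 0;			/* not the shortest encoding */
	if (cp > 0x10FFFF)
		return 0;			/* beyond the last code point */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;			/* surrogate */
	return need;
}

/* Length-limited variant: stops at len bytes or at a NUL, whichever is first. */
static int sketch_utf8n_valid(const unsigned char *s, size_t len)
{
	while (len > 0 && *s != '\0') {
		size_t n = utf8_seq_len(s, len);

		if (n == 0)
			return 0;
		s += n;
		len -= n;
	}
	return 1;
}

/* NUL-terminated variant. */
static int sketch_utf8_valid(const char *s)
{
	return sketch_utf8n_valid((const unsigned char *)s, strlen(s));
}
```

A name for which this check fails would be the kind of string that the
design treats as a binary blob rather than rejecting.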

Based on feedback on the earlier patches for unicode/UTF-8 support, we
decided that a filename that does not match the above criteria should be
treated as a binary blob, as opposed to being rejected. To stress: if any
part of the string isn't valid UTF-8, then the entire string is treated
as a binary blob. This matters once normalization is considered.

When comparing unicode strings for equality, normalization comes into play:
we must compare the normalized forms of strings, not just the raw sequences
of bytes. There are a number of defined normalization forms for unicode.
We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
because calculating NFC requires calculating NFD first, followed by an
additional step. NFKD was chosen over NFD because this makes filenames
that ought to be equal compare as equal. My favorite example is the ways
"office" can be spelled, when "fi" or "ffi" ligatures are used. NFKDI adds
one more step to NFKD, in that it eliminates the code points that have the
Default_Ignorable_Code_Point property from the comparison. These code
points are as a rule invisible, but might (or might not) be pulled in when
you copy/paste a string to be used as a filename. An example of these is
U+00AD SOFT HYPHEN, a code point that only shows up if a word is split
across lines.

If a filename is considered to be a binary blob, comparison is based on a
simple binary match. Normalization does not apply to any part of a blob.
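
The comparison rule can be sketched roughly as follows. This is a toy, not
the code from the series: the only "normalization" it performs is dropping
U+00AD SOFT HYPHEN (bytes 0xC2 0xAD), which stands in for the full NFKDI
step, and it assumes that a blob on either side forces a binary compare.
The names sketch_valid(), skip_ignorable() and sketch_names_match() are
invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the len bytes at s are well-formed UTF-8, else 0. */
static int sketch_valid(const unsigned char *s, size_t len)
{
	static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i = 0;

	while (i < len) {
		unsigned long cp = s[i];
		size_t need, k;

		if (cp < 0x80) {
			i++;
			continue;
		}
		if ((cp & 0xE0) == 0xC0) {
			need = 2; cp &= 0x1F;
		} else if ((cp & 0xF0) == 0xE0) {
			need = 3; cp &= 0x0F;
		} else if ((cp & 0xF8) == 0xF0) {
			need = 4; cp &= 0x07;
		} else {
			return 0;
		}
		if (len - i < need)
			return 0;
		for (k = 1; k < need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min_cp[need] || cp > 0x10FFFF ||
		    (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		i += need;
	}
	return 1;
}

/* Advance *i past any run of U+00AD SOFT HYPHEN at the current position. */
static void skip_ignorable(const unsigned char *s, size_t len, size_t *i)
{
	while (*i + 1 < len && s[*i] == 0xC2 && s[*i + 1] == 0xAD)
		*i += 2;
}

/* Return 1 if the two names are considered the same name, else 0. */
static int sketch_names_match(const char *a_, size_t alen,
			      const char *b_, size_t blen)
{
	const unsigned char *a = (const unsigned char *)a_;
	const unsigned char *b = (const unsigned char *)b_;
	size_t i = 0, j = 0;

	/* Any invalid byte makes the whole name a blob: binary match only. */
	if (!sketch_valid(a, alen) || !sketch_valid(b, blen))
		return alen == blen && memcmp(a, b, alen) == 0;

	/* Both valid: compare the normalized forms on the fly. */
	for (;;) {
		skip_ignorable(a, alen, &i);
		skip_ignorable(b, blen, &j);
		if (i == alen || j == blen)
			return i == alen && j == blen;
		if (a[i++] != b[j++])
			return 0;
	}
}
```

With this sketch, "of", U+00AD, "fice" matches "office", but the same two
names with a stray 0xFF byte added to each no longer match, because the soft
hyphen inside a blob is just two more bytes.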

The code uses ("leverages", in corp-speak) the existing infrastructure for
case-insensitive filenames. Like the CI code, the name used to create a
file is stored on disk, and returned in a lookup. When comparing filenames
the normalized forms of the names being compared are generated on the fly
from the non-normalized forms stored on disk.

If the borgbit (the bit enabling legacy ASCII-based CI) is set in the
superblock, then case folding is added into the mix. This normalization
form we call NFKDICF. It allows for the creation of case-insensitive
filesystems with UTF-8 support.

-----------------------------------------------------------------------------
Implementation notes.

Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size. The
trie is not checked in: instead we add the source files from the Unicode
Character Database and a program that creates the header containing the
trie.

The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.
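
To show why a trie keyed on UTF-8 bytes doubles as a validator, here is a
deliberately tiny sketch. It is not the table format used by utf8data.h; it
is a plain pointer-free trie over a fixed node pool, and the caller inserts
only a handful of sequences where the real trie covers every valid one.
The names pool, trie_insert() and trie_lookup() are invented here.

```c
#include <stddef.h>

#define MAX_NODES 64

struct trie_node {
	short child[256];	/* index into pool[], 0 means "no edge" */
	int is_leaf;		/* set where a complete sequence ends */
};

static struct trie_node pool[MAX_NODES];	/* pool[0] is the root */
static int nodes_used = 1;

/* Insert one complete UTF-8 sequence. Returns 0, or -1 if the pool is full. */
static int trie_insert(const unsigned char *seq, size_t len)
{
	int n = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (pool[n].child[seq[i]] == 0) {
			if (nodes_used == MAX_NODES)
				return -1;
			pool[n].child[seq[i]] = (short)nodes_used++;
		}
		n = pool[n].child[seq[i]];
	}
	pool[n].is_leaf = 1;
	return 0;
}

/*
 * Walk the trie from the root. Returns the number of bytes consumed to
 * reach a leaf, or 0 if the bytes at s do not lead to one. Since no valid
 * UTF-8 sequence is a prefix of another, the first leaf reached is the
 * only candidate.
 */
static size_t trie_lookup(const unsigned char *s, size_t len)
{
	int n = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		n = pool[n].child[s[i]];
		if (n == 0)
			return 0;	/* fell off the trie: invalid here */
		if (pool[n].is_leaf)
			return i + 1;
	}
	return 0;			/* ran out of bytes mid-sequence */
}
```

A lookup that returns 0 plays the role of "this is not valid UTF-8", so no
separate validation pass is needed before normalizing.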

The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.
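
One way to picture the version information, purely as an assumption about
how it could be used: each leaf records the unicode version that introduced
the code point, and a code point newer than the filesystem's version is left
alone instead of being decomposed. SKETCH_AGE(), struct sketch_leaf and
sketch_decomp() are hypothetical names, and the packing of the version into
an integer is chosen only for this example.

```c
#include <stddef.h>

/* Pack a unicode version into one comparable integer (sketch only). */
#define SKETCH_AGE(maj, min, rev) \
	(((unsigned int)(maj) << 16) | ((unsigned int)(min) << 8) | \
	 (unsigned int)(rev))

struct sketch_leaf {
	unsigned int age;	/* version that first defined the code point */
	const char *decomp;	/* normalized form as UTF-8, NULL if none */
};

/*
 * Return the decomposition to apply on a filesystem created with unicode
 * version fs_age, or NULL if the code point must be left as it is.
 */
static const char *sketch_decomp(const struct sketch_leaf *leaf,
				 unsigned int fs_age)
{
	if (leaf->age > fs_age)
		return NULL;	/* not yet defined for this filesystem */
	return leaf->decomp;
}
```

Under this scheme, names stored on an older filesystem keep comparing the
same way after the kernel's tables move to a newer unicode version.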

The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.

The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.

The non-XFS-specific supporting code is in separate source files, and can be
put in some other location in the Linux kernel source tree, if desired.
These functions have the prefix 'utf8n' if they handle length-limited
strings, and 'utf8' if they handle NUL-terminated strings.
-----------------------------------------------------------------------------
Ben Myers
2014-09-11 20:40:10 UTC
From: Olaf Weber <***@sgi.com>

Change the XFS case-insensitive lookup code to return the first match
found, even if it is not an exact match. Whether a filesystem uses
case-insensitive lookups is determined by a superblock bit set during
filesystem creation. This means that normal use cannot create two files
that both match the same filename.

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_dir2_block.c | 17 +++------
fs/xfs/libxfs/xfs_dir2_leaf.c | 37 ++++----------------
fs/xfs/libxfs/xfs_dir2_node.c | 79 ++++++++++++++++--------------------------
fs/xfs/libxfs/xfs_dir2_sf.c | 8 ++---
4 files changed, 45 insertions(+), 96 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 9628cec..990bf0c 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -725,28 +725,21 @@ xfs_dir2_block_lookup_int(
dep = (xfs_dir2_data_entry_t *)
((char *)hdr + xfs_dir2_dataptr_to_off(args->geo, addr));
/*
- * Compare name and if it's an exact match, return the index
- * and buffer. If it's the first case-insensitive match, store
- * the index and buffer and continue looking for an exact match.
+ * Compare name and if it's a match, return the
+ * index and buffer.
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
*bpp = bp;
*entno = mid;
- if (cmp == XFS_CMP_EXACT)
- return 0;
+ return 0;
}
} while (++mid < be32_to_cpu(btp->count) &&
be32_to_cpu(blp[mid].hashval) == hash);

ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
- /*
- * Here, we can only be doing a lookup (not a rename or replace).
- * If a case-insensitive match was found earlier, return success.
- */
- if (args->cmpresult == XFS_CMP_CASE)
- return 0;
+ ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
/*
* No match, release the buffer and return ENOENT.
*/
diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c
index a19174e..3d572ee 100644
--- a/fs/xfs/libxfs/xfs_dir2_leaf.c
+++ b/fs/xfs/libxfs/xfs_dir2_leaf.c
@@ -1226,7 +1226,6 @@ xfs_dir2_leaf_lookup_int(
xfs_mount_t *mp; /* filesystem mount point */
xfs_dir2_db_t newdb; /* new data block number */
xfs_trans_t *tp; /* transaction pointer */
- xfs_dir2_db_t cidb = -1; /* case match data block no. */
enum xfs_dacmp cmp; /* name compare result */
struct xfs_dir2_leaf_entry *ents;
struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1290,46 +1289,22 @@ xfs_dir2_leaf_lookup_int(
be32_to_cpu(lep->address)));
/*
* Compare name and if it's an exact match, return the index
- * and buffer. If it's the first case-insensitive match, store
- * the index and buffer and continue looking for an exact match.
+ * and buffer
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
*indexp = index;
- /* case exact match: return the current buffer. */
- if (cmp == XFS_CMP_EXACT) {
- *dbpp = dbp;
- return 0;
- }
- cidb = curdb;
+ *dbpp = dbp;
+ return 0;
}
}
ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
- /*
- * Here, we can only be doing a lookup (not a rename or remove).
- * If a case-insensitive match was found earlier, re-read the
- * appropriate data block if required and return it.
- */
- if (args->cmpresult == XFS_CMP_CASE) {
- ASSERT(cidb != -1);
- if (cidb != curdb) {
- xfs_trans_brelse(tp, dbp);
- error = xfs_dir3_data_read(tp, dp,
- xfs_dir2_db_to_da(args->geo, cidb),
- -1, &dbp);
- if (error) {
- xfs_trans_brelse(tp, lbp);
- return error;
- }
- }
- *dbpp = dbp;
- return 0;
- }
+ ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
+
/*
* No match found, return -ENOENT.
*/
- ASSERT(cidb == -1);
if (dbp)
xfs_trans_brelse(tp, dbp);
xfs_trans_brelse(tp, lbp);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 2ae6ac2..1778c40 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -679,6 +679,7 @@ xfs_dir2_leafn_lookup_for_entry(
xfs_dir2_data_entry_t *dep; /* data block entry */
xfs_inode_t *dp; /* incore directory inode */
int error; /* error return value */
+ int di = -1; /* data entry index */
int index; /* leaf entry index */
xfs_dir2_leaf_t *leaf; /* leaf structure */
xfs_dir2_leaf_entry_t *lep; /* leaf entry */
@@ -709,6 +710,7 @@ xfs_dir2_leafn_lookup_for_entry(
if (state->extravalid) {
curbp = state->extrablk.bp;
curdb = state->extrablk.blkno;
+ di = state->extrablk.index;
}
/*
* Loop over leaf entries with the right hash value.
@@ -734,28 +736,20 @@ xfs_dir2_leafn_lookup_for_entry(
*/
if (newdb != curdb) {
/*
- * If we had a block before that we aren't saving
- * for a CI name, drop it
+ * If we had a block, drop it
*/
- if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
- curdb != state->extrablk.blkno))
+ if (curbp) {
xfs_trans_brelse(tp, curbp);
+ di = -1;
+ }
/*
- * If needing the block that is saved with a CI match,
- * use it otherwise read in the new data block.
+ * Read in the new data block.
*/
- if (args->cmpresult != XFS_CMP_DIFFERENT &&
- newdb == state->extrablk.blkno) {
- ASSERT(state->extravalid);
- curbp = state->extrablk.bp;
- } else {
- error = xfs_dir3_data_read(tp, dp,
- xfs_dir2_db_to_da(args->geo,
- newdb),
+ error = xfs_dir3_data_read(tp, dp,
+ xfs_dir2_db_to_da(args->geo, newdb),
-1, &curbp);
- if (error)
- return error;
- }
+ if (error)
+ return error;
xfs_dir3_data_check(dp, curbp);
curdb = newdb;
}
@@ -766,53 +760,40 @@ xfs_dir2_leafn_lookup_for_entry(
xfs_dir2_dataptr_to_off(args->geo,
be32_to_cpu(lep->address)));
/*
- * Compare the entry and if it's an exact match, return
- * EEXIST immediately. If it's the first case-insensitive
- * match, store the block & inode number and continue looking.
+ * Compare the entry and if it's a match, return
+ * EEXIST immediately.
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
- /* If there is a CI match block, drop it */
- if (args->cmpresult != XFS_CMP_DIFFERENT &&
- curdb != state->extrablk.blkno)
- xfs_trans_brelse(tp, state->extrablk.bp);
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
args->inumber = be64_to_cpu(dep->inumber);
args->filetype = dp->d_ops->data_get_ftype(dep);
- *indexp = index;
- state->extravalid = 1;
- state->extrablk.bp = curbp;
- state->extrablk.blkno = curdb;
- state->extrablk.index = (int)((char *)dep -
- (char *)curbp->b_addr);
- state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
curbp->b_ops = &xfs_dir3_data_buf_ops;
xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
- if (cmp == XFS_CMP_EXACT)
- return -EEXIST;
+ di = (int)((char *)dep - (char *)curbp->b_addr);
+ error = -EEXIST;
+ goto out;
+
}
}
+ /* Didn't find a match */
+ error = -ENOENT;
ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
if (curbp) {
- if (args->cmpresult == XFS_CMP_DIFFERENT) {
- /* Giving back last used data block. */
- state->extravalid = 1;
- state->extrablk.bp = curbp;
- state->extrablk.index = -1;
- state->extrablk.blkno = curdb;
- state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
- curbp->b_ops = &xfs_dir3_data_buf_ops;
- xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
- } else {
- /* If the curbp is not the CI match block, drop it */
- if (state->extrablk.bp != curbp)
- xfs_trans_brelse(tp, curbp);
- }
+ /* Giving back last used data block. */
+ state->extravalid = 1;
+ state->extrablk.bp = curbp;
+ state->extrablk.index = di;
+ state->extrablk.blkno = curdb;
+ state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+ curbp->b_ops = &xfs_dir3_data_buf_ops;
+ xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
} else {
state->extravalid = 0;
}
*indexp = index;
- return -ENOENT;
+ return error;
}

/*
diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c
index 5079e05..e69fdb7 100644
--- a/fs/xfs/libxfs/xfs_dir2_sf.c
+++ b/fs/xfs/libxfs/xfs_dir2_sf.c
@@ -757,19 +757,19 @@ xfs_dir2_sf_lookup(
for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
i++, sfep = dp->d_ops->sf_nextentry(sfp, sfep)) {
/*
- * Compare name and if it's an exact match, return the inode
- * number. If it's the first case-insensitive match, store the
- * inode number and continue looking for an exact match.
+ * Compare name and if it's a match, return the inode
+ * number.
*/
cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
sfep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
args->inumber = dp->d_ops->sf_get_ino(sfp, sfep);
args->filetype = dp->d_ops->sf_get_ftype(sfep);
if (cmp == XFS_CMP_EXACT)
return -EEXIST;
ci_sfep = sfep;
+ break;
}
}
ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
--
1.7.12.4
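
For readers who want the behavioural change of the patch above in isolation,
here is a small userspace sketch. It does not use the XFS directory
structures: a directory is modelled as an array of strings, and
sketch_compname() and sketch_lookup() are invented names. The comparison is
ASCII case-insensitive, as in the existing CI code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum sketch_cmp { SKETCH_DIFFERENT, SKETCH_EXACT, SKETCH_CASE };

/* Compare two names: exact, equal up to ASCII case, or different. */
static enum sketch_cmp sketch_compname(const char *want, const char *have)
{
	enum sketch_cmp result = SKETCH_EXACT;
	size_t i, len = strlen(want);

	if (len != strlen(have))
		return SKETCH_DIFFERENT;
	for (i = 0; i < len; i++) {
		if (want[i] == have[i])
			continue;
		if (tolower((unsigned char)want[i]) !=
		    tolower((unsigned char)have[i]))
			return SKETCH_DIFFERENT;
		result = SKETCH_CASE;
	}
	return result;
}

/*
 * Return the index of the first entry that matches want, exact or not,
 * and report the kind of match in *how. Returns -1 if nothing matches.
 */
static int sketch_lookup(const char *const *entries, size_t count,
			 const char *want, enum sketch_cmp *how)
{
	size_t i;

	for (i = 0; i < count; i++) {
		enum sketch_cmp cmp = sketch_compname(want, entries[i]);

		if (cmp != SKETCH_DIFFERENT) {
			*how = cmp;
			return (int)i;	/* stop here, no search for exact */
		}
	}
	*how = SKETCH_DIFFERENT;
	return -1;
}
```

Given the entries "README", "Makefile", "makefile", a lookup of "makefile"
returns index 1 with SKETCH_CASE, where the old logic would have kept
scanning for the exact match at index 2. As the commit message notes, such a
directory should not arise in normal use, since the CI bit is fixed at
filesystem creation.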
Ben Myers
2014-09-11 20:41:45 UTC
From: Olaf Weber <***@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and
normalization, different strings will match on other criteria than
case insensitivity.

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_da_btree.h | 2 +-
fs/xfs/libxfs/xfs_dir2.c | 9 ++++++---
fs/xfs/libxfs/xfs_dir2_node.c | 2 +-
3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6e153e3..9ebcc23 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -52,7 +52,7 @@ struct xfs_da_geometry {
enum xfs_dacmp {
XFS_CMP_DIFFERENT, /* names are completely different */
XFS_CMP_EXACT, /* names are exactly the same */
- XFS_CMP_CASE /* names are same but differ in case */
+ XFS_CMP_MATCH /* names are same but differ in encoding */
};

/*
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 6cef221..32e769b 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -74,7 +74,7 @@ xfs_ascii_ci_compname(
continue;
if (tolower(args->name[i]) != tolower(name[i]))
return XFS_CMP_DIFFERENT;
- result = XFS_CMP_CASE;
+ result = XFS_CMP_MATCH;
}

return result;
@@ -315,8 +315,11 @@ xfs_dir_cilookup_result(
{
if (args->cmpresult == XFS_CMP_DIFFERENT)
return -ENOENT;
- if (args->cmpresult != XFS_CMP_CASE ||
- !(args->op_flags & XFS_DA_OP_CILOOKUP))
+ if (args->cmpresult == XFS_CMP_EXACT)
+ return -EEXIST;
+ ASSERT(args->cmpresult == XFS_CMP_MATCH);
+ /* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+ if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
return -EEXIST;

args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 1778c40..9d46e8d 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -2023,7 +2023,7 @@ xfs_dir2_node_lookup(
error = xfs_da3_node_lookup_int(state, &rval);
if (error)
rval = error;
- else if (rval == -ENOENT && args->cmpresult == XFS_CMP_CASE) {
+ else if (rval == -ENOENT && args->cmpresult == XFS_CMP_MATCH) {
/* If a CI match, dup the actual name and return -EEXIST */
xfs_dir2_data_entry_t *dep;
--
1.7.12.4
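
The result handling that the patch above spells out in
xfs_dir_cilookup_result() can be restated as a standalone sketch. The struct
and flag below are cut-down stand-ins (sketch_args, SKETCH_OP_CILOOKUP), and
malloc() replaces the kernel allocator; only the three-way decision is meant
to mirror the diff.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum sketch_cmp { SKETCH_DIFFERENT, SKETCH_EXACT, SKETCH_MATCH };

#define SKETCH_OP_CILOOKUP 0x1	/* caller wants the on-disk name back */

struct sketch_args {
	enum sketch_cmp cmpresult;	/* outcome of the name comparison */
	unsigned int op_flags;
	char *value;			/* receives a copy of the found name */
	size_t valuelen;
};

/*
 * Map a comparison result to a return code. name/len is the name as it is
 * stored in the directory, assumed to be non-empty.
 */
static int sketch_cilookup_result(struct sketch_args *args,
				  const char *name, size_t len)
{
	if (args->cmpresult == SKETCH_DIFFERENT)
		return -ENOENT;
	if (args->cmpresult == SKETCH_EXACT)
		return -EEXIST;
	/* Inexact match: only dup the found name if the caller asked. */
	if (!(args->op_flags & SKETCH_OP_CILOOKUP))
		return -EEXIST;

	args->value = malloc(len);
	if (!args->value)
		return -ENOMEM;
	memcpy(args->value, name, len);
	args->valuelen = len;
	return -EEXIST;
}
```

The point of the rename is visible here: nothing in this decision depends on
the match being a case difference, so the same path can serve matches that
arise from normalization.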
Ben Myers
2014-09-11 20:42:55 UTC
From: Olaf Weber <***@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, and calculates a hash value over the name. It may
in the process create a normalized form of the name, and assign that to the
norm/normlen fields in the xfs_da_args structure.

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_da_btree.c | 9 +++++++++
fs/xfs/libxfs/xfs_da_btree.h | 3 +++
fs/xfs/libxfs/xfs_dir2.c | 42 +++++++++++++++++++++++++++++++++++++-----
3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 2c42ae2..07a3acf 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1990,8 +1990,17 @@ xfs_default_hashname(
return xfs_da_hashname(name->name, name->len);
}

+STATIC int
+xfs_da_normhash(
+ struct xfs_da_args *args)
+{
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
const struct xfs_nameops xfs_default_nameops = {
.hashname = xfs_default_hashname,
+ .normhash = xfs_da_normhash,
.compname = xfs_da_compname
};

diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 9ebcc23..6cdafee 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -61,7 +61,9 @@ enum xfs_dacmp {
typedef struct xfs_da_args {
struct xfs_da_geometry *geo; /* da block geometry */
const __uint8_t *name; /* string (maybe not NULL terminated) */
+ const __uint8_t *norm; /* normalized name (may be NULL) */
int namelen; /* length of string (maybe no NULL) */
+ int normlen; /* length of normalized name */
__uint8_t filetype; /* filetype of inode for directories */
__uint8_t *value; /* set of bytes (maybe contain NULLs) */
int valuelen; /* length of value */
@@ -150,6 +152,7 @@ typedef struct xfs_da_state {
*/
struct xfs_nameops {
xfs_dahash_t (*hashname)(struct xfs_name *);
+ int (*normhash)(struct xfs_da_args *);
enum xfs_dacmp (*compname)(struct xfs_da_args *,
const unsigned char *, int);
};
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 32e769b..55733a6 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -56,6 +56,21 @@ xfs_ascii_ci_hashname(
return hash;
}

+STATIC int
+xfs_ascii_ci_normhash(
+ struct xfs_da_args *args)
+{
+ xfs_dahash_t hash;
+ int i;
+
+ for (i = 0, hash = 0; i < args->namelen; i++)
+ hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+ args->hashval = hash;
+ return 0;
+}
+
+
STATIC enum xfs_dacmp
xfs_ascii_ci_compname(
struct xfs_da_args *args,
@@ -82,6 +97,7 @@ xfs_ascii_ci_compname(

static struct xfs_nameops xfs_ascii_ci_nameops = {
.hashname = xfs_ascii_ci_hashname,
+ .normhash = xfs_ascii_ci_normhash,
.compname = xfs_ascii_ci_compname,
};

@@ -267,7 +283,6 @@ xfs_dir_createname(
args->name = name->name;
args->namelen = name->len;
args->filetype = name->type;
- args->hashval = dp->i_mount->m_dirnameops->hashname(name);
args->inumber = inum;
args->dp = dp;
args->firstblock = first;
@@ -276,6 +291,8 @@ xfs_dir_createname(
args->whichfork = XFS_DATA_FORK;
args->trans = tp;
args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+ goto out_free;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_addname(args);
@@ -299,6 +316,8 @@ xfs_dir_createname(
rval = xfs_dir2_node_addname(args);

out_free:
+ if (args->norm)
+ kmem_free(args->norm);
kmem_free(args);
return rval;
}
@@ -365,13 +384,14 @@ xfs_dir_lookup(
args->name = name->name;
args->namelen = name->len;
args->filetype = name->type;
- args->hashval = dp->i_mount->m_dirnameops->hashname(name);
args->dp = dp;
args->whichfork = XFS_DATA_FORK;
args->trans = tp;
args->op_flags = XFS_DA_OP_OKNOENT;
if (ci_name)
args->op_flags |= XFS_DA_OP_CILOOKUP;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+ goto out_free;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_lookup(args);
@@ -405,6 +425,9 @@ out_check_rval:
}
}
out_free:
+ if (args->norm)
+ kmem_free(args->norm);
+
kmem_free(args);
return rval;
}
@@ -437,7 +460,6 @@ xfs_dir_removename(
args->name = name->name;
args->namelen = name->len;
args->filetype = name->type;
- args->hashval = dp->i_mount->m_dirnameops->hashname(name);
args->inumber = ino;
args->dp = dp;
args->firstblock = first;
@@ -445,6 +467,8 @@ xfs_dir_removename(
args->total = total;
args->whichfork = XFS_DATA_FORK;
args->trans = tp;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+ goto out_free;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_removename(args);
@@ -467,6 +491,8 @@ xfs_dir_removename(
else
rval = xfs_dir2_node_removename(args);
out_free:
+ if (args->norm)
+ kmem_free(args->norm);
kmem_free(args);
return rval;
}
@@ -502,7 +528,6 @@ xfs_dir_replace(
args->name = name->name;
args->namelen = name->len;
args->filetype = name->type;
- args->hashval = dp->i_mount->m_dirnameops->hashname(name);
args->inumber = inum;
args->dp = dp;
args->firstblock = first;
@@ -510,6 +535,8 @@ xfs_dir_replace(
args->total = total;
args->whichfork = XFS_DATA_FORK;
args->trans = tp;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+ goto out_free;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_replace(args);
@@ -532,6 +559,8 @@ xfs_dir_replace(
else
rval = xfs_dir2_node_replace(args);
out_free:
+ if (args->norm)
+ kmem_free(args->norm);
kmem_free(args);
return rval;
}
@@ -564,12 +593,13 @@ xfs_dir_canenter(
args->name = name->name;
args->namelen = name->len;
args->filetype = name->type;
- args->hashval = dp->i_mount->m_dirnameops->hashname(name);
args->dp = dp;
args->whichfork = XFS_DATA_FORK;
args->trans = tp;
args->op_flags = XFS_DA_OP_JUSTCHECK | XFS_DA_OP_ADDNAME |
XFS_DA_OP_OKNOENT;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(args)))
+ goto out_free;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
rval = xfs_dir2_sf_addname(args);
@@ -592,6 +622,8 @@ xfs_dir_canenter(
else
rval = xfs_dir2_node_addname(args);
out_free:
+ if (args->norm)
+ kmem_free(args->norm);
kmem_free(args);
return rval;
}
--
1.7.12.4
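
A userspace sketch of the callout pattern introduced by the patch above may
help when reading the later patches. The structures are reduced to the
fields that matter (sketch_args, sketch_nameops), and sketch_prepare() is a
hypothetical helper standing in for the setup that each xfs_dir_* entry
point performs. The hash loop follows xfs_ascii_ci_normhash() from the diff.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_args {
	const unsigned char *name;	/* name as supplied by the caller */
	int namelen;
	const unsigned char *norm;	/* normalized name, may stay NULL */
	int normlen;
	uint32_t hashval;
};

struct sketch_nameops {
	int (*normhash)(struct sketch_args *);
};

/* Rotate a 32-bit word left by shift bits. */
static uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* ASCII case-insensitive hash; creates no normalized copy of the name. */
static int sketch_ascii_ci_normhash(struct sketch_args *args)
{
	uint32_t hash = 0;
	int i;

	for (i = 0; i < args->namelen; i++)
		hash = (uint32_t)tolower(args->name[i]) ^ rol32(hash, 7);
	args->hashval = hash;
	return 0;
}

static const struct sketch_nameops sketch_ascii_ci_ops = {
	.normhash = sketch_ascii_ci_normhash,
};

/* Fill in args and let the callout compute the hash. 0 on success. */
static int sketch_prepare(const struct sketch_nameops *ops,
			  struct sketch_args *args,
			  const char *name, int len)
{
	args->name = (const unsigned char *)name;
	args->namelen = len;
	args->norm = NULL;
	args->normlen = 0;
	return ops->normhash(args);
}
```

Because the callout receives the whole argument structure and returns an
error code, a UTF-8 implementation has room to allocate a normalized name,
store it in norm/normlen, and fail cleanly if the allocation does not
succeed. That is also why each caller in the diff frees args->norm on exit.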
Ben Myers
2014-09-11 20:43:56 UTC
From: Olaf Weber <***@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_da_btree.c | 9 +--------
fs/xfs/libxfs/xfs_da_btree.h | 2 +-
fs/xfs/libxfs/xfs_dir2.c | 7 ++++---
fs/xfs/libxfs/xfs_dir2_block.c | 2 +-
fs/xfs/libxfs/xfs_dir2_data.c | 3 ++-
5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 07a3acf..a0608ca 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -1983,13 +1983,6 @@ xfs_da_compname(
XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
}

-static xfs_dahash_t
-xfs_default_hashname(
- struct xfs_name *name)
-{
- return xfs_da_hashname(name->name, name->len);
-}
-
STATIC int
xfs_da_normhash(
struct xfs_da_args *args)
@@ -1999,7 +1992,7 @@ xfs_da_normhash(
}

const struct xfs_nameops xfs_default_nameops = {
- .hashname = xfs_default_hashname,
+ .hashname = xfs_da_hashname,
.normhash = xfs_da_normhash,
.compname = xfs_da_compname
};
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 6cdafee..4d6b36f 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -151,7 +151,7 @@ typedef struct xfs_da_state {
* Name ops for directory and/or attr name operations
*/
struct xfs_nameops {
- xfs_dahash_t (*hashname)(struct xfs_name *);
+ xfs_dahash_t (*hashname)(const unsigned char *, int);
int (*normhash)(struct xfs_da_args *);
enum xfs_dacmp (*compname)(struct xfs_da_args *,
const unsigned char *, int);
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 55733a6..84e5ca9 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -45,13 +45,14 @@ struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };
*/
STATIC xfs_dahash_t
xfs_ascii_ci_hashname(
- struct xfs_name *name)
+ const unsigned char *name,
+ int len)
{
xfs_dahash_t hash;
int i;

- for (i = 0, hash = 0; i < name->len; i++)
- hash = tolower(name->name[i]) ^ rol32(hash, 7);
+ for (i = 0, hash = 0; i < len; i++)
+ hash = tolower(name[i]) ^ rol32(hash, 7);

return hash;
}
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 990bf0c..f93c141 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -1231,7 +1231,7 @@ xfs_dir2_sf_to_block(
name.name = sfep->name;
name.len = sfep->namelen;
blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
- hashname(&name));
+ hashname(sfep->name, sfep->namelen));
blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(
(char *)dep - (char *)hdr));
offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c
index fdd803f..28c35cf 100644
--- a/fs/xfs/libxfs/xfs_dir2_data.c
+++ b/fs/xfs/libxfs/xfs_dir2_data.c
@@ -179,7 +179,8 @@ __xfs_dir3_data_check(
((char *)dep - (char *)hdr));
name.name = dep->name;
name.len = dep->namelen;
- hash = mp->m_dirnameops->hashname(&name);
+ hash = mp->m_dirnameops->hashname(dep->name,
+ dep->namelen);
for (i = 0; i < be32_to_cpu(btp->count); i++) {
if (be32_to_cpu(lep[i].address) == addr &&
be32_to_cpu(lep[i].hashval) == hash)
--
1.7.12.4
Ben Myers
2014-09-11 20:46:06 UTC
From: Olaf Weber <***@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_sb.h | 24 +++++++++++++++++++++++-
fs/xfs/xfs_fs.h | 1 +
fs/xfs/xfs_fsops.c | 4 +++-
fs/xfs/xfs_iops.c | 4 ++--
4 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 2e73970..525eacb 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -70,6 +70,7 @@ struct xfs_trans;
#define XFS_SB_VERSION2_RESERVED4BIT 0x00000004
#define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */
#define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT 0x00000020 /* utf8 names */
#define XFS_SB_VERSION2_PROJID32BIT 0x00000080 /* 32 bit project id */
#define XFS_SB_VERSION2_CRCBIT 0x00000100 /* metadata CRCs */
#define XFS_SB_VERSION2_FTYPE 0x00000200 /* inode type in dir */
@@ -77,6 +78,7 @@ struct xfs_trans;
#define XFS_SB_VERSION2_OKBITS \
(XFS_SB_VERSION2_LAZYSBCOUNTBIT | \
XFS_SB_VERSION2_ATTR2BIT | \
+ XFS_SB_VERSION2_UTF8BIT | \
XFS_SB_VERSION2_PROJID32BIT | \
XFS_SB_VERSION2_FTYPE)

@@ -509,8 +511,10 @@ xfs_sb_has_ro_compat_feature(
}

#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8 (1 << 1) /* utf-8 name support */
#define XFS_SB_FEAT_INCOMPAT_ALL \
- (XFS_SB_FEAT_INCOMPAT_FTYPE)
+ (XFS_SB_FEAT_INCOMPAT_FTYPE | \
+ XFS_SB_FEAT_INCOMPAT_UTF8)

#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool
@@ -558,6 +562,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
}

+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+ return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+ xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+ (xfs_sb_version_hasmorebits(sbp) &&
+ (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+ return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
/*
* end of superblock version macros
*/
diff --git a/fs/xfs/xfs_fs.h b/fs/xfs/xfs_fs.h
index 18dc721..e845d75 100644
--- a/fs/xfs/xfs_fs.h
+++ b/fs/xfs/xfs_fs.h
@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_V5SB 0x8000 /* version 5 superblock */
#define XFS_FSOP_GEOM_FLAGS_FTYPE 0x10000 /* inode directory types */
#define XFS_FSOP_GEOM_FLAGS_FINOBT 0x20000 /* free inode btree */
+#define XFS_FSOP_GEOM_FLAGS_UTF8 0x40000 /* utf8 filenames */

/*
* Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index f91de1e..1a83eef 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -103,7 +103,9 @@ xfs_fs_geometry(
(xfs_sb_version_hasftype(&mp->m_sb) ?
XFS_FSOP_GEOM_FLAGS_FTYPE : 0) |
(xfs_sb_version_hasfinobt(&mp->m_sb) ?
- XFS_FSOP_GEOM_FLAGS_FINOBT : 0);
+ XFS_FSOP_GEOM_FLAGS_FINOBT : 0) |
+ (xfs_sb_version_hasutf8(&mp->m_sb) ?
+ XFS_FSOP_GEOM_FLAGS_UTF8 : 0);
geo->logsectsize = xfs_sb_version_hassector(&mp->m_sb) ?
mp->m_sb.sb_logsectsize : BBSIZE;
geo->rtsectsize = mp->m_sb.sb_blocksize;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 7212949..cea3d64 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -335,9 +335,9 @@ xfs_vn_unlink(
/*
* With unlink, the VFS makes the dentry "negative": no inode,
* but still hashed. This is incompatible with case-insensitive
- * mode, so invalidate (unhash) the dentry in CI-mode.
+ * or utf8 mode, so invalidate (unhash) the dentry in CI-mode.
*/
- if (xfs_sb_version_hasasciici(&XFS_M(dir->i_sb)->m_sb))
+ if (xfs_sb_version_hasci(&XFS_M(dir->i_sb)->m_sb))
d_invalidate(dentry);
return 0;
}
--
1.7.12.4
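
The feature tests added by the patch above boil down to a few mask checks.
The sketch below is a simplification for illustration: struct sketch_sb is
not the real superblock, and SKETCH_BORGBIT and SKETCH_MOREBITSBIT are
placeholder masks chosen for the example. Only the two UTF-8 bit values are
copied from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_VERSION_MASK	0x000f	/* placeholder: version number bits */
#define SKETCH_BORGBIT		0x4000	/* placeholder: legacy ASCII CI */
#define SKETCH_MOREBITSBIT	0x8000	/* placeholder: features2 is valid */
#define SKETCH_VERSION2_UTF8BIT	0x00000020	/* from the diff */
#define SKETCH_INCOMPAT_UTF8	(1 << 1)	/* from the diff */

struct sketch_sb {
	uint16_t versionnum;		/* version number plus flag bits */
	uint32_t features2;
	uint32_t features_incompat;	/* only meaningful for version 5 */
};

static bool sketch_hasasciici(const struct sketch_sb *sb)
{
	return (sb->versionnum & SKETCH_BORGBIT) != 0;
}

/* UTF-8 is on if either the v5 incompat bit or the features2 bit is set. */
static bool sketch_hasutf8(const struct sketch_sb *sb)
{
	unsigned int num = sb->versionnum & SKETCH_VERSION_MASK;

	return (num == 5 &&
		(sb->features_incompat & SKETCH_INCOMPAT_UTF8)) ||
	       ((sb->versionnum & SKETCH_MOREBITSBIT) &&
		(sb->features2 & SKETCH_VERSION2_UTF8BIT));
}

/* True if names can match without being byte-identical. */
static bool sketch_hasci(const struct sketch_sb *sb)
{
	return sketch_hasasciici(sb) || sketch_hasutf8(sb);
}
```

The combined test is what matters to callers such as xfs_vn_unlink(): in
either mode a negative dentry for one spelling of a name could be stale for
another spelling, so the dentry is invalidated.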
Ben Myers
2014-09-11 20:47:01 UTC
From: Olaf Weber <***@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <***@sgi.com>
---
[v2: Removed large unicode files prior to posting. Get them as below.
-bpm]

cd fs/xfs/support/ucd
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
for e in *.txt
do
base=`basename $e .txt`
mv $e $base-7.0.0.txt
done
---
fs/xfs/support/ucd/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 fs/xfs/support/ucd/README

diff --git a/fs/xfs/support/ucd/README b/fs/xfs/support/ucd/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/fs/xfs/support/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt
+ 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt
+ 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt
+ c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt
+ 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt
+ 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt
+ c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt
--
1.7.12.4
Ben Myers
2014-09-11 20:48:17 UTC
From: Olaf Weber <***@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c.
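
As an illustration of how compact the trie is, here is a minimal sketch that
unpacks the first byte of an internal node. It assumes the emitted table uses
the bit layout documented in the utf8trie_t comment in mkutf8data.c; the
struct node_info type and decode_node() helper are hypothetical and are not
part of the patch.

```c
/* Bit layout of a node's first byte, copied from mkutf8data.c. */
#define BITNUM		0x07
#define NEXTBYTE	0x08
#define OFFLEN		0x30
#define OFFLEN_SHIFT	4
#define RIGHTPATH	0x40
#define TRIENODE	0x80
#define RIGHTNODE	0x40
#define LEFTNODE	0x80

/* Hypothetical helper type: the fields of one decoded header byte. */
struct node_info {
	int bitnum;		/* bit of the current key byte to test */
	int nextbyte;		/* advance to the next key byte first */
	int offlen;		/* offset bytes following the header, 0..3 */
	int branching;		/* offlen != 0: both children are present */
	int left_internal;	/* branching only: left child is a node */
	int right_internal;	/* branching only: right child is a node */
	int rightpath;		/* non-branching only: child is on the right */
	int child_internal;	/* non-branching only: child is a node */
};

static struct node_info
decode_node(unsigned char byte)
{
	struct node_info info = {0};

	info.bitnum = byte & BITNUM;
	info.nextbyte = !!(byte & NEXTBYTE);
	info.offlen = (byte & OFFLEN) >> OFFLEN_SHIFT;
	info.branching = info.offlen != 0;
	if (info.branching) {
		/* Bits 6 and 7 describe the two children. */
		info.left_internal = !!(byte & LEFTNODE);
		info.right_internal = !!(byte & RIGHTNODE);
	} else {
		/* The same two bits describe the single child. */
		info.rightpath = !!(byte & RIGHTPATH);
		info.child_internal = !!(byte & TRIENODE);
	}
	return info;
}
```

Bits 6 and 7 are reused for two meanings, which is why the OFFLEN field has
to be inspected before they can be interpreted.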

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
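
Both forms build on NFKD, which includes a canonical reordering step: runs of
characters with a non-zero Canonical Combining Class (CCC) are stably sorted
into ascending CCC order, and characters with CCC 0 (stoppers) stay in place.
The following is only a sketch of that one step, not the code in utf8norm.c.
The struct cp type and canonical_order() are hypothetical, and the caller is
assumed to have already looked up the CCC of each code point.

```c
#include <stddef.h>

/* Hypothetical pair: an already decomposed code point and its CCC. */
struct cp {
	unsigned int code;
	unsigned char ccc;
};

/*
 * Stable insertion sort of each run of non-stoppers by ascending CCC.
 * An element only moves past neighbours with a strictly larger CCC,
 * so equal CCC values keep their order and no element crosses a
 * stopper (CCC 0).
 */
static void
canonical_order(struct cp *s, size_t n)
{
	size_t i, j;

	for (i = 1; i < n; i++) {
		struct cp tmp = s[i];

		if (tmp.ccc == 0)
			continue;	/* stoppers never move */
		j = i;
		while (j > 0 && s[j - 1].ccc > tmp.ccc) {
			s[j] = s[j - 1];
			j--;
		}
		s[j] = tmp;
	}
}
```

For example, "a" followed by U+0301 (CCC 230) and U+0323 (CCC 220) is
reordered so that U+0323 comes first, while two marks that both have CCC 230
keep their original order.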

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.
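
The three rules above can be checked in a single pass over the string. This
is a minimal sketch for NUL-terminated strings only; utf8_is_valid() is a
hypothetical name and is not one of the functions added by this patch.

```c
/*
 * Return 1 if the NUL-terminated string is valid UTF-8 by the three
 * rules above, 0 otherwise. Hypothetical helper, for illustration.
 */
static int
utf8_is_valid(const char *str)
{
	const unsigned char *s = (const unsigned char *)str;

	while (*s) {
		unsigned int code, min;
		int trail;

		if (*s < 0x80) {
			s++;		/* ASCII, 0x01..0x7F */
			continue;
		} else if ((*s & 0xE0) == 0xC0) {
			code = *s & 0x1F; min = 0x80; trail = 1;
		} else if ((*s & 0xF0) == 0xE0) {
			code = *s & 0x0F; min = 0x800; trail = 2;
		} else if ((*s & 0xF8) == 0xF0) {
			code = *s & 0x07; min = 0x10000; trail = 3;
		} else {
			return 0;	/* stray continuation or 0xF8..0xFF */
		}
		s++;
		while (trail--) {
			/* A premature NUL fails this test as well. */
			if ((*s & 0xC0) != 0x80)
				return 0;
			code = (code << 6) | (*s++ & 0x3F);
		}
		if (code < min)
			return 0;	/* not the shortest encoding */
		if (code > 0x10FFFF)
			return 0;	/* beyond the unicode range */
		if (code >= 0xD800 && code <= 0xDFFF)
			return 0;	/* surrogate */
	}
	return 1;
}
```

Tracking the minimum value for each sequence length is what enforces the
shortest-encoding rule: the two-byte sequence 0xC0 0x80 decodes to 0, which
is below 0x80, so it is rejected.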

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/Makefile | 19 +
fs/xfs/support/mkutf8data.c | 3239 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/support/utf8norm.c | 641 +++++++++
fs/xfs/support/utf8norm.h | 111 ++
4 files changed, 4010 insertions(+)
create mode 100644 fs/xfs/support/mkutf8data.c
create mode 100644 fs/xfs/support/utf8norm.c
create mode 100644 fs/xfs/support/utf8norm.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d617999..0f7b300 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -92,6 +92,25 @@ xfs-y += xfs_aops.o \
kmem.o \
uuid.o

+# Objects in support/
+xfs-y += support/utf8norm.o
+
+hostprogs-y := support/mkutf8data
+$(obj)/support/utf8norm.o: $(obj)/support/utf8data.h
+$(obj)/support/utf8data.h: $(src)/support/ucd/*.txt
+$(obj)/support/utf8data.h: $(obj)/support/mkutf8data FORCE
+ $(call if_changed,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+ cmd_mkutf8data = $(obj)/support/mkutf8data \
+ -a $(src)/support/ucd/DerivedAge-7.0.0.txt \
+ -c $(src)/support/ucd/DerivedCombiningClass-7.0.0.txt \
+ -p $(src)/support/ucd/DerivedCoreProperties-7.0.0.txt \
+ -d $(src)/support/ucd/UnicodeData-7.0.0.txt \
+ -f $(src)/support/ucd/CaseFolding-7.0.0.txt \
+ -n $(src)/support/ucd/NormalizationCorrections-7.0.0.txt \
+ -t $(src)/support/ucd/NormalizationTest-7.0.0.txt \
+ -o $@
+
# low-level transaction/log code
xfs-y += xfs_log.o \
xfs_log_cil.o \
diff --git a/fs/xfs/support/mkutf8data.c b/fs/xfs/support/mkutf8data.c
new file mode 100644
index 0000000..cff7a1e
--- /dev/null
+++ b/fs/xfs/support/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the input and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so use the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This means the UTF-8
+ * representation of Unicode is never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+ int keylen;
+
+ if (key < 0x80) {
+ keyval[0] = key;
+ keylen = 1;
+ } else if (key < 0x800) {
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_2_BITS;
+ keylen = 2;
+ } else if (key < 0x10000) {
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_3_BITS;
+ keylen = 3;
+ } else if (key < 0x110000) {
+ keyval[3] = key & UTF8_V_MASK;
+ keyval[3] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_4_BITS;
+ keylen = 4;
+ } else {
+ printf("%#x: illegal key\n", key);
+ keylen = 0;
+ }
+ return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x07;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align to a cache line (or half a cache line?). */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfkdi;
+ unsigned int *utf32nfkdicf;
+ char *utf8nfkdi;
+ char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdicf && right->utf8nfkdicf &&
+ strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+ return 1;
+ if (left->utf8nfkdicf && right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdicf || right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdicf)
+ printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+ return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ if (leaf->utf8nfkdicf)
+ return 1;
+ return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdicf)
+ size += strlen(leaf->utf8nfkdicf) + 1;
+ else if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfkdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8key(um[i], u);
+ *u = '\0';
+ data->utf8nfkdi = strdup((char*)utf);
+ }
+ u = utf;
+ um = data->utf32nfkdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8key(um[i], u);
+ *u = '\0';
+ if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+ data->utf8nfkdicf = strdup((char*)utf);
+ }
+}
+
+static void
+utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i != corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfkdi and nfkdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i != corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfkdi_mark;
+ trees[trees_count-2].leaf_mark = nfkdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfkdicf";
+ trees[i].leaf_equal = nfkdicf_equal;
+ trees[i].leaf_print = nfkdicf_print;
+ trees[i].leaf_size = nfkdicf_size;
+ trees[i].leaf_index = nfkdicf_index;
+ trees[i].leaf_emit = nfkdicf_emit;
+
+ trees[i+1].type = "nfkdi";
+ trees[i+1].leaf_equal = nfkdi_equal;
+ trees[i+1].leaf_print = nfkdi_print;
+ trees[i+1].leaf_size = nfkdi_size;
+ trees[i+1].leaf_index = nfkdi_index;
+ trees[i+1].leaf_emit = nfkdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8key(unichar, keyval);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void
+trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfkdi = utf8data + trees[trees_count-1].index;
+ nfkdicf = utf8data + trees[trees_count-2].index;
+
+ nfkdi_tree = &trees[trees_count-1];
+ nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfkdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8key(unichar, key);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfkdi) {
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfkdicf &&
+ !data->utf8nfkdi) {
+ report++;
+ } else if (data->utf8nfkdicf) {
+ if (strcmp(data->utf8nfkdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfkdi);
+ if (leaf) {
+ printf(" age %d ccc %d"
+ " nfkdi -> \"%s\"\n",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void
+trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfkdi:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfkdicf:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFKD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 7.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void
+usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+ int i;
+
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdi);
+ printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = gen %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfkdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (SIndex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * TIndex = (SIndex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ assert(!unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+ }
+ /* Add this decomposition to nfkdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfkdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ /* Check tree before dereferencing it to locate the trie. */
+ if (!tree)
+ return NULL;
+ trie = utf8data + tree->index;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = tree ? tree->maxage : 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age = tree ? tree->maxage : 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8code(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ t = buf3;
+ if (utf8ncursor(&u8c, tree, s, strlen(s)))
+ return -1;
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[strlen(s)] = -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void
+normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfkdi && !*data->utf8nfkdi)
+ ignorables = 1;
+ else
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfkdi_tree) < 0) {
+ printf("\nline %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only utf8norm.c may include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "const unsigned int utf8version = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfkdi_init();
+ nfkdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfkdi_decompose();
+ nfkdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfkdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfkdi_tree);
+ if (verbose > 2)
+ tree_walk(nfkdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
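
A note on the repeated-scan sort described in the utf8byte() comment
above. The following is only a minimal sketch of the idea, not code from
the patch: struct cc_char and ccc_sort() are hypothetical names, and the
input is assumed to be an already-decoded run of non-stopper characters
(CCC 1..254) rather than raw UTF-8. It shows how emitting one combining
class per pass gives a stable sort without a scratch buffer, at the cost
of 1 + (number of distinct CCCs) passes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoded character: one byte plus its CCC. */
struct cc_char {
	char ch;
	int ccc;
};

/*
 * Copy in[0..n) to out[] in ascending CCC order by scanning repeatedly.
 * Each pass emits the characters whose CCC equals the current class and
 * remembers the lowest class above it for the next pass. Characters with
 * equal CCC keep their input order, so the sort is stable.
 * out must have room for n + 1 bytes. Returns the number of passes made.
 */
static int ccc_sort(const struct cc_char *in, size_t n, char *out)
{
	int cur = -1;	/* like MINCCC - 1: the first pass emits nothing */
	int passes = 0;
	size_t o = 0;

	assert(out != NULL);
	for (;;) {
		int next = 255;	/* like MAXCCC + 1: nothing left to emit */
		size_t i;

		for (i = 0; i < n; i++) {
			if (in[i].ccc == cur)
				out[o++] = in[i].ch;
			else if (in[i].ccc > cur && in[i].ccc < next)
				next = in[i].ccc;
		}
		passes++;
		if (next == 255)
			break;
		cur = next;
	}
	out[o] = '\0';
	return passes;
}
```

For the input (a,230) (b,220) (c,230) (d,1) this writes "dbac" and makes
four passes: one to find the lowest class, then one per distinct CCC.
utf8byte() does the same thing incrementally, keeping the restart position
in u8c->ss, u8c->sp and u8c->slen instead of loop variables.
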
diff --git a/fs/xfs/support/utf8norm.c b/fs/xfs/support/utf8norm.c
new file mode 100644
index 0000000..3a8b3ab
--- /dev/null
+++ b/fs/xfs/support/utf8norm.c
@@ -0,0 +1,641 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include "utf8norm.h"
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so have the
+ * same underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ /* Validate the arguments before data is dereferenced. */
+ if (!data || len == 0)
+ return NULL;
+ trie = utf8data + data->offset;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : utf8data_t to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ utf8data_t data,
+ const char *s,
+ size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : utf8data_t to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ utf8data_t data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+ int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+ while (maxage < utf8nfkdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdidata[i].maxage)
+ return NULL;
+ return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+ int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+ while (maxage < utf8nfkdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
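
A note on the "UTF-8 valid ranges" comment in utf8norm.c above. The patch
validates input by walking the trie, so the code below is not how the
patch does it. It is only a sketch that restates the table as plain
arithmetic, which may help when reading the ranges. seq_len() repeats the
expression used by utf8clen(); seq_valid() is a hypothetical helper that
checks one sequence for shortest form, the 0x10FFFF limit and the
surrogate gap. Unlike the string functions in the patch, it treats a
0x00 byte as an ordinary one-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as utf8clen(): count the lead-byte thresholds passed. */
static int seq_len(unsigned char c)
{
	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

/*
 * Hypothetical helper: check the sequence starting at s, reading at most
 * len bytes. Returns the sequence length (1..4) if it is well-formed
 * shortest-form UTF-8 for a code point in 0..0x10FFFF outside the
 * surrogate range, and 0 otherwise.
 */
static int seq_valid(const unsigned char *s, size_t len)
{
	/* Smallest code point that needs an n-byte sequence, indexed by n. */
	static const unsigned int min[] = { 0, 0x0, 0x80, 0x800, 0x10000 };
	unsigned int cp;
	int n;
	int i;

	if (len == 0)
		return 0;
	/* A continuation byte or a 5/6-byte lead cannot start a sequence. */
	if ((s[0] & 0xC0) == 0x80 || s[0] >= 0xF8)
		return 0;
	n = seq_len(s[0]);
	if ((size_t)n > len)
		return 0;
	/* Payload bits of the lead byte: 7, 5, 4 or 3 of them. */
	cp = (n == 1) ? s[0] : (s[0] & (0xFFu >> (n + 1)));
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min[n])
		return 0;	/* not the shortest form */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the 17 planes */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate */
	return n;
}
```

With this helper, 0xC3 0xA9 (U+00E9) and 0xF4 0x8F 0xBF 0xBF (U+10FFFF)
are accepted, while 0xC0 0x80 (overlong), 0xED 0xA0 0x80 (U+D800) and
0xF4 0x90 0x80 0x80 (0x110000) are rejected, matching the bounds in the
table.
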
diff --git a/fs/xfs/support/utf8norm.h b/fs/xfs/support/utf8norm.h
new file mode 100644
index 0000000..6aa3391
--- /dev/null
+++ b/fs/xfs/support/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version;
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ utf8data_t data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
--
1.7.12.4
Ben Myers
2014-09-11 20:49:26 UTC
From: Olaf Weber <***@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the superblock.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.
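
To make the "valid UTF-8 or opaque blob" decision concrete, here is a minimal
userspace sketch of a validity check following the criteria from the cover
letter (shortest encoding, no surrogates, nothing above U+10FFFF). The name
sketch_utf8_valid() is hypothetical and not part of this patch; the patch
itself detects invalid input through utf8nlen()/utf8byte() returning -1.
NUL handling is deliberately left out of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return 1 if the len bytes
 * at s are valid UTF-8 under the cover letter's rules, 0 otherwise.
 */
int sketch_utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);

	while (i < len) {
		unsigned char c = s[i];
		unsigned int cp, min;
		size_t need, k;

		if (c < 0x80) {			/* plain ASCII byte */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {	/* 2-byte sequence */
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {	/* 3-byte sequence */
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {	/* 4-byte sequence */
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or bad lead byte */
		}

		if (len - i <= need)
			return 0;	/* sequence is truncated */
		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return 0;	/* not the shortest encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the unicode range */
		i += need + 1;
	}
	return 1;
}
```

A name for which such a check fails would be hashed and compared as raw
bytes, which is what the blob paths in the nameops below do.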

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/Makefile | 1 +
fs/xfs/libxfs/xfs_dir2.c | 16 +++-
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_utf8.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_utf8.h | 25 +++++
5 files changed, 281 insertions(+), 5 deletions(-)
create mode 100644 fs/xfs/xfs_utf8.c
create mode 100644 fs/xfs/xfs_utf8.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0f7b300..5cc10f5 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -88,6 +88,7 @@ xfs-y += xfs_aops.o \
xfs_symlink.o \
xfs_sysfs.o \
xfs_trans.o \
+ xfs_utf8.o \
xfs_xattr.o \
kmem.o \
uuid.o
diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c
index 84e5ca9..651ff94 100644
--- a/fs/xfs/libxfs/xfs_dir2.c
+++ b/fs/xfs/libxfs/xfs_dir2.c
@@ -35,6 +35,7 @@
#include "xfs_error.h"
#include "xfs_trace.h"
#include "xfs_dinode.h"
+#include "xfs_utf8.h"

struct xfs_name xfs_name_dotdot = { (unsigned char *)"..", 2, XFS_DIR3_FT_DIR };

@@ -156,10 +157,17 @@ xfs_da_mount(
(uint)sizeof(xfs_da_node_entry_t);
dageo->magicpct = (dageo->blksize * 37) / 100;

- if (xfs_sb_version_hasasciici(&mp->m_sb))
- mp->m_dirnameops = &xfs_ascii_ci_nameops;
- else
- mp->m_dirnameops = &xfs_default_nameops;
+ if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+ if (xfs_sb_version_hasasciici(&mp->m_sb))
+ mp->m_dirnameops = &xfs_utf8_ci_nameops;
+ else
+ mp->m_dirnameops = &xfs_utf8_nameops;
+ } else {
+ if (xfs_sb_version_hasasciici(&mp->m_sb))
+ mp->m_dirnameops = &xfs_ascii_ci_nameops;
+ else
+ mp->m_dirnameops = &xfs_default_nameops;
+ }

return 0;
}
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index cea3d64..fbfb1bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1257,7 +1257,7 @@ xfs_setup_inode(
break;
case S_IFDIR:
lockdep_set_class(&ip->i_lock.mr_lock, &xfs_dir_ilock_class);
- if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
+ if (xfs_sb_version_hasci(&XFS_M(inode->i_sb)->m_sb))
inode->i_op = &xfs_dir_ci_inode_operations;
else
inode->i_op = &xfs_dir_inode_operations;
diff --git a/fs/xfs/xfs_utf8.c b/fs/xfs/xfs_utf8.c
new file mode 100644
index 0000000..7c18e43
--- /dev/null
+++ b/fs/xfs/xfs_utf8.c
@@ -0,0 +1,242 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_inum.h"
+#include "xfs_trans.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_mount.h"
+#include "xfs_da_btree.h"
+#include "xfs_format.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode.h"
+#include "xfs_inode_item.h"
+#include "xfs_bmap.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include <support/utf8norm.h>
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ xfs_dahash_t hash;
+ int val;
+
+ nfkdi = utf8nfkdi(utf8version);
+ hash = 0;
+ if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+ goto blob;
+ while ((val = utf8byte(&u8c)) > 0)
+ hash = val ^ rol32(hash, 7);
+ /* In case of error treat the name as a binary blob. */
+ if (val == 0)
+ return hash;
+blob:
+ return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+ struct xfs_da_args *args)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ unsigned char *norm;
+ ssize_t normlen;
+ int c;
+
+ nfkdi = utf8nfkdi(utf8version);
+ /* Failure to normalize is treated as a blob. */
+ if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
+ goto blob;
+ if (utf8ncursor(&u8c, nfkdi, args->name, args->namelen) < 0)
+ goto blob;
+ if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+ return -ENOMEM;
+ args->norm = norm;
+ args->normlen = normlen;
+ while ((c = utf8byte(&u8c)) > 0)
+ *norm++ = c;
+ if (c == 0) {
+ *norm = '\0';
+ args->hashval = xfs_da_hashname(args->norm, args->normlen);
+ return 0;
+ }
+ kmem_free(args->norm);
+blob:
+ args->norm = NULL;
+ args->normlen = -1;
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+ struct xfs_da_args *args,
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ const unsigned char *norm;
+ int c;
+
+ ASSERT(args->norm || args->normlen == -1);
+
+ /* Check for an exact match first. */
+ if (args->namelen == len && memcmp(args->name, name, len) == 0)
+ return XFS_CMP_EXACT;
+ /* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+ if (args->normlen < 0)
+ return XFS_CMP_DIFFERENT;
+ nfkdi = utf8nfkdi(utf8version);
+ if (utf8ncursor(&u8c, nfkdi, name, len) < 0)
+ return XFS_CMP_DIFFERENT;
+ norm = args->norm;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != *norm++)
+ return XFS_CMP_DIFFERENT;
+ if (c < 0 || *norm != '\0')
+ return XFS_CMP_DIFFERENT;
+ return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+ .hashname = xfs_utf8_hashname,
+ .normhash = xfs_utf8_normhash,
+ .compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ xfs_dahash_t hash;
+ int val;
+
+ nfkdicf = utf8nfkdicf(utf8version);
+ hash = 0;
+ if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+ goto blob;
+ while ((val = utf8byte(&u8c)) > 0)
+ hash = val ^ rol32(hash, 7);
+ /* In case of error treat the name as a binary blob. */
+ if (val == 0)
+ return hash;
+blob:
+ return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+ struct xfs_da_args *args)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ unsigned char *norm;
+ ssize_t normlen;
+ int c;
+
+ nfkdicf = utf8nfkdicf(utf8version);
+ /* Failure to normalize is treated as a blob. */
+ if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
+ goto blob;
+ if (utf8ncursor(&u8c, nfkdicf, args->name, args->namelen) < 0)
+ goto blob;
+ if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+ return -ENOMEM;
+ args->norm = norm;
+ args->normlen = normlen;
+ while ((c = utf8byte(&u8c)) > 0)
+ *norm++ = c;
+ if (c == 0) {
+ *norm = '\0';
+ args->hashval = xfs_da_hashname(args->norm, args->normlen);
+ return 0;
+ }
+ kmem_free(args->norm);
+blob:
+ args->norm = NULL;
+ args->normlen = -1;
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+ struct xfs_da_args *args,
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ const unsigned char *norm;
+ int c;
+
+ ASSERT(args->norm || args->normlen == -1);
+
+ /* Check for an exact match first. */
+ if (args->namelen == len && memcmp(args->name, name, len) == 0)
+ return XFS_CMP_EXACT;
+ /* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+ if (args->normlen < 0)
+ return XFS_CMP_DIFFERENT;
+ nfkdicf = utf8nfkdicf(utf8version);
+ if (utf8ncursor(&u8c, nfkdicf, name, len) < 0)
+ return XFS_CMP_DIFFERENT;
+ norm = args->norm;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != *norm++)
+ return XFS_CMP_DIFFERENT;
+ if (c < 0 || *norm != '\0')
+ return XFS_CMP_DIFFERENT;
+ return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+ .hashname = xfs_utf8_ci_hashname,
+ .normhash = xfs_utf8_ci_normhash,
+ .compname = xfs_utf8_ci_compname,
+};
diff --git a/fs/xfs/xfs_utf8.h b/fs/xfs/xfs_utf8.h
new file mode 100644
index 0000000..97b6a91
--- /dev/null
+++ b/fs/xfs/xfs_utf8.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
+#endif /* XFS_UTF8_H */
--
1.7.12.4
Ben Myers
2014-09-11 20:50:23 UTC
From: Olaf Weber <***@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.
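
The rule can be sketched as a small userspace decision function. The SKETCH_*
names and sketch_attr_hash_path() are hypothetical stand-ins for illustration
only (the flag values are placeholders, not taken from this patch); the patch
itself tests args->flags against ATTR_ROOT|ATTR_SECURE.

```c
#include <assert.h>

/* Placeholder flag values for illustration; not taken from the patch. */
#define SKETCH_ATTR_ROOT	0x0002
#define SKETCH_ATTR_SECURE	0x0008

enum sketch_hash_path {
	SKETCH_HASH_RAW,	/* hash the name bytes as stored */
	SKETCH_HASH_NORMALIZED	/* hash the normalized form of the name */
};

/*
 * Decide which hash path an attribute name takes. Only user-visible
 * attributes on a filesystem with the utf8 feature are normalized.
 */
enum sketch_hash_path sketch_attr_hash_path(int has_utf8, int flags)
{
	assert(has_utf8 == 0 || has_utf8 == 1);

	if (!has_utf8)
		return SKETCH_HASH_RAW;
	if (flags & (SKETCH_ATTR_ROOT | SKETCH_ATTR_SECURE))
		return SKETCH_HASH_RAW;
	return SKETCH_HASH_NORMALIZED;
}
```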

Signed-off-by: Olaf Weber <***@sgi.com>
---
fs/xfs/libxfs/xfs_attr.c | 56 ++++++++++++++++++++++++++++++++++++-------
fs/xfs/libxfs/xfs_attr_leaf.c | 11 +++++++--
fs/xfs/xfs_attr_list.c | 11 ++++++++-
fs/xfs/xfs_utf8.c | 7 ++++++
4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 353fb42..68e7ce3 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -83,12 +83,14 @@ xfs_attr_args_init(
const unsigned char *name,
int flags)
{
+ struct xfs_mount *mp = dp->i_mount;
+ int error;

if (!name)
return -EINVAL;

memset(args, 0, sizeof(*args));
- args->geo = dp->i_mount->m_attr_geo;
+ args->geo = mp->m_attr_geo;
args->whichfork = XFS_ATTR_FORK;
args->dp = dp;
args->flags = flags;
@@ -97,7 +99,11 @@ xfs_attr_args_init(
if (args->namelen >= MAXNAMELEN)
return -EFAULT; /* match IRIX behaviour */

- args->hashval = xfs_da_hashname(args->name, args->namelen);
+ if (!xfs_sb_version_hasutf8(&mp->m_sb))
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ else if ((error = mp->m_dirnameops->normhash(args)) != 0)
+ return error;
+
return 0;
}

@@ -154,6 +160,9 @@ xfs_attr_get(
error = xfs_attr_node_get(&args);
xfs_iunlock(ip, lock_mode);

+ if (args.norm)
+ kmem_free(args.norm);
+
*valuelenp = args.valuelen;
return error == -EEXIST ? 0 : error;
}
@@ -216,8 +225,11 @@ xfs_attr_set(
return -EIO;

error = xfs_attr_args_init(&args, dp, name, flags);
- if (error)
+ if (error) {
+ if (args.norm)
+ kmem_free(args.norm);
return error;
+ }

args.value = value;
args.valuelen = valuelen;
@@ -227,8 +239,11 @@ xfs_attr_set(
args.total = xfs_attr_calc_size(&args, &local);

error = xfs_qm_dqattach(dp, 0);
- if (error)
+ if (error) {
+ if (args.norm)
+ kmem_free(args.norm);
return error;
+ }

/*
* If the inode doesn't have an attribute fork, add one.
@@ -239,8 +254,11 @@ xfs_attr_set(
XFS_ATTR_SF_ENTSIZE_BYNAME(args.namelen, valuelen);

error = xfs_bmap_add_attrfork(dp, sf_size, rsvd);
- if (error)
+ if (error) {
+ if (args.norm)
+ kmem_free(args.norm);
return error;
+ }
}

/*
@@ -270,6 +288,8 @@ xfs_attr_set(
error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
if (error) {
xfs_trans_cancel(args.trans, 0);
+ if (args.norm)
+ kmem_free(args.norm);
return error;
}
xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -280,6 +300,8 @@ xfs_attr_set(
if (error) {
xfs_iunlock(dp, XFS_ILOCK_EXCL);
xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+ if (args.norm)
+ kmem_free(args.norm);
return error;
}

@@ -327,6 +349,8 @@ xfs_attr_set(
XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);

+ if (args.norm)
+ kmem_free(args.norm);
return error ? error : err2;
}

@@ -388,7 +412,8 @@ xfs_attr_set(
xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+ if (args.norm)
+ kmem_free(args.norm);
return error;

out:
@@ -397,6 +422,8 @@ out:
XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
}
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free(args.norm);
return error;
}

@@ -425,8 +452,11 @@ xfs_attr_remove(
return -ENOATTR;

error = xfs_attr_args_init(&args, dp, name, flags);
- if (error)
+ if (error) {
+ if (args.norm)
+ kmem_free(args.norm);
return error;
+ }

args.firstblock = &firstblock;
args.flist = &flist;
@@ -439,8 +469,11 @@ xfs_attr_remove(
args.op_flags = XFS_DA_OP_OKNOENT;

error = xfs_qm_dqattach(dp, 0);
- if (error)
+ if (error) {
+ if (args.norm)
+ kmem_free(args.norm);
return error;
+ }

/*
* Start our first transaction of the day.
@@ -466,6 +499,8 @@ xfs_attr_remove(
XFS_ATTRRM_SPACE_RES(mp), 0);
if (error) {
xfs_trans_cancel(args.trans, 0);
+ if (args.norm)
+ kmem_free(args.norm);
return error;
}

@@ -506,6 +541,8 @@ xfs_attr_remove(
xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free(args.norm);

return error;

@@ -515,6 +552,9 @@ out:
XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
}
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free(args.norm);
+
return error;
}

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b1f73db..c991a88 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -661,6 +661,7 @@ int
xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
{
xfs_inode_t *dp;
+ struct xfs_mount *mp;
xfs_attr_shortform_t *sf;
xfs_attr_sf_entry_t *sfe;
xfs_da_args_t nargs;
@@ -673,6 +674,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
trace_xfs_attr_sf_to_leaf(args);

dp = args->dp;
+ mp = dp->i_mount;
ifp = dp->i_afp;
sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
size = be16_to_cpu(sf->hdr.totsize);
@@ -726,13 +728,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
nargs.namelen = sfe->namelen;
nargs.value = &sfe->nameval[nargs.namelen];
nargs.valuelen = sfe->valuelen;
- nargs.hashval = xfs_da_hashname(sfe->nameval,
- sfe->namelen);
nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+ if (!xfs_sb_version_hasutf8(&mp->m_sb))
+ nargs.hashval = xfs_da_hashname(sfe->nameval,
+ sfe->namelen);
+ else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+ goto out;
error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
ASSERT(error == -ENOATTR);
error = xfs_attr3_leaf_add(bp, &nargs);
ASSERT(error != -ENOSPC);
+ if (nargs.norm)
+ kmem_free(nargs.norm);
if (error)
goto out;
sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index 62db83a..4075d54 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -76,12 +76,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
xfs_attr_shortform_t *sf;
xfs_attr_sf_entry_t *sfe;
xfs_inode_t *dp;
+ struct xfs_mount *mp;
int sbsize, nsbuf, count, i;
int error;

ASSERT(context != NULL);
dp = context->dp;
ASSERT(dp != NULL);
+ mp = dp->i_mount;
ASSERT(dp->i_afp != NULL);
sf = (xfs_attr_shortform_t *)dp->i_afp->if_u1.if_data;
ASSERT(sf != NULL);
@@ -154,7 +156,14 @@ xfs_attr_shortform_list(xfs_attr_list_context_t *context)
}

sbp->entno = i;
- sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+ /* ATTR_ROOT and ATTR_SECURE are never normalized. */
+ if (!xfs_sb_version_hasutf8(&mp->m_sb) ||
+ (sfe->flags & (ATTR_ROOT|ATTR_SECURE))) {
+ sbp->hash = xfs_da_hashname(sfe->nameval, sfe->namelen);
+ } else {
+ sbp->hash = mp->m_dirnameops->hashname(sfe->nameval,
+ sfe->namelen);
+ }
sbp->name = sfe->nameval;
sbp->namelen = sfe->namelen;
/* These are bytes, and both on-disk, don't endian-flip */
diff --git a/fs/xfs/xfs_utf8.c b/fs/xfs/xfs_utf8.c
index 7c18e43..8df05fe 100644
--- a/fs/xfs/xfs_utf8.c
+++ b/fs/xfs/xfs_utf8.c
@@ -38,6 +38,7 @@
#include "xfs_inode.h"
#include "xfs_inode_item.h"
#include "xfs_bmap.h"
+#include "xfs_attr.h"
#include "xfs_error.h"
#include "xfs_trace.h"
#include "xfs_utf8.h"
@@ -80,6 +81,9 @@ xfs_utf8_normhash(
ssize_t normlen;
int c;

+ /* Don't normalize system attribute names. */
+ if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+ goto blob;
nfkdi = utf8nfkdi(utf8version);
/* Failure to normalize is treated as a blob. */
if ((normlen = utf8nlen(nfkdi, args->name, args->namelen)) < 0)
@@ -179,6 +183,9 @@ xfs_utf8_ci_normhash(
ssize_t normlen;
int c;

+ /* Don't normalize system attribute names. */
+ if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+ goto blob;
nfkdicf = utf8nfkdicf(utf8version);
/* Failure to normalize is treated as a blob. */
if ((normlen = utf8nlen(nfkdicf, args->name, args->namelen)) < 0)
--
1.7.12.4
Ben Myers
2014-09-11 20:51:46 UTC
From: Olaf Weber <***@sgi.com>

Change the XFS case-insensitive lookup code to return the first match found,
even if it is not an exact match. Whether a filesystem uses case-insensitive
lookups is determined by a superblock bit set during filesystem creation.
This means that normal use cannot create two files that both match the same
filename.
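
As a rough illustration of the new behaviour, the following userspace sketch
scans a list of entries and stops at the first one that matches, exact or
not. All names here are hypothetical, and an ASCII-only case fold stands in
for the real compname callout.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only stand-in for a case-insensitive compname callout. */
int sketch_ci_equal(const char *a, const char *b)
{
	size_t i, len = strlen(a);

	if (len != strlen(b))
		return 0;
	for (i = 0; i < len; i++)
		if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
			return 0;
	return 1;
}

/*
 * Return the index of the first entry that matches name, or -1 if none
 * does. The scan does not continue in search of an exact match.
 */
int sketch_lookup_first(const char *const *entries, size_t n, const char *name)
{
	size_t i;

	assert(entries != NULL && name != NULL);

	for (i = 0; i < n; i++)
		if (sketch_ci_equal(entries[i], name))
			return (int)i;
	return -1;
}
```

With entries { "readme", "README" }, which normal use of a case-insensitive
filesystem cannot produce, a lookup of "README" returns index 0 in this
sketch rather than the exact match at index 1.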

Signed-off-by: Olaf Weber <***@sgi.com>
---
libxfs/xfs_dir2_block.c | 17 ++++-------
libxfs/xfs_dir2_leaf.c | 38 ++++-------------------
libxfs/xfs_dir2_node.c | 80 ++++++++++++++++++-------------------------------
libxfs/xfs_dir2_sf.c | 8 ++---
4 files changed, 44 insertions(+), 99 deletions(-)

diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index cede01f..2880431 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -705,28 +705,21 @@ xfs_dir2_block_lookup_int(
dep = (xfs_dir2_data_entry_t *)
((char *)hdr + xfs_dir2_dataptr_to_off(mp, addr));
/*
- * Compare name and if it's an exact match, return the index
- * and buffer. If it's the first case-insensitive match, store
- * the index and buffer and continue looking for an exact match.
+ * Compare name and if it's a match, return the
+ * index and buffer.
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
*bpp = bp;
*entno = mid;
- if (cmp == XFS_CMP_EXACT)
- return 0;
+ return 0;
}
} while (++mid < be32_to_cpu(btp->count) &&
be32_to_cpu(blp[mid].hashval) == hash);

ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
- /*
- * Here, we can only be doing a lookup (not a rename or replace).
- * If a case-insensitive match was found earlier, return success.
- */
- if (args->cmpresult == XFS_CMP_CASE)
- return 0;
+ ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
/*
* No match, release the buffer and return ENOENT.
*/
diff --git a/libxfs/xfs_dir2_leaf.c b/libxfs/xfs_dir2_leaf.c
index 8e0cbc9..b1901d3 100644
--- a/libxfs/xfs_dir2_leaf.c
+++ b/libxfs/xfs_dir2_leaf.c
@@ -1246,7 +1246,6 @@ xfs_dir2_leaf_lookup_int(
xfs_mount_t *mp; /* filesystem mount point */
xfs_dir2_db_t newdb; /* new data block number */
xfs_trans_t *tp; /* transaction pointer */
- xfs_dir2_db_t cidb = -1; /* case match data block no. */
enum xfs_dacmp cmp; /* name compare result */
struct xfs_dir2_leaf_entry *ents;
struct xfs_dir3_icleaf_hdr leafhdr;
@@ -1307,47 +1306,22 @@ xfs_dir2_leaf_lookup_int(
dep = (xfs_dir2_data_entry_t *)((char *)dbp->b_addr +
xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
/*
- * Compare name and if it's an exact match, return the index
- * and buffer. If it's the first case-insensitive match, store
- * the index and buffer and continue looking for an exact match.
+ * Compare name and if it's a match, return the index
+ * and buffer.
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
*indexp = index;
- /* case exact match: return the current buffer. */
- if (cmp == XFS_CMP_EXACT) {
- *dbpp = dbp;
- return 0;
- }
- cidb = curdb;
+ *dbpp = dbp;
+ return 0;
}
}
ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
- /*
- * Here, we can only be doing a lookup (not a rename or remove).
- * If a case-insensitive match was found earlier, re-read the
- * appropriate data block if required and return it.
- */
- if (args->cmpresult == XFS_CMP_CASE) {
- ASSERT(cidb != -1);
- if (cidb != curdb) {
- xfs_trans_brelse(tp, dbp);
- error = xfs_dir3_data_read(tp, dp,
- xfs_dir2_db_to_da(mp, cidb),
- -1, &dbp);
- if (error) {
- xfs_trans_brelse(tp, lbp);
- return error;
- }
- }
- *dbpp = dbp;
- return 0;
- }
+ ASSERT(args->cmpresult == XFS_CMP_DIFFERENT);
/*
* No match found, return ENOENT.
*/
- ASSERT(cidb == -1);
if (dbp)
xfs_trans_brelse(tp, dbp);
xfs_trans_brelse(tp, lbp);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index 3737e4e..fb27506 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -702,6 +702,7 @@ xfs_dir2_leafn_lookup_for_entry(
xfs_dir2_db_t curdb = -1; /* current data block number */
xfs_dir2_data_entry_t *dep; /* data block entry */
xfs_inode_t *dp; /* incore directory inode */
+ int di = -1; /* data entry index */
int error; /* error return value */
int index; /* leaf entry index */
xfs_dir2_leaf_t *leaf; /* leaf structure */
@@ -733,6 +734,7 @@ xfs_dir2_leafn_lookup_for_entry(
if (state->extravalid) {
curbp = state->extrablk.bp;
curdb = state->extrablk.blkno;
+ di = state->extrablk.index;
}
/*
* Loop over leaf entries with the right hash value.
@@ -757,27 +759,20 @@ xfs_dir2_leafn_lookup_for_entry(
*/
if (newdb != curdb) {
/*
- * If we had a block before that we aren't saving
- * for a CI name, drop it
+ * If we had a block, drop it
*/
- if (curbp && (args->cmpresult == XFS_CMP_DIFFERENT ||
- curdb != state->extrablk.blkno))
+ if (curbp) {
xfs_trans_brelse(tp, curbp);
+ di = -1;
+ }
/*
- * If needing the block that is saved with a CI match,
- * use it otherwise read in the new data block.
+ * Read in the new data block.
*/
- if (args->cmpresult != XFS_CMP_DIFFERENT &&
- newdb == state->extrablk.blkno) {
- ASSERT(state->extravalid);
- curbp = state->extrablk.bp;
- } else {
- error = xfs_dir3_data_read(tp, dp,
- xfs_dir2_db_to_da(mp, newdb),
- -1, &curbp);
- if (error)
- return error;
- }
+ error = xfs_dir3_data_read(tp, dp,
+ xfs_dir2_db_to_da(mp, newdb),
+ -1, &curbp);
+ if (error)
+ return error;
xfs_dir3_data_check(dp, curbp);
curdb = newdb;
}
@@ -787,53 +782,36 @@ xfs_dir2_leafn_lookup_for_entry(
dep = (xfs_dir2_data_entry_t *)((char *)curbp->b_addr +
xfs_dir2_dataptr_to_off(mp, be32_to_cpu(lep->address)));
/*
- * Compare the entry and if it's an exact match, return
- * EEXIST immediately. If it's the first case-insensitive
- * match, store the block & inode number and continue looking.
+ * Compare the entry and if it's a match, return
+ * EEXIST immediately.
*/
cmp = mp->m_dirnameops->compname(args, dep->name, dep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
- /* If there is a CI match block, drop it */
- if (args->cmpresult != XFS_CMP_DIFFERENT &&
- curdb != state->extrablk.blkno)
- xfs_trans_brelse(tp, state->extrablk.bp);
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
args->inumber = be64_to_cpu(dep->inumber);
args->filetype = xfs_dir3_dirent_get_ftype(mp, dep);
- *indexp = index;
- state->extravalid = 1;
- state->extrablk.bp = curbp;
- state->extrablk.blkno = curdb;
- state->extrablk.index = (int)((char *)dep -
- (char *)curbp->b_addr);
- state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
- curbp->b_ops = &xfs_dir3_data_buf_ops;
- xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
- if (cmp == XFS_CMP_EXACT)
- return XFS_ERROR(EEXIST);
+ error = EEXIST;
+ goto out;
}
}
+ /* Didn't find a match */
+ error = ENOENT;
ASSERT(index == leafhdr.count || (args->op_flags & XFS_DA_OP_OKNOENT));
+out:
if (curbp) {
- if (args->cmpresult == XFS_CMP_DIFFERENT) {
- /* Giving back last used data block. */
- state->extravalid = 1;
- state->extrablk.bp = curbp;
- state->extrablk.index = -1;
- state->extrablk.blkno = curdb;
- state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
- curbp->b_ops = &xfs_dir3_data_buf_ops;
- xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
- } else {
- /* If the curbp is not the CI match block, drop it */
- if (state->extrablk.bp != curbp)
- xfs_trans_brelse(tp, curbp);
- }
+ /* Giving back last used data block. */
+ state->extravalid = 1;
+ state->extrablk.bp = curbp;
+ state->extrablk.index = di;
+ state->extrablk.blkno = curdb;
+ state->extrablk.magic = XFS_DIR2_DATA_MAGIC;
+ curbp->b_ops = &xfs_dir3_data_buf_ops;
+ xfs_trans_buf_set_type(tp, curbp, XFS_BLFT_DIR_DATA_BUF);
} else {
state->extravalid = 0;
}
*indexp = index;
- return XFS_ERROR(ENOENT);
+ return XFS_ERROR(error);
}

/*
diff --git a/libxfs/xfs_dir2_sf.c b/libxfs/xfs_dir2_sf.c
index 7580333..7b01d43 100644
--- a/libxfs/xfs_dir2_sf.c
+++ b/libxfs/xfs_dir2_sf.c
@@ -833,13 +833,12 @@ xfs_dir2_sf_lookup(
for (i = 0, sfep = xfs_dir2_sf_firstentry(sfp); i < sfp->count;
i++, sfep = xfs_dir3_sf_nextentry(dp->i_mount, sfp, sfep)) {
/*
- * Compare name and if it's an exact match, return the inode
- * number. If it's the first case-insensitive match, store the
- * inode number and continue looking for an exact match.
+ * Compare name and if it's a match, return the inode
+ * number.
*/
cmp = dp->i_mount->m_dirnameops->compname(args, sfep->name,
sfep->namelen);
- if (cmp != XFS_CMP_DIFFERENT && cmp != args->cmpresult) {
+ if (cmp != XFS_CMP_DIFFERENT) {
args->cmpresult = cmp;
args->inumber = xfs_dir3_sfe_get_ino(dp->i_mount,
sfp, sfep);
@@ -848,6 +847,7 @@ xfs_dir2_sf_lookup(
if (cmp == XFS_CMP_EXACT)
return XFS_ERROR(EEXIST);
ci_sfep = sfep;
+ break;
}
}
ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
--
1.7.12.4
Ben Myers
2014-09-11 20:52:38 UTC
From: Olaf Weber <***@sgi.com>

Rename XFS_CMP_CASE to XFS_CMP_MATCH. With unicode filenames and
normalization, strings that differ byte-for-byte can match on criteria
other than case insensitivity.
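
A small userspace sketch of how the three compare results are consumed,
loosely modelled on xfs_dir_cilookup_result() in the diff below. The sketch_*
names are hypothetical. As in the diff, EEXIST means "an entry was found",
and only a non-exact match with the CI-lookup flag set reports the stored
name back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_cmp {
	SKETCH_CMP_DIFFERENT,	/* names are completely different */
	SKETCH_CMP_EXACT,	/* names are exactly the same */
	SKETCH_CMP_MATCH	/* names match but the bytes differ */
};

/* Map a compare result to the (positive) lookup return code. */
int sketch_lookup_errno(enum sketch_cmp cmp)
{
	return cmp == SKETCH_CMP_DIFFERENT ? ENOENT : EEXIST;
}

/*
 * Pick the name reported to the caller: NULL when nothing matched, the
 * stored name for a non-exact match when want_actual is set, and the
 * name that was asked for otherwise.
 */
const char *sketch_reported_name(enum sketch_cmp cmp, int want_actual,
				 const char *asked, const char *stored)
{
	assert(asked != NULL && stored != NULL);

	if (cmp == SKETCH_CMP_DIFFERENT)
		return NULL;
	if (cmp == SKETCH_CMP_EXACT || !want_actual)
		return asked;
	return stored;
}
```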

Signed-off-by: Olaf Weber <***@sgi.com>
---
include/xfs_da_btree.h | 2 +-
libxfs/xfs_dir2.c | 9 ++++++---
libxfs/xfs_dir2_node.c | 2 +-
3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index e492dca..3d9f9dd 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -34,7 +34,7 @@ struct zone;
enum xfs_dacmp {
XFS_CMP_DIFFERENT, /* names are completely different */
XFS_CMP_EXACT, /* names are exactly the same */
- XFS_CMP_CASE /* names are same but differ in case */
+ XFS_CMP_MATCH /* names are same but differ in encoding */
};

/*
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 4c8c836..57e98a3 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -72,7 +72,7 @@ xfs_ascii_ci_compname(
continue;
if (tolower(args->name[i]) != tolower(name[i]))
return XFS_CMP_DIFFERENT;
- result = XFS_CMP_CASE;
+ result = XFS_CMP_MATCH;
}

return result;
@@ -248,8 +248,11 @@ xfs_dir_cilookup_result(
{
if (args->cmpresult == XFS_CMP_DIFFERENT)
return ENOENT;
- if (args->cmpresult != XFS_CMP_CASE ||
- !(args->op_flags & XFS_DA_OP_CILOOKUP))
+ if (args->cmpresult == XFS_CMP_EXACT)
+ return EEXIST;
+ ASSERT(args->cmpresult == XFS_CMP_MATCH);
+ /* Only dup the found name if XFS_DA_OP_CILOOKUP is set. */
+ if (!(args->op_flags & XFS_DA_OP_CILOOKUP))
return EEXIST;

args->value = kmem_alloc(len, KM_NOFS | KM_MAYFAIL);
diff --git a/libxfs/xfs_dir2_node.c b/libxfs/xfs_dir2_node.c
index fb27506..550ca99 100644
--- a/libxfs/xfs_dir2_node.c
+++ b/libxfs/xfs_dir2_node.c
@@ -2034,7 +2034,7 @@ xfs_dir2_node_lookup(
error = xfs_da3_node_lookup_int(state, &rval);
if (error)
rval = error;
- else if (rval == ENOENT && args->cmpresult == XFS_CMP_CASE) {
+ else if (rval == ENOENT && args->cmpresult == XFS_CMP_MATCH) {
/* If a CI match, dup the actual name and return EEXIST */
xfs_dir2_data_entry_t *dep;
--
1.7.12.4
Ben Myers
2014-09-11 20:53:56 UTC
Permalink
From: Olaf Weber <***@sgi.com>

Add a normhash callout to the xfs_nameops. This callout takes an xfs_da_args
structure as its argument, and calculates a hash value over the name. It may
in the process create a normalized form of the name, and assign that to the
norm/normlen fields in the xfs_da_args structure.

Changes:
The pointer in kmem_free() was type converted to suppress compiler
warnings.
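
A rough sketch of the calling pattern, under stated assumptions: the callout fills in the hash, may allocate a normalized copy of the name, and the caller releases that copy on every exit path. The demo_* names are hypothetical, and unlike xfs_ascii_ci_normhash() in this patch (which only sets the hash), the sketch also stores a lowercased copy to show the norm/normlen side.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for struct xfs_da_args. */
struct demo_args {
	const unsigned char *name;	/* name as supplied by the caller */
	unsigned char *norm;		/* normalized name, may be NULL */
	int namelen;
	int normlen;
	uint32_t hashval;
};

static uint32_t
demo_rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* Returns 0 on success or a positive errno; fills hashval and norm. */
static int
demo_ci_normhash(struct demo_args *args)
{
	uint32_t hash = 0;
	int i;

	assert(args != NULL && args->namelen >= 0);
	args->norm = malloc(args->namelen ? (size_t)args->namelen : 1);
	if (!args->norm)
		return ENOMEM;
	for (i = 0; i < args->namelen; i++) {
		args->norm[i] = (unsigned char)tolower(args->name[i]);
		hash = args->norm[i] ^ demo_rol32(hash, 7);
	}
	args->normlen = args->namelen;
	args->hashval = hash;
	return 0;
}

/* Counterpart of the out_free: label in the callers. */
static void
demo_args_release(struct demo_args *args)
{
	free(args->norm);
	args->norm = NULL;
	args->normlen = 0;
}
```

Because the callout may allocate, the early returns after it in the callers have to become jumps to a common cleanup label, which is what the out_free changes in the diff do.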

Signed-off-by: Olaf Weber <***@sgi.com>
---
include/xfs_da_btree.h | 5 ++++-
libxfs/xfs_da_btree.c | 9 ++++++++
libxfs/xfs_dir2.c | 56 +++++++++++++++++++++++++++++++++++++++-----------
3 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 3d9f9dd..06b50bf 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -42,7 +42,9 @@ enum xfs_dacmp {
*/
typedef struct xfs_da_args {
const __uint8_t *name; /* string (maybe not NULL terminated) */
- int namelen; /* length of string (maybe no NULL) */
+ const __uint8_t *norm; /* normalized name (may be NULL) */
+ int namelen; /* length of string (maybe no NULL) */
+ int normlen; /* length of normalized name */
__uint8_t filetype; /* filetype of inode for directories */
__uint8_t *value; /* set of bytes (maybe contain NULLs) */
int valuelen; /* length of value */
@@ -131,6 +133,7 @@ typedef struct xfs_da_state {
*/
struct xfs_nameops {
xfs_dahash_t (*hashname)(struct xfs_name *);
+ int (*normhash)(struct xfs_da_args *);
enum xfs_dacmp (*compname)(struct xfs_da_args *,
const unsigned char *, int);
};
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index b731b54..eb97317 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -2000,8 +2000,17 @@ xfs_default_hashname(
return xfs_da_hashname(name->name, name->len);
}

+STATIC int
+xfs_da_normhash(
+ struct xfs_da_args *args)
+{
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
const struct xfs_nameops xfs_default_nameops = {
.hashname = xfs_default_hashname,
+ .normhash = xfs_da_normhash,
.compname = xfs_da_compname
};

diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 57e98a3..e52d082 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -54,6 +54,21 @@ xfs_ascii_ci_hashname(
return hash;
}

+STATIC int
+xfs_ascii_ci_normhash(
+ struct xfs_da_args *args)
+{
+ xfs_dahash_t hash;
+ int i;
+
+ for (i = 0, hash = 0; i < args->namelen; i++)
+ hash = tolower(args->name[i]) ^ rol32(hash, 7);
+
+ args->hashval = hash;
+ return 0;
+}
+
+
STATIC enum xfs_dacmp
xfs_ascii_ci_compname(
struct xfs_da_args *args,
@@ -80,6 +95,7 @@ xfs_ascii_ci_compname(

static struct xfs_nameops xfs_ascii_ci_nameops = {
.hashname = xfs_ascii_ci_hashname,
+ .normhash = xfs_ascii_ci_normhash,
.compname = xfs_ascii_ci_compname,
};

@@ -211,7 +227,6 @@ xfs_dir_createname(
args.name = name->name;
args.namelen = name->len;
args.filetype = name->type;
- args.hashval = dp->i_mount->m_dirnameops->hashname(name);
args.inumber = inum;
args.dp = dp;
args.firstblock = first;
@@ -220,19 +235,24 @@ xfs_dir_createname(
args.whichfork = XFS_DATA_FORK;
args.trans = tp;
args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+ return rval;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_addname(&args);
else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_block_addname(&args);
else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_leaf_addname(&args);
else
rval = xfs_dir2_node_addname(&args);
+out_free:
+ if (args.norm)
+ kmem_free((void *)args.norm);
return rval;
}

@@ -289,22 +309,23 @@ xfs_dir_lookup(
args.name = name->name;
args.namelen = name->len;
args.filetype = name->type;
- args.hashval = dp->i_mount->m_dirnameops->hashname(name);
args.dp = dp;
args.whichfork = XFS_DATA_FORK;
args.trans = tp;
args.op_flags = XFS_DA_OP_OKNOENT;
if (ci_name)
args.op_flags |= XFS_DA_OP_CILOOKUP;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+ return rval;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_lookup(&args);
else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_block_lookup(&args);
else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_leaf_lookup(&args);
else
@@ -318,6 +339,9 @@ xfs_dir_lookup(
ci_name->len = args.valuelen;
}
}
+out_free:
+ if (args.norm)
+ kmem_free((void *)args.norm);
return rval;
}

@@ -345,7 +369,6 @@ xfs_dir_removename(
args.name = name->name;
args.namelen = name->len;
args.filetype = name->type;
- args.hashval = dp->i_mount->m_dirnameops->hashname(name);
args.inumber = ino;
args.dp = dp;
args.firstblock = first;
@@ -353,19 +376,24 @@ xfs_dir_removename(
args.total = total;
args.whichfork = XFS_DATA_FORK;
args.trans = tp;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+ return rval;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_removename(&args);
else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_block_removename(&args);
else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_leaf_removename(&args);
else
rval = xfs_dir2_node_removename(&args);
+out_free:
+ if (args.norm)
+ kmem_free((void *)args.norm);
return rval;
}

@@ -395,7 +423,6 @@ xfs_dir_replace(
args.name = name->name;
args.namelen = name->len;
args.filetype = name->type;
- args.hashval = dp->i_mount->m_dirnameops->hashname(name);
args.inumber = inum;
args.dp = dp;
args.firstblock = first;
@@ -403,19 +430,24 @@ xfs_dir_replace(
args.total = total;
args.whichfork = XFS_DATA_FORK;
args.trans = tp;
+ if ((rval = dp->i_mount->m_dirnameops->normhash(&args)))
+ return rval;

if (dp->i_d.di_format == XFS_DINODE_FMT_LOCAL)
rval = xfs_dir2_sf_replace(&args);
else if ((rval = xfs_dir2_isblock(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_block_replace(&args);
else if ((rval = xfs_dir2_isleaf(tp, dp, &v)))
- return rval;
+ goto out_free;
else if (v)
rval = xfs_dir2_leaf_replace(&args);
else
rval = xfs_dir2_node_replace(&args);
+out_free:
+ if (args.norm)
+ kmem_free((void *)args.norm);
return rval;
}
--
1.7.12.4
Ben Myers
2014-09-11 20:55:34 UTC
From: Olaf Weber <***@sgi.com>

With the introduction of the xfs_nameops.normhash callout, all uses of the
hashname callout now occur in places where an xfs_name structure must be
explicitly created just to match the parameter passing convention of this
callout. Change the arguments to a const unsigned char * and int instead.

Signed-off-by: Olaf Weber <***@sgi.com>
---
db/check.c | 6 ++----
include/xfs_da_btree.h | 2 +-
libxfs/xfs_da_btree.c | 9 +--------
libxfs/xfs_dir2.c | 10 ++++++----
libxfs/xfs_dir2_block.c | 5 +----
libxfs/xfs_dir2_data.c | 6 ++----
repair/phase6.c | 2 +-
7 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/db/check.c b/db/check.c
index 4fd9fd0..49359d7 100644
--- a/db/check.c
+++ b/db/check.c
@@ -2212,7 +2212,6 @@ process_data_dir_v2(
int stale = 0;
int tag_err;
__be16 *tagp;
- struct xfs_name xname;

data = iocur_top->data;
block = iocur_top->data;
@@ -2323,9 +2322,8 @@ process_data_dir_v2(
tag_err += be16_to_cpu(*tagp) != (char *)dep - (char *)data;
addr = xfs_dir2_db_off_to_dataptr(mp, db,
(char *)dep - (char *)data);
- xname.name = dep->name;
- xname.len = dep->namelen;
- dir_hash_add(mp->m_dirnameops->hashname(&xname), addr);
+ dir_hash_add(mp->m_dirnameops->hashname(dep->name,
+ dep->namelen), addr);
ptr += xfs_dir3_data_entsize(mp, dep->namelen);
count++;
lastfree = 0;
diff --git a/include/xfs_da_btree.h b/include/xfs_da_btree.h
index 06b50bf..9674bed 100644
--- a/include/xfs_da_btree.h
+++ b/include/xfs_da_btree.h
@@ -132,7 +132,7 @@ typedef struct xfs_da_state {
* Name ops for directory and/or attr name operations
*/
struct xfs_nameops {
- xfs_dahash_t (*hashname)(struct xfs_name *);
+ xfs_dahash_t (*hashname)(const unsigned char *, int);
int (*normhash)(struct xfs_da_args *);
enum xfs_dacmp (*compname)(struct xfs_da_args *,
const unsigned char *, int);
diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index eb97317..7be5eaf 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -1993,13 +1993,6 @@ xfs_da_compname(
XFS_CMP_EXACT : XFS_CMP_DIFFERENT;
}

-static xfs_dahash_t
-xfs_default_hashname(
- struct xfs_name *name)
-{
- return xfs_da_hashname(name->name, name->len);
-}
-
STATIC int
xfs_da_normhash(
struct xfs_da_args *args)
@@ -2009,7 +2002,7 @@ xfs_da_normhash(
}

const struct xfs_nameops xfs_default_nameops = {
- .hashname = xfs_default_hashname,
+ .hashname = xfs_da_hashname,
.normhash = xfs_da_normhash,
.compname = xfs_da_compname
};
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index e52d082..1893931 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -43,13 +43,14 @@ const unsigned char xfs_mode_to_ftype[S_IFMT >> S_SHIFT] = {
*/
STATIC xfs_dahash_t
xfs_ascii_ci_hashname(
- struct xfs_name *name)
+ const unsigned char *name,
+ int len)
{
xfs_dahash_t hash;
int i;

- for (i = 0, hash = 0; i < name->len; i++)
- hash = tolower(name->name[i]) ^ rol32(hash, 7);
+ for (i = 0, hash = 0; i < len; i++)
+ hash = tolower(name[i]) ^ rol32(hash, 7);

return hash;
}
@@ -475,7 +476,8 @@ xfs_dir_canenter(
args.name = name->name;
args.namelen = name->len;
args.filetype = name->type;
- args.hashval = dp->i_mount->m_dirnameops->hashname(name);
+ args.hashval = dp->i_mount->m_dirnameops->hashname(name->name,
+ name->len);
args.dp = dp;
args.whichfork = XFS_DATA_FORK;
args.trans = tp;
diff --git a/libxfs/xfs_dir2_block.c b/libxfs/xfs_dir2_block.c
index 2880431..1a8b5f5 100644
--- a/libxfs/xfs_dir2_block.c
+++ b/libxfs/xfs_dir2_block.c
@@ -1047,7 +1047,6 @@ xfs_dir2_sf_to_block(
xfs_dir2_sf_hdr_t *sfp; /* shortform header */
__be16 *tagp; /* end of data entry */
xfs_trans_t *tp; /* transaction pointer */
- struct xfs_name name;
struct xfs_ifork *ifp;

trace_xfs_dir2_sf_to_block(args);
@@ -1205,10 +1204,8 @@ xfs_dir2_sf_to_block(
tagp = xfs_dir3_data_entry_tag_p(mp, dep);
*tagp = cpu_to_be16((char *)dep - (char *)hdr);
xfs_dir2_data_log_entry(tp, bp, dep);
- name.name = sfep->name;
- name.len = sfep->namelen;
blp[2 + i].hashval = cpu_to_be32(mp->m_dirnameops->
- hashname(&name));
+ hashname(sfep->name, sfep->namelen));
blp[2 + i].address = cpu_to_be32(xfs_dir2_byte_to_dataptr(mp,
(char *)dep - (char *)hdr));
offset = (int)((char *)(tagp + 1) - (char *)hdr);
diff --git a/libxfs/xfs_dir2_data.c b/libxfs/xfs_dir2_data.c
index dc9df4d..9b3f750 100644
--- a/libxfs/xfs_dir2_data.c
+++ b/libxfs/xfs_dir2_data.c
@@ -46,7 +46,6 @@ __xfs_dir3_data_check(
xfs_mount_t *mp; /* filesystem mount point */
char *p; /* current data position */
int stale; /* count of stale leaves */
- struct xfs_name name;

mp = bp->b_target->bt_mount;
hdr = bp->b_addr;
@@ -142,9 +141,8 @@ __xfs_dir3_data_check(
addr = xfs_dir2_db_off_to_dataptr(mp, mp->m_dirdatablk,
(xfs_dir2_data_aoff_t)
((char *)dep - (char *)hdr));
- name.name = dep->name;
- name.len = dep->namelen;
- hash = mp->m_dirnameops->hashname(&name);
+ hash = mp->m_dirnameops->
+ hashname(dep->name, dep->namelen);
for (i = 0; i < be32_to_cpu(btp->count); i++) {
if (be32_to_cpu(lep[i].address) == addr &&
be32_to_cpu(lep[i].hashval) == hash)
diff --git a/repair/phase6.c b/repair/phase6.c
index f13069f..f374fd0 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -195,7 +195,7 @@ dir_hash_add(
dup = 0;

if (!junk) {
- hash = mp->m_dirnameops->hashname(&xname);
+ hash = mp->m_dirnameops->hashname(name, namelen);
byhash = DIR_HASH_FUNC(hashtab, hash);

/*
--
1.7.12.4
Ben Myers
2014-09-11 20:56:31 UTC
From: Olaf Weber <***@sgi.com>

When UTF-8 support is enabled, the xfs_dir_ci_inode_operations must be
installed. Add xfs_sb_version_hasci(), which tests both the borgbit and
the utf8bit, and returns true if at least one of them is set. Replace
calls to xfs_sb_version_hasasciici() as needed.

Signed-off-by: Olaf Weber <***@sgi.com>
---
include/xfs_fs.h | 2 +-
include/xfs_sb.h | 25 ++++++++++++++++++++++++-
2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/xfs_fs.h b/include/xfs_fs.h
index 59c40fc..1be539d 100644
--- a/include/xfs_fs.h
+++ b/include/xfs_fs.h
@@ -239,7 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_V5SB 0x8000 /* version 5 superblock */
#define XFS_FSOP_GEOM_FLAGS_FTYPE 0x10000 /* inode directory types */
#define XFS_FSOP_GEOM_FLAGS_FINOBT 0x20000 /* free inode btree */
-
+#define XFS_FSOP_GEOM_FLAGS_UTF8 0x40000 /* utf8 filenames */

/*
* Minimum and maximum sizes need for growth checks.
diff --git a/include/xfs_sb.h b/include/xfs_sb.h
index 950d1ea..5ac7f06 100644
--- a/include/xfs_sb.h
+++ b/include/xfs_sb.h
@@ -82,6 +82,8 @@ struct xfs_trans;
#define XFS_SB_VERSION2_RESERVED4BIT 0x00000004
#define XFS_SB_VERSION2_ATTR2BIT 0x00000008 /* Inline attr rework */
#define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */
+#define XFS_SB_VERSION2_PARENTBIT 0x00000010 /* parent pointers */
+#define XFS_SB_VERSION2_UTF8BIT 0x00000020 /* utf8 names */
#define XFS_SB_VERSION2_PROJID32BIT 0x00000080 /* 32 bit project id */
#define XFS_SB_VERSION2_CRCBIT 0x00000100 /* metadata CRCs */
#define XFS_SB_VERSION2_FTYPE 0x00000200 /* inode type in dir */
@@ -89,6 +91,7 @@ struct xfs_trans;
#define XFS_SB_VERSION2_OKREALFBITS \
(XFS_SB_VERSION2_LAZYSBCOUNTBIT | \
XFS_SB_VERSION2_ATTR2BIT | \
+ XFS_SB_VERSION2_UTF8BIT | \
XFS_SB_VERSION2_PROJID32BIT | \
XFS_SB_VERSION2_FTYPE)
#define XFS_SB_VERSION2_OKSASHFBITS \
@@ -600,8 +603,10 @@ xfs_sb_has_ro_compat_feature(
}

#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
+#define XFS_SB_FEAT_INCOMPAT_UTF8 (1 << 1) /* utf-8 name support */
#define XFS_SB_FEAT_INCOMPAT_ALL \
- (XFS_SB_FEAT_INCOMPAT_FTYPE)
+ (XFS_SB_FEAT_INCOMPAT_FTYPE | \
+ XFS_SB_FEAT_INCOMPAT_UTF8)

#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool
@@ -649,6 +654,24 @@ static inline int xfs_sb_version_hasfinobt(xfs_sb_t *sbp)
(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FINOBT);
}

+static inline int xfs_sb_version_hasutf8(xfs_sb_t *sbp)
+{
+ return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+ xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UTF8)) ||
+ (xfs_sb_version_hasmorebits(sbp) &&
+ (sbp->sb_features2 & XFS_SB_VERSION2_UTF8BIT));
+}
+
+/*
+ * Special case: there are a number of places where we need to test
+ * both the borgbit and the utf8bit, and take the same action if
+ * either of those is set.
+ */
+static inline int xfs_sb_version_hasci(xfs_sb_t *sbp)
+{
+ return xfs_sb_version_hasasciici(sbp) || xfs_sb_version_hasutf8(sbp);
+}
+
/*
* end of superblock version macros
*/
--
1.7.12.4
Ben Myers
2014-09-11 20:57:51 UTC
From: Olaf Weber <***@sgi.com>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <***@sgi.com>

---
[v2: removed large unicode files. download them as below. -bpm]

cd support/ucd-7.0.0
wget http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
wget http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
wget http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
wget http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
wget http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
---
support/ucd-7.0.0/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 support/ucd-7.0.0/README

diff --git a/support/ucd-7.0.0/README b/support/ucd-7.0.0/README
new file mode 100644
index 0000000..d713e66
--- /dev/null
+++ b/support/ucd-7.0.0/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt
+ 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt
+ 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt
+ c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt
+ 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt
+ 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt
+ c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt
--
1.7.12.4
Ben Myers
2014-09-11 20:59:01 UTC
From: Olaf Weber <***@sgi.com>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c.

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).
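
The three validity criteria can be sketched as a standalone check on a length-limited string. This is not how utf8norm.c does it (there, validation falls out of the trie lookup); demo_utf8_valid() is a hypothetical helper that decodes each sequence by hand so the rules are visible.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the len bytes at s meet the criteria above, 0 otherwise. */
static int
demo_utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp, min;
		size_t n, k;

		/* Lead byte gives the sequence length and its minimum value. */
		if (c < 0x80) {
			cp = c; n = 1; min = 0x1;	/* NUL is not encoded */
		} else if ((c & 0xE0) == 0xC0) {
			cp = c & 0x1F; n = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			cp = c & 0x0F; n = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			cp = c & 0x07; n = 4; min = 0x10000;
		} else {
			return 0;	/* stray continuation or 5/6-byte lead */
		}
		if (n > len - i)
			return 0;	/* truncated sequence */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min || cp > 0x10FFFF)
			return 0;	/* not shortest form, or out of range */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogate */
		i += n;
	}
	return 1;
}
```

A name that fails a check like this would be handled as a binary blob rather than rejected, in line with the design notes at the start of the thread.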

Signed-off-by: Olaf Weber <***@sgi.com>
---
include/utf8norm.h | 111 ++
libxfs/utf8norm.c | 628 ++++++++++
support/mkutf8data.c | 3232 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 3971 insertions(+)
create mode 100644 include/utf8norm.h
create mode 100644 libxfs/utf8norm.c
create mode 100644 support/mkutf8data.c

diff --git a/include/utf8norm.h b/include/utf8norm.h
new file mode 100644
index 0000000..6aa3391
--- /dev/null
+++ b/include/utf8norm.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+/* An opaque type used to determine the normalization in use. */
+typedef const struct utf8data *utf8data_t;
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern const unsigned int utf8version;
+
+/*
+ * Look for the correct utf8data_t for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern utf8data_t utf8nfkdi(unsigned int);
+extern utf8data_t utf8nfkdicf(unsigned int);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(utf8data_t, const char *);
+extern int utf8nagemax(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(utf8data_t, const char *);
+extern int utf8nagemin(utf8data_t, const char *, size_t);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(utf8data_t, const char *);
+extern ssize_t utf8nlen(utf8data_t, const char *, size_t);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ utf8data_t data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *, utf8data_t, const char *);
+extern int utf8ncursor(struct utf8cursor *, utf8data_t, const char *, size_t);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *);
+
+#endif /* UTF8NORM_H */
diff --git a/libxfs/utf8norm.c b/libxfs/utf8norm.c
new file mode 100644
index 0000000..6232d1a
--- /dev/null
+++ b/libxfs/utf8norm.c
@@ -0,0 +1,628 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "xfs.h"
+#include "xfs_types.h"
+#include <utf8norm.h>
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include <utf8data.h>
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so use the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(utf8data_t data, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!data || len == 0)
+ return NULL;
+ /* Only dereference data after the NULL check above. */
+ trie = utf8data + data->offset;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(utf8data_t data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age, leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age = data ? data->maxage : -1;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(utf8data_t data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(data, s)))
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(utf8data_t data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(data, s, len)))
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : utf8data_t to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ utf8data_t data,
+ const char *s,
+ size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be a UTF-8 continuation byte. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : utf8data_t to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ utf8data_t data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
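
The repeated-scan ordering described in the utf8byte() comment can be looked at in isolation. The sketch below is not part of the patch: ccc_scan_sort() and its arguments are hypothetical names, and it works on a plain array of combining classes for one run between two stoppers instead of on a utf8cursor. It only illustrates why the number of passes is bounded by 1 + the number of distinct CCC values, and why the order within one class stays stable.

```c
#include <assert.h>
#include <stddef.h>

#define MAXCCC 254

/*
 * Stable sort of a run of characters by repeated scanning.
 * ccc[] holds the combining class of each character in the run;
 * order[] receives the indices of the characters in emission order.
 * Returns the number of passes made over the run.
 */
static int
ccc_scan_sort(const unsigned char *ccc, size_t n, size_t *order)
{
	int cur = -1;	/* class being emitted; -1 means "none yet" */
	int next;	/* lowest class above cur seen during this pass */
	int passes = 0;
	size_t out = 0;
	size_t i;

	for (;;) {
		next = MAXCCC + 1;
		for (i = 0; i < n; i++) {
			if (ccc[i] == cur)
				order[out++] = i;	/* emit in input order */
			else if (ccc[i] > cur && ccc[i] < next)
				next = ccc[i];		/* candidate for next pass */
		}
		passes++;
		if (next == MAXCCC + 1)
			break;		/* nothing left above cur */
		cur = next;
	}
	assert(out == n);
	return passes;
}
```

The first pass emits nothing and only finds the lowest class, which matches the role of the "first scan of a repeating scan" in the comment above.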
+
+const struct utf8data *
+utf8nfkdi(unsigned int maxage)
+{
+ int i = sizeof(utf8nfkdidata)/sizeof(utf8nfkdidata[0]) - 1;
+
+ while (maxage < utf8nfkdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdidata[i].maxage)
+ return NULL;
+ return &utf8nfkdidata[i];
+}
+
+const struct utf8data *
+utf8nfkdicf(unsigned int maxage)
+{
+ int i = sizeof(utf8nfkdicfdata)/sizeof(utf8nfkdicfdata[0]) - 1;
+
+ while (maxage < utf8nfkdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfkdicfdata[i];
+}
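
As written, utf8nfkdi() and utf8nfkdicf() return a table entry only when the requested version matches an entry exactly, and NULL otherwise. A minimal sketch of that lookup follows, under stated assumptions: struct verdata, vertable[] and find_version() are hypothetical stand-ins, the version numbers in the table are made up, and the explicit i > 0 bound is an addition that the patch does not have (the patch appears to rely on the first table entry having maxage 0).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct utf8data: only the version field. */
struct verdata {
	unsigned int maxage;
};

/* Made-up table in ascending order; entry 0 acts as a sentinel. */
static const struct verdata vertable[] = {
	{ 0 }, { 0x010100 }, { 0x060300 }, { 0x070000 },
};

static const struct verdata *
find_version(unsigned int maxage)
{
	int i = (int)(sizeof(vertable) / sizeof(vertable[0])) - 1;

	/* The downward walk relies on the sentinel in slot 0. */
	assert(vertable[0].maxage == 0);

	/* Walk down from the newest entry to the first one <= maxage. */
	while (i > 0 && maxage < vertable[i].maxage)
		i--;
	/* Only an exact match is accepted. */
	if (maxage > vertable[i].maxage)
		return NULL;
	return &vertable[i];
}
```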
diff --git a/support/mkutf8data.c b/support/mkutf8data.c
new file mode 100644
index 0000000..e5c3507
--- /dev/null
+++ b/support/mkutf8data.c
@@ -0,0 +1,3232 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
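
For illustration only, here is how the packing performed by UNICODE_AGE() behaves. The helper names pack_age(), age_major() and age_minor() are hypothetical and do not appear in the patch. The point is that packed versions compare in release order, which is what the maxage comparisons in the normalization code depend on.

```c
#include <assert.h>

#define MAJ_SHIFT 16
#define MIN_SHIFT 8

/* Pack major.minor.revision into one unsigned int, major in the top bits. */
static unsigned int
pack_age(unsigned int major, unsigned int minor, unsigned int revision)
{
	/* Same limits that age_valid() checks. */
	assert(major <= 0xFFFF && minor <= 0xFF && revision <= 0xFF);
	return (major << MAJ_SHIFT) | (minor << MIN_SHIFT) | revision;
}

static unsigned int
age_major(unsigned int age)
{
	return age >> MAJ_SHIFT;
}

static unsigned int
age_minor(unsigned int age)
{
	return (age >> MIN_SHIFT) & 0xFF;
}
```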
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a right-hand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
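
A small sketch of how the first byte of an internal node splits into the fields listed above. struct nodeinfo and decode_node() are hypothetical helpers written for this note; only the mask values are taken from the patch. Bits 6 and 7 are returned raw because their meaning depends on whether offlen is zero.

```c
#include <assert.h>

#define BITNUM		0x07
#define NEXTBYTE	0x08
#define OFFLEN		0x30
#define OFFLEN_SHIFT	4

struct nodeinfo {
	int bitnum;	/* bit of the current key byte to test */
	int nextbyte;	/* non-zero: advance to the next key byte first */
	int offlen;	/* bytes of offset following the node, 0..3 */
	int branching;	/* non-zero: node has both a left and right child */
	int bit6;	/* RIGHTPATH if !branching, RIGHTNODE otherwise */
	int bit7;	/* TRIENODE if !branching, LEFTNODE otherwise */
};

static struct nodeinfo
decode_node(unsigned char b)
{
	struct nodeinfo info;

	info.bitnum = b & BITNUM;
	info.nextbyte = (b & NEXTBYTE) != 0;
	info.offlen = (b & OFFLEN) >> OFFLEN_SHIFT;
	info.branching = info.offlen != 0;
	info.bit6 = (b & 0x40) != 0;
	info.bit7 = (b & 0x80) != 0;
	assert(info.offlen >= 0 && info.offlen <= 3);
	return info;
}
```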
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
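
To make the leaf layout concrete, the sketch below reads a leaf the way utf8len() does for a single character. leaf_normlen() and its parameters are hypothetical; the age table and leaves used to exercise it are made up. A leaf that is too new for the requested version, or that has an ordinary CCC, contributes the character's own length; a DECOMPOSE leaf contributes the length of its stored decomposition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DECOMPOSE 255

/*
 * leaf[0] = generation (index into agetab), leaf[1] = CCC or DECOMPOSE,
 * leaf + 2 = NUL-terminated decomposition when leaf[1] == DECOMPOSE.
 * clen is the byte length of the character as it appears in the input.
 */
static size_t
leaf_normlen(const unsigned char *leaf, const unsigned int *agetab,
	     unsigned int maxage, size_t clen)
{
	assert(leaf != NULL && agetab != NULL);

	/* Code points newer than maxage are passed through unchanged. */
	if (agetab[leaf[0]] > maxage)
		return clen;
	if (leaf[1] == DECOMPOSE)
		return strlen((const char *)(leaf + 2));
	return clen;
}
```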
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This means the UTF-8
+ * representation of Unicode is never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+ int keylen;
+
+ if (key < 0x80) {
+ keyval[0] = key;
+ keylen = 1;
+ } else if (key < 0x800) {
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_2_BITS;
+ keylen = 2;
+ } else if (key < 0x10000) {
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_3_BITS;
+ keylen = 3;
+ } else if (key < 0x110000) {
+ keyval[3] = key & UTF8_V_MASK;
+ keyval[3] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_4_BITS;
+ keylen = 4;
+ } else {
+ printf("%#x: illegal key\n", key);
+ keylen = 0;
+ }
+ return keylen;
+}
+
+static unsigned int
+utf8code(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x07;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
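
The validity rules from the "UTF8 valid ranges" comment (shortest form only, no surrogates, nothing above 0x10FFFF) can be checked directly on a byte sequence. The function below is a sketch written for this note, not code from the patch, which encodes the same rules in the generated trie. utf8_seqlen() is a hypothetical name; it returns the length of one valid sequence, or 0 if the bytes at s do not start one.

```c
#include <assert.h>
#include <stddef.h>

static int
utf8_seqlen(const unsigned char *s, size_t len)
{
	unsigned int c;
	int n, i;

	assert(s != NULL);
	if (len == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;
	if (s[0] >= 0xC2 && s[0] <= 0xDF) {		/* 0xC0, 0xC1: overlong */
		n = 2;
		c = s[0] & 0x1F;
	} else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
		n = 3;
		c = s[0] & 0x0F;
	} else if (s[0] >= 0xF0 && s[0] <= 0xF4) {	/* above 0xF4: > 0x10FFFF */
		n = 4;
		c = s[0] & 0x07;
	} else {
		return 0;	/* continuation byte or illegal lead byte */
	}
	if (len < (size_t)n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (n == 3 && c < 0x800)
		return 0;	/* overlong 3-byte form */
	if (n == 4 && (c < 0x10000 || c > 0x10FFFF))
		return 0;	/* overlong or out of range */
	if (c >= 0xD800 && c <= 0xDFFF)
		return 0;	/* surrogate */
	return n;
}
```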
+
+static int
+utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
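
The bit numbering set up by alloc_node() runs from bit 7 down to bit 0 of the first key byte, then jumps to bits 15..8 for the second byte, and so on. The sketch below restates that rule on its own. next_bitnum() and key_bit() are hypothetical helpers, and key_bit() indexes the key by bitnum >> 3 instead of advancing a pointer on nextbyte as the patch does.

```c
#include <assert.h>

/* Bit number of the child of a node that tests bitnum. */
static int
next_bitnum(int bitnum)
{
	assert(bitnum >= 0 && bitnum < 32);
	if ((bitnum & 7) == 0)
		return bitnum + 7 + 8;	/* bit 7 of the next key byte */
	return bitnum - 1;
}

/* Value (0 or 1) of the key bit that a node with this bitnum tests. */
static int
key_bit(const unsigned char *key, int bitnum)
{
	assert(bitnum >= 0 && bitnum < 32);
	return (key[bitnum >> 3] >> (bitnum & 7)) & 1;
}
```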
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align to a cache line (or half a cache line?). */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
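
To illustrate the offset encoding shared by size_nodes() and emit(): a node with two children carries one header byte plus a forward offset of one to three bytes, written least-significant byte first. The sketch below isolates just that step. The helper names (offset_len, put_offset, get_offset) are hypothetical and do not appear in the patch, and the decoder is only an assumption about how a reader of the trie would reassemble the value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): number of bytes needed for a
 * forward offset, using the same thresholds as size_nodes() and emit().
 */
static int offset_len(int offset)
{
	assert(offset >= 0 && offset <= 0xffffff);
	if (offset <= 0xff)
		return 1;
	if (offset <= 0xffff)
		return 2;
	return 3;
}

/*
 * Store the offset least-significant byte first, as the while (offlen--)
 * loop in emit() does. Returns the number of bytes written.
 */
static size_t put_offset(unsigned char *data, int offset)
{
	int offlen = offset_len(offset);
	size_t n = 0;

	while (offlen--) {
		data[n++] = offset & 0xff;
		offset >>= 8;
	}
	return n;
}

/* Assumed inverse: rebuild the offset from len little-endian bytes. */
static int get_offset(const unsigned char *data, size_t len)
{
	int offset = 0;

	while (len--)
		offset = (offset << 8) | data[len];
	return offset;
}
```

With this layout the node sizes 2, 3 and 4 computed in size_nodes() are one header byte plus offset_len() bytes, which is why shrinking an offset can shrink a node and forces another indexing pass.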
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfkdi;
+ unsigned int *utf32nfkdicf;
+ char *utf8nfkdi;
+ char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdicf && right->utf8nfkdicf &&
+ strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+ return 1;
+ if (left->utf8nfkdicf && right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdicf || right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdicf)
+ printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+ return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+ if (leaf->utf8nfkdicf)
+ return 1;
+ return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+ return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+ int size = 2;
+ if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+ int size = 2;
+ if (leaf->utf8nfkdicf)
+ size += strlen(leaf->utf8nfkdicf) + 1;
+ else if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+ return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
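
As a rough illustration of the leaf layout produced by the *_size() and *_emit() callbacks above: each leaf is a generation byte followed by either the CCC byte or a marker byte plus a NUL-terminated UTF-8 decomposition. The sketch below is standalone. The names are hypothetical, and SKETCH_DECOMPOSE stands in for the DECOMPOSE constant, which is defined elsewhere in the patch; its value here is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Assumed marker value; the real DECOMPOSE is defined elsewhere. */
#define SKETCH_DECOMPOSE 255

/* Two fixed bytes, plus the decomposition and its NUL if there is one. */
static size_t sketch_leaf_size(const char *decomp)
{
	size_t size = 2;

	if (decomp)
		size += strlen(decomp) + 1;
	return size;
}

/* Write one leaf at data and return the first byte past it. */
static unsigned char *sketch_leaf_emit(unsigned char *data, int gen,
				       int ccc, const char *decomp)
{
	*data++ = (unsigned char)gen;
	if (decomp) {
		size_t len = strlen(decomp) + 1;

		*data++ = SKETCH_DECOMPOSE;
		memcpy(data, decomp, len);	/* copies the NUL too */
		data += len;
	} else {
		*data++ = (unsigned char)ccc;
	}
	return data;
}
```

The point of the sketch is that the size callback and the emit callback must agree byte for byte, since index_nodes() places leaves using the former and emit() advances its index by the same amount after calling the latter.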
+
+static void
+utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfkdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8key(um[i], u);
+ *u = '\0';
+ data->utf8nfkdi = strdup((char*)utf);
+ }
+ u = utf;
+ um = data->utf32nfkdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8key(um[i], u);
+ *u = '\0';
+ if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, (char*)utf))
+ data->utf8nfkdicf = strdup((char*)utf);
+ }
+}
+
+static void
+utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i < corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfkdi and nfkdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i < corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfkdi_mark;
+ trees[trees_count-2].leaf_mark = nfkdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfkdicf";
+ trees[i].leaf_equal = nfkdicf_equal;
+ trees[i].leaf_print = nfkdicf_print;
+ trees[i].leaf_size = nfkdicf_size;
+ trees[i].leaf_index = nfkdicf_index;
+ trees[i].leaf_emit = nfkdicf_emit;
+
+ trees[i+1].type = "nfkdi";
+ trees[i+1].leaf_equal = nfkdi_equal;
+ trees[i+1].leaf_print = nfkdi_print;
+ trees[i+1].leaf_size = nfkdi_size;
+ trees[i+1].leaf_index = nfkdi_index;
+ trees[i+1].leaf_emit = nfkdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8key(unichar, keyval);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void
+trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfkdi = utf8data + trees[trees_count-1].index;
+ nfkdicf = utf8data + trees[trees_count-2].index;
+
+ nfkdi_tree = &trees[trees_count-1];
+ nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfkdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8key(unichar, key);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfkdi) {
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfkdicf &&
+ !data->utf8nfkdi) {
+ report++;
+ } else if (data->utf8nfkdicf) {
+ if (strcmp(data->utf8nfkdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfkdi ? data->utf8nfkdi : "");
+ if (leaf) {
+ printf(" age %d ccc %d"
+ " nfkdi -> \"%s\"\n",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void
+trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfkdi:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfkdicf:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFKD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 7.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void
+usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+ int i;
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdi);
+ printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = gen %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ /* Validate the range before using it to index unicode_data[]. */
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ /* Validate the code point before using it as an index. */
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ /* Validate the range before using it to index unicode_data[]. */
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ /* Validate the code point before using it as an index. */
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfkdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = SIndex % TCount
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * TIndex = SIndex % TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ assert(!unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
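
For reference, the full (LVT) decomposition that hangul_decompose() applies to every syllable can be tried on a single code point with a small standalone function. This is only a sketch: hangul_full_decompose() and the enum names are hypothetical and not part of the patch; the constants are the ones listed in the comment above.

```c
#include <assert.h>

enum {
	SBASE = 0xAC00, LBASE = 0x1100, VBASE = 0x1161, TBASE = 0x11A7,
	VCOUNT = 21, TCOUNT = 28, NCOUNT = VCOUNT * TCOUNT
};

/*
 * Hypothetical helper: write the 0-terminated full decomposition of the
 * Hangul syllable s into out[] and return the number of code points
 * (2 for an LV syllable, 3 for LVT). Returns 0 if s is not a
 * precomposed syllable in AC00..D7A3.
 */
static int hangul_full_decompose(unsigned int s, unsigned int out[4])
{
	unsigned int si, ti;
	int i = 0;

	out[0] = 0;
	if (s < 0xAC00 || s > 0xD7A3)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[i++] = LBASE + si / NCOUNT;			/* leading consonant */
	out[i++] = VBASE + (si % NCOUNT) / TCOUNT;	/* vowel */
	if (ti)
		out[i++] = TBASE + ti;			/* trailing consonant */
	out[i] = 0;
	return i;
}
```

For example, U+D55C has SIndex 10588, which gives LIndex 18, VIndex 0 and TIndex 4, so the result is U+1112 U+1161 U+11AB. U+AC00 has TIndex 0 and therefore decomposes to just U+1100 U+1161.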
+
+static void
+nfkdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+ }
+ /* Add this decomposition to nfkdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfkdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!tree)
+ return NULL;
+ /* Only dereference tree after the NULL check above. */
+ trie = utf8data + tree->index;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : string.
+ * len : length of s.
+ * u8c : pointer to cursor.
+ * tree : tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be a UTF-8 continuation byte. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * tree : tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8code(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ c = strlen(s);
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[c] = -1;
+ t = buf3;
+ if (utf8ncursor(&u8c, tree, s, c))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void
+normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfkdi && !*data->utf8nfkdi)
+ ignorables = 1;
+ else
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfkdi_tree) < 0) {
+ printf("\nline %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only utf8norm.c may include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "const unsigned int utf8version = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfkdi_init();
+ nfkdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfkdi_decompose();
+ nfkdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfkdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfkdi_tree);
+ if (verbose > 2)
+ tree_walk(nfkdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
--
1.7.12.4
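
The Hangul arithmetic used by hangul_decompose() in the patch above can be
checked in isolation. The following is only a sketch: the helper name
hangul_full_decomposition() and the upper-case constant names are made up
for this illustration and are not part of the patch. The constants
themselves are the ones the patch uses (0xAC00, 0x1100, 0x1161, 0x11a7,
21, 28).

```c
#include <assert.h>
#include <stddef.h>

/* Same values as sb, lb, vb, tb, vc, tc, nc in hangul_decompose(). */
enum {
	SBASE = 0xAC00, LBASE = 0x1100, VBASE = 0x1161, TBASE = 0x11A7,
	VCOUNT = 21, TCOUNT = 28,
	NCOUNT = VCOUNT * TCOUNT,	/* 588 */
	SCOUNT = 19 * NCOUNT		/* 11172 precomposed syllables */
};

/*
 * Hypothetical helper (not in the patch): write the full decomposition
 * of the precomposed syllable s to out[] in <L, V, T> order.  Returns
 * the number of jamo written (2 or 3), or 0 if s is not a syllable.
 */
static int hangul_full_decomposition(unsigned int s, unsigned int out[3])
{
	unsigned int si, ti;
	int n = 0;

	assert(out != NULL);
	if (s < SBASE || s >= SBASE + SCOUNT)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[n++] = LBASE + si / NCOUNT;			/* leading consonant */
	out[n++] = VBASE + (si % NCOUNT) / TCOUNT;	/* vowel */
	if (ti)
		out[n++] = TBASE + ti;			/* trailing consonant */
	return n;
}
```

For example, U+D55C has SIndex 10588, which gives LIndex 18, VIndex 0 and
TIndex 4, so the decomposition is <U+1112, U+1161, U+11AB>.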
Ben Myers
2014-09-11 21:00:09 UTC
From: Olaf Weber <***@sgi.com>

The xfs_utf8_nameops use the nfkdi normalization when comparing filenames,
and are installed if the utf8bit is set in the superblock.

The xfs_utf8_ci_nameops use the nfkdicf normalization when comparing
filenames, and are installed if both the utf8bit and the borgbit are set
in the superblock.
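
As a rough illustration of the selection described above, the two feature
bits pick one of four sets of name operations. This is a sketch only: the
DEMO_* names and demo_select_nameops() are invented for the example, and
the bit values are placeholders, not the on-disk values.

```c
#include <assert.h>

/* Placeholder feature bits; the real superblock values differ. */
#define DEMO_FEAT_CI	0x1u	/* stands in for the borgbit */
#define DEMO_FEAT_UTF8	0x2u	/* stands in for the utf8bit */

/* Stand-ins for the four struct xfs_nameops instances. */
enum demo_nameops {
	DEMO_DEFAULT,	/* xfs_default_nameops */
	DEMO_ASCII_CI,	/* xfs_ascii_ci_nameops */
	DEMO_UTF8,	/* xfs_utf8_nameops (nfkdi) */
	DEMO_UTF8_CI	/* xfs_utf8_ci_nameops (nfkdicf) */
};

/* Map the two feature bits to the name operations to install. */
static enum demo_nameops demo_select_nameops(unsigned int features)
{
	static const enum demo_nameops table[2][2] = {
		{ DEMO_DEFAULT, DEMO_ASCII_CI },
		{ DEMO_UTF8,    DEMO_UTF8_CI  },
	};

	assert((features & ~(DEMO_FEAT_CI | DEMO_FEAT_UTF8)) == 0);
	return table[!!(features & DEMO_FEAT_UTF8)][!!(features & DEMO_FEAT_CI)];
}
```

The table form makes it easy to see that the case-insensitive bit keeps its
meaning with or without UTF-8 support; only the normalization behind it
changes.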

Normalized filenames are not stored on disk. Normalization will fail if a
filename is not valid UTF-8, in which case the filename is treated as an
opaque blob.
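
To make the "valid UTF-8 or opaque blob" decision concrete, here is a
minimal standalone checker for the well-formedness rules from the design
notes (code points 0..0x10FFFF, no surrogates, shortest encoding only).
It is a sketch under assumptions: demo_utf8_valid() is a made-up name, it
does not use the patch's trie, it does not check whether a code point is
assigned in the supported unicode version, and it does not model NUL
termination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if s[0..len) is well-formed UTF-8,
 * 0 if the name would have to be treated as a binary blob.
 */
static int demo_utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i];
		uint32_t cp, min;
		size_t n, k;	/* n = number of continuation bytes */

		if (c < 0x80) {			/* ASCII, one byte */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			n = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			n = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			n = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or bad lead byte */
		}
		if (len - i <= n)
			return 0;	/* sequence runs past the end */
		for (k = 1; k <= n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		/* Reject overlong forms, out-of-range values, surrogates. */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return 0;
		i += n + 1;
	}
	return 1;
}
```

A single bad sequence anywhere makes the whole name a blob, which matches
the all-or-nothing rule stated in the design notes.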

Changes:
Type conversion to "(const char *)" added to utf8ncursor() and utf8nlen()
calls.

Signed-off-by: Olaf Weber <***@sgi.com>
---
Makefile | 2 +-
include/libxfs.h | 1 +
include/xfs_utf8.h | 25 ++++++
libxfs/Makefile | 4 +-
libxfs/xfs_dir2.c | 15 +++-
libxfs/xfs_utf8.c | 238 +++++++++++++++++++++++++++++++++++++++++++++++++++++
support/Makefile | 24 ++++++
7 files changed, 303 insertions(+), 6 deletions(-)
create mode 100644 include/xfs_utf8.h
create mode 100644 libxfs/xfs_utf8.c
create mode 100644 support/Makefile

diff --git a/Makefile b/Makefile
index f56aebd..c442da6 100644
--- a/Makefile
+++ b/Makefile
@@ -40,7 +40,7 @@ LDIRDIRT = $(SRCDIR)
LDIRT += $(SRCTAR)
endif

-LIB_SUBDIRS = libxfs libxlog libxcmd libhandle libdisk
+LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk
TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
mdrestore repair rtcp m4 man doc po debian

diff --git a/include/libxfs.h b/include/libxfs.h
index 45a924f..99cb3d9 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -59,6 +59,7 @@
#include <xfs/xfs_btree_trace.h>
#include <xfs/xfs_bmap.h>
#include <xfs/xfs_trace.h>
+#include <xfs_utf8.h>


#ifndef ARRAY_SIZE
diff --git a/include/xfs_utf8.h b/include/xfs_utf8.h
new file mode 100644
index 0000000..97b6a91
--- /dev/null
+++ b/include/xfs_utf8.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef XFS_UTF8_H
+#define XFS_UTF8_H
+
+extern struct xfs_nameops xfs_utf8_nameops;
+extern struct xfs_nameops xfs_utf8_ci_nameops;
+
+#endif /* XFS_UTF8_H */
diff --git a/libxfs/Makefile b/libxfs/Makefile
index ae15a5d..d836027 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -14,6 +14,7 @@ HFILES = xfs.h init.h xfs_dir2_priv.h crc32defs.h crc32table.h
CFILES = cache.c \
crc32.c \
init.c kmem.c logitem.c radix-tree.c rdwr.c trans.c util.c \
+ utf8norm.c \
xfs_alloc.c \
xfs_alloc_btree.c \
xfs_attr.c \
@@ -38,7 +39,8 @@ CFILES = cache.c \
xfs_rtbitmap.c \
xfs_sb.c \
xfs_symlink_remote.c \
- xfs_trans_resv.c
+ xfs_trans_resv.c \
+ xfs_utf8.c

CFILES += $(PKG_PLATFORM).c
PCFILES = darwin.c freebsd.c irix.c linux.c
diff --git a/libxfs/xfs_dir2.c b/libxfs/xfs_dir2.c
index 1893931..6872844 100644
--- a/libxfs/xfs_dir2.c
+++ b/libxfs/xfs_dir2.c
@@ -123,10 +123,17 @@ xfs_dir_mount(
(uint)sizeof(xfs_da_node_entry_t);

mp->m_dir_magicpct = (mp->m_dirblksize * 37) / 100;
- if (xfs_sb_version_hasasciici(&mp->m_sb))
- mp->m_dirnameops = &xfs_ascii_ci_nameops;
- else
- mp->m_dirnameops = &xfs_default_nameops;
+ if (xfs_sb_version_hasutf8(&mp->m_sb)) {
+ if (xfs_sb_version_hasasciici(&mp->m_sb))
+ mp->m_dirnameops = &xfs_utf8_ci_nameops;
+ else
+ mp->m_dirnameops = &xfs_utf8_nameops;
+ } else {
+ if (xfs_sb_version_hasasciici(&mp->m_sb))
+ mp->m_dirnameops = &xfs_ascii_ci_nameops;
+ else
+ mp->m_dirnameops = &xfs_default_nameops;
+ }
}

/*
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
new file mode 100644
index 0000000..f5cc231
--- /dev/null
+++ b/libxfs/xfs_utf8.c
@@ -0,0 +1,238 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_types.h"
+#include "xfs_bit.h"
+#include "xfs_inum.h"
+#include "xfs_sb.h"
+#include "xfs_ag.h"
+#include "xfs_dir2.h"
+#include "xfs_da_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_dinode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_bmap.h"
+#include "xfs_dir2.h"
+#include "xfs_trace.h"
+#include "xfs_utf8.h"
+#include "utf8norm.h"
+
+/*
+ * xfs nameops using nfkdi
+ */
+
+static xfs_dahash_t
+xfs_utf8_hashname(
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ xfs_dahash_t hash;
+ int val;
+
+ nfkdi = utf8nfkdi(utf8version);
+ hash = 0;
+ if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0)
+ goto blob;
+ while ((val = utf8byte(&u8c)) > 0)
+ hash = val ^ rol32(hash, 7);
+ /* Done on success; on error treat the name as a binary blob. */
+ if (val == 0)
+ return hash;
+blob:
+ return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_normhash(
+ struct xfs_da_args *args)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ unsigned char *norm;
+ ssize_t normlen;
+ int c;
+
+ nfkdi = utf8nfkdi(utf8version);
+ /* Failure to normalize is treated as a blob. */
+ if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
+ args->namelen)) < 0)
+ goto blob;
+ if (utf8ncursor(&u8c, nfkdi, (const char *)args->name,
+ args->namelen) < 0)
+ goto blob;
+ if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+ return ENOMEM;
+ args->norm = norm;
+ args->normlen = normlen;
+ while ((c = utf8byte(&u8c)) > 0)
+ *norm++ = c;
+ if (c == 0) {
+ *norm = '\0';
+ args->hashval = xfs_da_hashname(args->norm, args->normlen);
+ return 0;
+ }
+ kmem_free((void *)args->norm);
+blob:
+ args->norm = NULL;
+ args->normlen = -1;
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_compname(
+ struct xfs_da_args *args,
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdi;
+ struct utf8cursor u8c;
+ const char *norm;
+ int c;
+
+ ASSERT(args->norm || args->normlen == -1);
+
+ /* Check for an exact match first. */
+ if (args->namelen == len && memcmp(args->name, name, len) == 0)
+ return XFS_CMP_EXACT;
+ /* xfs_utf8_normhash() set args->normlen to -1 for a blob */
+ if (args->normlen < 0)
+ return XFS_CMP_DIFFERENT;
+ nfkdi = utf8nfkdi(utf8version);
+ if (utf8ncursor(&u8c, nfkdi, (const char *)name, len) < 0)
+ return XFS_CMP_DIFFERENT;
+ norm = (const char *)args->norm;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*norm++)
+ return XFS_CMP_DIFFERENT;
+ if (c < 0 || *norm != '\0')
+ return XFS_CMP_DIFFERENT;
+ return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_nameops = {
+ .hashname = xfs_utf8_hashname,
+ .normhash = xfs_utf8_normhash,
+ .compname = xfs_utf8_compname,
+};
+
+/*
+ * xfs nameops using nfkdicf
+ */
+
+static xfs_dahash_t
+xfs_utf8_ci_hashname(
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ xfs_dahash_t hash;
+ int val;
+
+ nfkdicf = utf8nfkdicf(utf8version);
+ hash = 0;
+ if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0)
+ goto blob;
+ while ((val = utf8byte(&u8c)) > 0)
+ hash = val ^ rol32(hash, 7);
+ /* Done on success; on error treat the name as a binary blob. */
+ if (val == 0)
+ return hash;
+blob:
+ return xfs_da_hashname(name, len);
+}
+
+static int
+xfs_utf8_ci_normhash(
+ struct xfs_da_args *args)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ unsigned char *norm;
+ ssize_t normlen;
+ int c;
+
+ nfkdicf = utf8nfkdicf(utf8version);
+ /* Failure to normalize is treated as a blob. */
+ if ((normlen = utf8nlen(nfkdicf, (const char *)args->name,
+ args->namelen)) < 0)
+ goto blob;
+ if (utf8ncursor(&u8c, nfkdicf, (const char *)args->name,
+ args->namelen) < 0)
+ goto blob;
+ if (!(norm = kmem_alloc(normlen + 1, KM_NOFS|KM_MAYFAIL)))
+ return ENOMEM;
+ args->norm = norm;
+ args->normlen = normlen;
+ while ((c = utf8byte(&u8c)) > 0)
+ *norm++ = c;
+ if (c == 0) {
+ *norm = '\0';
+ args->hashval = xfs_da_hashname(args->norm, args->normlen);
+ return 0;
+ }
+ kmem_free((void *)args->norm);
+blob:
+ args->norm = NULL;
+ args->normlen = -1;
+ args->hashval = xfs_da_hashname(args->name, args->namelen);
+ return 0;
+}
+
+static enum xfs_dacmp
+xfs_utf8_ci_compname(
+ struct xfs_da_args *args,
+ const unsigned char *name,
+ int len)
+{
+ utf8data_t nfkdicf;
+ struct utf8cursor u8c;
+ const unsigned char *norm;
+ int c;
+
+ ASSERT(args->norm || args->normlen == -1);
+
+ /* Check for an exact match first. */
+ if (args->namelen == len && memcmp(args->name, name, len) == 0)
+ return XFS_CMP_EXACT;
+ /* xfs_utf8_ci_normhash() set args->normlen to -1 for a blob */
+ if (args->normlen < 0)
+ return XFS_CMP_DIFFERENT;
+ nfkdicf = utf8nfkdicf(utf8version);
+ if (utf8ncursor(&u8c, nfkdicf, (const char *)name, len) < 0)
+ return XFS_CMP_DIFFERENT;
+ norm = args->norm;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != *norm++)
+ return XFS_CMP_DIFFERENT;
+ if (c < 0 || *norm != '\0')
+ return XFS_CMP_DIFFERENT;
+ return XFS_CMP_MATCH;
+}
+
+struct xfs_nameops xfs_utf8_ci_nameops = {
+ .hashname = xfs_utf8_ci_hashname,
+ .normhash = xfs_utf8_ci_normhash,
+ .compname = xfs_utf8_ci_compname,
+};
diff --git a/support/Makefile b/support/Makefile
new file mode 100644
index 0000000..cade5fe
--- /dev/null
+++ b/support/Makefile
@@ -0,0 +1,24 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+default = ../include/utf8data.h
+
+../include/utf8data.h: mkutf8data.c
+ cc -o mkutf8data mkutf8data.c
+ cd ucd-7.0.0 ; ../mkutf8data
+ mv ucd-7.0.0/utf8data.h ../include
+
+default clean:
+ rm -f mkutf8data ../include/utf8data.h
+
+default install:
+
+default install-dev:
+
+default install-qa:
+
+-include .ltdep
--
1.7.12.4
Ben Myers
2014-09-11 21:01:04 UTC
From: Olaf Weber <***@sgi.com>

Apply the same rules for UTF-8 normalization to the names of user-defined
extended attributes. System attributes are excluded because they are not
user-visible in the first place, and the kernel is expected to know what
it is doing when naming them.
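
The rule can be pictured as a small predicate. This is a sketch only:
demo_attr_uses_normalization() and the DEMO_ATTR_* flags are invented for
the illustration, and treating both placeholder namespaces as "system" is
an assumption, not something taken from the patch.

```c
#include <assert.h>

/* Placeholder namespace flags; not the real XFS attribute flags. */
#define DEMO_ATTR_ROOT		0x1	/* assumed: system-owned namespace */
#define DEMO_ATTR_SECURE	0x2	/* assumed: system-owned namespace */

/*
 * Hypothetical predicate: should this attribute name be compared using
 * UTF-8 normalization?  Only user-defined attributes on a filesystem
 * with the UTF-8 feature qualify.
 */
static int demo_attr_uses_normalization(int fs_has_utf8, int flags)
{
	assert((flags & ~(DEMO_ATTR_ROOT | DEMO_ATTR_SECURE)) == 0);
	if (!fs_has_utf8)
		return 0;
	return (flags & (DEMO_ATTR_ROOT | DEMO_ATTR_SECURE)) == 0;
}
```

Names that do not qualify keep the existing byte-for-byte hash and
comparison.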

Signed-off-by: Olaf Weber <***@sgi.com>
---
libxfs/xfs_attr.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
libxfs/xfs_attr_leaf.c | 11 +++++++++--
libxfs/xfs_utf8.c | 7 +++++++
3 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/libxfs/xfs_attr.c b/libxfs/xfs_attr.c
index 17519d3..c30703b 100644
--- a/libxfs/xfs_attr.c
+++ b/libxfs/xfs_attr.c
@@ -88,8 +88,9 @@ xfs_attr_get_int(
int *valuelenp,
int flags)
{
- xfs_da_args_t args;
- int error;
+ xfs_da_args_t args;
+ struct xfs_mount *mp = ip->i_mount;
+ int error;

if (!xfs_inode_hasattr(ip))
return ENOATTR;
@@ -103,9 +104,12 @@ xfs_attr_get_int(
args.value = value;
args.valuelen = *valuelenp;
args.flags = flags;
- args.hashval = xfs_da_hashname(args.name, args.namelen);
args.dp = ip;
args.whichfork = XFS_ATTR_FORK;
+ if (! xfs_sb_version_hasutf8(&mp->m_sb))
+ args.hashval = xfs_da_hashname(args.name, args.namelen);
+ else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+ return error;

/*
* Decide on what work routines to call based on the inode size.
@@ -118,6 +122,9 @@ xfs_attr_get_int(
error = xfs_attr_node_get(&args);
}

+ if (args.norm)
+ kmem_free((void *)args.norm);
+
/*
* Return the number of bytes in the value to the caller.
*/
@@ -239,12 +246,15 @@ xfs_attr_set_int(
args.value = value;
args.valuelen = valuelen;
args.flags = flags;
- args.hashval = xfs_da_hashname(args.name, args.namelen);
args.dp = dp;
args.firstblock = &firstblock;
args.flist = &flist;
args.whichfork = XFS_ATTR_FORK;
args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT;
+ if (! xfs_sb_version_hasutf8(&mp->m_sb))
+ args.hashval = xfs_da_hashname(args.name, args.namelen);
+ else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+ return error;

/* Size is now blocks for attribute data */
args.total = xfs_attr_calc_size(dp, name->len, valuelen, &local);
@@ -276,6 +286,8 @@ xfs_attr_set_int(
error = xfs_trans_reserve(args.trans, &tres, args.total, 0);
if (error) {
xfs_trans_cancel(args.trans, 0);
+ if (args.norm)
+ kmem_free((void *)args.norm);
return(error);
}
xfs_ilock(dp, XFS_ILOCK_EXCL);
@@ -286,6 +298,8 @@ xfs_attr_set_int(
if (error) {
xfs_iunlock(dp, XFS_ILOCK_EXCL);
xfs_trans_cancel(args.trans, XFS_TRANS_RELEASE_LOG_RES);
+ if (args.norm)
+ kmem_free((void *)args.norm);
return (error);
}

@@ -333,7 +347,8 @@ xfs_attr_set_int(
err2 = xfs_trans_commit(args.trans,
XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
-
+ if (args.norm)
+ kmem_free((void *)args.norm);
return(error == 0 ? err2 : error);
}

@@ -398,6 +413,8 @@ xfs_attr_set_int(
xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free((void *)args.norm);

return(error);

@@ -406,6 +423,9 @@ out:
xfs_trans_cancel(args.trans,
XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free((void *)args.norm);
+
return(error);
}

@@ -452,12 +472,15 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
args.name = name->name;
args.namelen = name->len;
args.flags = flags;
- args.hashval = xfs_da_hashname(args.name, args.namelen);
args.dp = dp;
args.firstblock = &firstblock;
args.flist = &flist;
args.total = 0;
args.whichfork = XFS_ATTR_FORK;
+ if (! xfs_sb_version_hasutf8(&mp->m_sb))
+ args.hashval = xfs_da_hashname(args.name, args.namelen);
+ else if ((error = mp->m_dirnameops->normhash(&args)) != 0)
+ return error;

/*
* we have no control over the attribute names that userspace passes us
@@ -470,8 +493,11 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
* Attach the dquots to the inode.
*/
error = xfs_qm_dqattach(dp, 0);
- if (error)
- return error;
+ if (error) {
+ if (args.norm)
+ kmem_free((void *)args.norm);
+ return error;
+ }

/*
* Start our first transaction of the day.
@@ -497,6 +523,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
XFS_ATTRRM_SPACE_RES(mp), 0);
if (error) {
xfs_trans_cancel(args.trans, 0);
+ if (args.norm)
+ kmem_free((void *)args.norm);
return(error);
}

@@ -546,6 +574,8 @@ xfs_attr_remove_int(xfs_inode_t *dp, struct xfs_name *name, int flags)
xfs_trans_log_inode(args.trans, dp, XFS_ILOG_CORE);
error = xfs_trans_commit(args.trans, XFS_TRANS_RELEASE_LOG_RES);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free((void *)args.norm);

return(error);

@@ -554,6 +584,9 @@ out:
xfs_trans_cancel(args.trans,
XFS_TRANS_RELEASE_LOG_RES|XFS_TRANS_ABORT);
xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (args.norm)
+ kmem_free((void *)args.norm);
+
return(error);
}

diff --git a/libxfs/xfs_attr_leaf.c b/libxfs/xfs_attr_leaf.c
index f7f02ae..052a6a1 100644
--- a/libxfs/xfs_attr_leaf.c
+++ b/libxfs/xfs_attr_leaf.c
@@ -634,6 +634,7 @@ int
xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
{
xfs_inode_t *dp;
+ struct xfs_mount *mp;
xfs_attr_shortform_t *sf;
xfs_attr_sf_entry_t *sfe;
xfs_da_args_t nargs;
@@ -646,6 +647,7 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
trace_xfs_attr_sf_to_leaf(args);

dp = args->dp;
+ mp = dp->i_mount;
ifp = dp->i_afp;
sf = (xfs_attr_shortform_t *)ifp->if_u1.if_data;
size = be16_to_cpu(sf->hdr.totsize);
@@ -698,13 +700,18 @@ xfs_attr_shortform_to_leaf(xfs_da_args_t *args)
nargs.namelen = sfe->namelen;
nargs.value = &sfe->nameval[nargs.namelen];
nargs.valuelen = sfe->valuelen;
- nargs.hashval = xfs_da_hashname(sfe->nameval,
- sfe->namelen);
nargs.flags = XFS_ATTR_NSP_ONDISK_TO_ARGS(sfe->flags);
+ if (! xfs_sb_version_hasutf8(&mp->m_sb))
+ nargs.hashval = xfs_da_hashname(sfe->nameval,
+ sfe->namelen);
+ else if ((error = mp->m_dirnameops->normhash(&nargs)) != 0)
+ goto out;
error = xfs_attr3_leaf_lookup_int(bp, &nargs); /* set a->index */
ASSERT(error == ENOATTR);
error = xfs_attr3_leaf_add(bp, &nargs);
ASSERT(error != ENOSPC);
+ if (nargs.norm)
+ kmem_free((void *)nargs.norm);
if (error)
goto out;
sfe = XFS_ATTR_SF_NEXTENTRY(sfe);
diff --git a/libxfs/xfs_utf8.c b/libxfs/xfs_utf8.c
index f5cc231..5c69591 100644
--- a/libxfs/xfs_utf8.c
+++ b/libxfs/xfs_utf8.c
@@ -31,6 +31,7 @@
#include "xfs_inode_fork.h"
#include "xfs_bmap.h"
#include "xfs_dir2.h"
+#include "xfs_attr_leaf.h"
#include "xfs_trace.h"
#include "xfs_utf8.h"
#include "utf8norm.h"
@@ -72,6 +73,9 @@ xfs_utf8_normhash(
ssize_t normlen;
int c;

+ /* Don't normalize system attribute names. */
+ if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+ goto blob;
nfkdi = utf8nfkdi(utf8version);
/* Failure to normalize is treated as a blob. */
if ((normlen = utf8nlen(nfkdi, (const char *)args->name,
@@ -173,6 +177,9 @@ xfs_utf8_ci_normhash(
ssize_t normlen;
int c;

+ /* Don't normalize system attribute names. */
+ if (args->flags & (ATTR_ROOT|ATTR_SECURE))
+ goto blob;
nfkdicf = utf8nfkdicf(utf8version);
/* Failure to normalize is treated as a blob. */
if ((normlen = utf8nlen(nfkdicf, (const char *)args->name,
--
1.7.12.4
Ben Myers
2014-09-11 21:02:54 UTC
From: Mark Tinguely <***@sgi.com>

Add utf-8 to xfs_growfs and xfs_info.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
growfs/xfs_growfs.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/growfs/xfs_growfs.c b/growfs/xfs_growfs.c
index 8e611b6..6c41803 100644
--- a/growfs/xfs_growfs.c
+++ b/growfs/xfs_growfs.c
@@ -57,7 +57,8 @@ report_info(
int crcs_enabled,
int cimode,
int ftype_enabled,
- int finobt_enabled)
+ int finobt_enabled,
+ int utf8)
{
printf(_(
"meta-data=%-22s isize=%-6u agcount=%u, agsize=%u blks\n"
@@ -65,7 +66,7 @@ report_info(
" =%-22s crc=%-8u finobt=%u\n"
"data =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
" =%-22s sunit=%-6u swidth=%u blks\n"
- "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+ "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n"
"log =%-22s bsize=%-6u blocks=%u, version=%u\n"
" =%-22s sectsz=%-5u sunit=%u blks, lazy-count=%u\n"
"realtime =%-22s extsz=%-6u blocks=%llu, rtextents=%llu\n"),
@@ -76,7 +77,7 @@ report_info(
"", geo.blocksize, (unsigned long long)geo.datablocks,
geo.imaxpct,
"", geo.sunit, geo.swidth,
- dirversion, geo.dirblocksize, cimode, ftype_enabled,
+ dirversion, geo.dirblocksize, cimode, ftype_enabled, utf8,
isint ? _("internal") : logname ? logname : _("external"),
geo.blocksize, geo.logblocks, logversion,
"", geo.logsectsize, geo.logsunit / geo.blocksize, lazycount,
@@ -114,6 +115,7 @@ main(int argc, char **argv)
long long rsize; /* new rt size in fs blocks */
int ci; /* ASCII case-insensitive fs */
int lazycount; /* lazy superblock counters */
+ int utf8; /* Unicode chars supported */
int xflag; /* -x flag */
char *fname; /* mount point name */
char *datadev; /* data device name */
@@ -247,11 +249,12 @@ main(int argc, char **argv)
crcs_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_V5SB ? 1 : 0;
ftype_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FTYPE ? 1 : 0;
finobt_enabled = geo.flags & XFS_FSOP_GEOM_FLAGS_FINOBT ? 1 : 0;
+ utf8 = geo.flags & XFS_FSOP_GEOM_FLAGS_UTF8 ? 1 : 0;
if (nflag) {
report_info(geo, datadev, isint, logdev, rtdev,
lazycount, dirversion, logversion,
attrversion, projid32bit, crcs_enabled, ci,
- ftype_enabled, finobt_enabled);
+ ftype_enabled, finobt_enabled, utf8);
exit(0);
}

@@ -289,7 +292,7 @@ main(int argc, char **argv)
report_info(geo, datadev, isint, logdev, rtdev,
lazycount, dirversion, logversion,
attrversion, projid32bit, crcs_enabled, ci, ftype_enabled,
- finobt_enabled);
+ finobt_enabled, utf8);

ddsize = xi.dsize;
dlsize = ( xi.logBBsize? xi.logBBsize :
--
1.7.12.4
Ben Myers
2014-09-11 21:03:43 UTC
From: Mark Tinguely <***@sgi.com>

Set the utf-8 feature bit.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
man/man8/mkfs.xfs.8 | 9 ++++++++-
mkfs/xfs_mkfs.c | 27 ++++++++++++++++++++++-----
mkfs/xfs_mkfs.h | 3 ++-
3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8
index ad9ff3d..aa43cf5 100644
--- a/man/man8/mkfs.xfs.8
+++ b/man/man8/mkfs.xfs.8
@@ -558,7 +558,7 @@ any power of 2 size from the filesystem block size up to 65536.
.IP
The
.B version=ci
-option enables ASCII only case-insensitive filename lookup and version
+option enables ASCII or UTF-8 case-insensitive filename lookup and version
2 directories. Filenames are case-preserving, that is, the names
are stored in directories using the case they were created with.
.IP
@@ -582,6 +582,13 @@ When CRCs are enabled via
the ftype functionality is always enabled. This feature can not be turned
off for such filesystem configurations.
.IP
+.TP
+.BI utf8[= value ]
+This is used to enable the UTF-8 character set support. The
+.I value
+is either 0 or 1, with 1 signifying that UTF-8 character support is to be
+enabled. If the value is omitted, 1 is assumed.
+.IP
.RE
.TP
.BI \-p " protofile"
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index c85258a..1829e51 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -149,6 +149,8 @@ char *nopts[] = {
"version",
#define N_FTYPE 3
"ftype",
+#define N_UTF8 4
+ "utf8",
NULL,
};

@@ -958,6 +960,7 @@ main(
int nsflag;
int nvflag;
int nci;
+ int utf8;
int Nflag;
int discard = 1;
char *p;
@@ -1004,6 +1007,7 @@ main(
logagno = logblocks = rtblocks = rtextblocks = 0;
Nflag = nlflag = nsflag = nvflag = nci = 0;
nftype = dirftype = 0; /* inode type information in the dir */
+ utf8 = 0; /* utf-8 support */
dirblocklog = dirblocksize = 0;
dirversion = XFS_DFL_DIR_VERSION;
qflag = 0;
@@ -1565,7 +1569,8 @@ _("cannot specify both crc and ftype\n"));
if (nvflag)
respec('n', nopts, N_VERSION);
if (!strcasecmp(value, "ci")) {
- nci = 1; /* ASCII CI mode */
+ /* ASCII or UTF-8 CI mode */
+ nci = 1;
} else {
dirversion = atoi(value);
if (dirversion != 2)
@@ -1587,6 +1592,14 @@ _("cannot specify both crc and ftype\n"));
}
nftype = 1;
break;
+ case N_UTF8:
+ if (!value || *value == '\0')
+ value = "1";
+ c = atoi(value);
+ if (c < 0 || c > 1)
+ illegal(value, "n utf8");
+ utf8 = c;
+ break;
default:
unknown('n', value);
}
@@ -2460,7 +2473,8 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
*/
sbp->sb_features2 = XFS_SB_VERSION2_MKFS(crcs_enabled, lazy_sb_counters,
attrversion == 2, !projid16bit, 0,
- (!crcs_enabled && dirftype));
+ (!crcs_enabled && dirftype),
+ (!crcs_enabled && utf8));
sbp->sb_versionnum = XFS_SB_VERSION_MKFS(crcs_enabled, iaflag,
dsunit != 0,
logversion == 2, attrversion == 1,
@@ -2534,6 +2548,9 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
if (crcs_enabled) {
sbp->sb_features_incompat = XFS_SB_FEAT_INCOMPAT_FTYPE;
dirftype = 1;
+ /* turn on the utf-8 support */
+ if (utf8)
+ sbp->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_UTF8;
}

if (!qflag || Nflag) {
@@ -2543,7 +2560,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
" =%-22s crc=%-8u finobt=%u\n"
"data =%-22s bsize=%-6u blocks=%llu, imaxpct=%u\n"
" =%-22s sunit=%-6u swidth=%u blks\n"
- "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d\n"
+ "naming =version %-14u bsize=%-6u ascii-ci=%d ftype=%d utf8=%d\n"
"log =%-22s bsize=%-6d blocks=%lld, version=%d\n"
" =%-22s sectsz=%-5u sunit=%d blks, lazy-count=%d\n"
"realtime =%-22s extsz=%-6d blocks=%lld, rtextents=%lld\n"),
@@ -2552,7 +2569,7 @@ _("size %s specified for log subvolume is too large, maximum is %lld blocks\n"),
"", crcs_enabled, finobt,
"", blocksize, (long long)dblocks, imaxpct,
"", dsunit, dswidth,
- dirversion, dirblocksize, nci, dirftype,
+ dirversion, dirblocksize, nci, dirftype, utf8,
logfile, 1 << blocklog, (long long)logblocks,
logversion, "", lsectorsize, lsunit, lazy_sb_counters,
rtfile, rtextblocks << blocklog,
@@ -3171,7 +3188,7 @@ usage( void )
sunit=value|su=num,sectlog=n|sectsize=num,\n\
lazy-count=0|1]\n\
/* label */ [-L label (maximum 12 characters)]\n\
-/* naming */ [-n log=n|size=num,version=2|ci,ftype=0|1]\n\
+/* naming */ [-n log=n|size=num,version=2|ci,ftype=0|1,utf8=0|1]\n\
/* no-op info only */ [-N]\n\
/* prototype file */ [-p fname]\n\
/* quiet */ [-q]\n\
diff --git a/mkfs/xfs_mkfs.h b/mkfs/xfs_mkfs.h
index 9df5f37..f40b284 100644
--- a/mkfs/xfs_mkfs.h
+++ b/mkfs/xfs_mkfs.h
@@ -37,13 +37,14 @@
0 ) : XFS_SB_VERSION_1 )

#define XFS_SB_VERSION2_MKFS(crc, lazycount, attr2, projid32bit, parent, \
- ftype) (\
+ ftype, utf8) (\
((lazycount) ? XFS_SB_VERSION2_LAZYSBCOUNTBIT : 0) | \
((attr2) ? XFS_SB_VERSION2_ATTR2BIT : 0) | \
((projid32bit) ? XFS_SB_VERSION2_PROJID32BIT : 0) | \
((parent) ? XFS_SB_VERSION2_PARENTBIT : 0) | \
((crc) ? XFS_SB_VERSION2_CRCBIT : 0) | \
((ftype) ? XFS_SB_VERSION2_FTYPE : 0) | \
+ ((utf8) ? XFS_SB_VERSION2_UTF8BIT : 0) | \
0 )

#define XFS_DFL_BLOCKSIZE_LOG 12 /* 4096 byte blocks */
--
1.7.12.4
Ben Myers
2014-09-11 21:04:22 UTC
From: Mark Tinguely <***@sgi.com>

Fix the duplicate filename detection to use the utf-8 normalization
routines.

Signed-off-by: Mark Tinguely <***@sgi.com>
---
repair/phase6.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/repair/phase6.c b/repair/phase6.c
index f374fd0..eb3ea35 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -176,13 +176,15 @@ dir_hash_add(
unsigned char *name,
__uint8_t ftype)
{
- xfs_dahash_t hash = 0;
int byaddr;
int byhash = 0;
dir_hash_ent_t *p;
int dup;
short junk;
struct xfs_name xname;
+ xfs_da_args_t args;
+
+ memset(&args, 0, sizeof(xfs_da_args_t));

ASSERT(!hashtab->names_duped);

@@ -195,19 +197,30 @@ dir_hash_add(
dup = 0;

if (!junk) {
- hash = mp->m_dirnameops->hashname(name, namelen);
- byhash = DIR_HASH_FUNC(hashtab, hash);
+ int error;
+
+ args.name = name;
+ args.namelen = namelen;
+ args.inumber = inum;
+ args.whichfork = XFS_DATA_FORK;
+
+ error = mp->m_dirnameops->normhash(&args);
+ if (error)
+ do_error(_("normalize has failed (%d)\n"), error);
+
+ byhash = DIR_HASH_FUNC(hashtab, args.hashval);

/*
* search hash bucket for existing name.
*/
for (p = hashtab->byhash[byhash]; p; p = p->nextbyhash) {
- if (p->hashval == hash && p->name.len == namelen) {
- if (memcmp(p->name.name, name, namelen) == 0) {
- dup = 1;
- junk = 1;
- break;
- }
+ if (p->hashval == args.hashval &&
+ mp->m_dirnameops->compname(&args, p->name.name,
+ p->name.len) !=
+ XFS_CMP_DIFFERENT) {
+ dup = 1;
+ junk = 1;
+ break;
}
}
}
@@ -226,7 +239,7 @@ dir_hash_add(
hashtab->last = p;

if (!(p->junkit = junk)) {
- p->hashval = hash;
+ p->hashval = args.hashval;
p->nextbyhash = hashtab->byhash[byhash];
hashtab->byhash[byhash] = p;
}
@@ -235,6 +248,8 @@ dir_hash_add(
p->seen = 0;
p->name = xname;

+ if (args.norm)
+ kmem_free((void *) args.norm);
return !dup;
}
--
1.7.12.4
Ben Myers
2014-09-11 21:06:17 UTC
From: Ben Myers <***@sgi.com>

Here's a preliminary test for utf8 support in xfs. It is based on
Olaf's code that does some testing in the trie generator. Here too we
are using the NormalizationTest.txt file from the unicode distribution.
We check that the normalization in libxfs is working and then run checks
on a filesystem. Note that there are some 'blacklisted' unichars which
normalize to reserved characters.

FIXME:

For convenience of build, this patch is against xfsprogs, which gives it
access to libxfs. Handling of ignorables and case folding is also not
implemented here.

---
Makefile | 2 +-
chkutf8data/Makefile | 21 +++
chkutf8data/chkutf8data.c | 430 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 452 insertions(+), 1 deletion(-)
create mode 100644 chkutf8data/Makefile
create mode 100644 chkutf8data/chkutf8data.c

diff --git a/Makefile b/Makefile
index c442da6..d4c0a23 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,7 @@ endif

LIB_SUBDIRS = support libxfs libxlog libxcmd libhandle libdisk
TOOL_SUBDIRS = copy db estimate fsck fsr growfs io logprint mkfs quota \
- mdrestore repair rtcp m4 man doc po debian
+ mdrestore repair rtcp m4 man doc po debian chkutf8data

SUBDIRS = include $(LIB_SUBDIRS) $(TOOL_SUBDIRS)

diff --git a/chkutf8data/Makefile b/chkutf8data/Makefile
new file mode 100644
index 0000000..6ce5706
--- /dev/null
+++ b/chkutf8data/Makefile
@@ -0,0 +1,21 @@
+#
+# Copyright (c) 2014 SGI. All Rights Reserved.
+#
+
+TOPDIR = ..
+include $(TOPDIR)/include/builddefs
+
+LTCOMMAND = chkutf8data
+CFILES = chkutf8data.c
+
+LLDLIBS = $(LIBXFS)
+LTDEPENDENCIES = $(LIBXFS)
+LLDFLAGS = -static
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: default
+
+-include .ltdep
diff --git a/chkutf8data/chkutf8data.c b/chkutf8data/chkutf8data.c
new file mode 100644
index 0000000..487cf1e
--- /dev/null
+++ b/chkutf8data/chkutf8data.c
@@ -0,0 +1,430 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include "utf8norm.h"
+
+#define FOLD_NAME "CaseFolding.txt"
+#define TEST_NAME "NormalizationTest.txt"
+
+const char *fold_name = FOLD_NAME;
+const char *test_name = TEST_NAME;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+char buf4[LINESIZE];
+char buf5[LINESIZE];
+
+const char *mtpt;
+int verbose = 0;
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+ printf("The input files:\n");
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n\n");
+ printf("\t-m mtpt\n");
+ printf("\t-v (verbose)\n");
+ printf("\t-h (help)\n");
+ printf("\n");
+}
+
+static void
+usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This means the UTF-8
+ * representation of Unicode is never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int
+utf8key(unsigned int key, char keyval[])
+{
+ int keylen;
+
+ if (key < 0x80) {
+ keyval[0] = key;
+ keylen = 1;
+ } else if (key < 0x800) {
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_2_BITS;
+ keylen = 2;
+ } else if (key < 0x10000) {
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_3_BITS;
+ keylen = 3;
+ } else if (key < 0x110000) {
+ keyval[3] = key & UTF8_V_MASK;
+ keyval[3] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[2] = key & UTF8_V_MASK;
+ keyval[2] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[1] = key & UTF8_V_MASK;
+ keyval[1] |= UTF8_N_BITS;
+ key >>= UTF8_V_SHIFT;
+ keyval[0] = key;
+ keyval[0] |= UTF8_4_BITS;
+ keylen = 4;
+ } else {
+ printf("%#x: illegal key\n", key);
+ keylen = 0;
+ }
+ return keylen;
+}
+
+static int
+normalize_line(utf8data_t tree, char *s, char *t)
+{
+ struct utf8cursor u8c;
+
+ if (utf8cursor(&u8c, tree, s)) {
+ printf("%s return utf8cursor failed\n", __func__);
+ return -1;
+ }
+
+ while ((*t = utf8byte(&u8c)) > 0)
+ t++;
+
+ if (*t < 0) {
+ printf("%s return error %d\n", __func__, *t);
+ return -1;
+ }
+ if (*t != 0) {
+ printf("%s return t not 0\n", __func__);
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+test_key(char *source,
+ char *NFC,
+ char *NFD,
+ char *NFKC,
+ char *NFKD)
+{
+ int fd;
+ int error;
+
+ if (verbose)
+ printf("Testing %s -> %s\n", source, NFKD);
+
+ error = chdir(mtpt); /* XXX hardcoded mount point */
+ if (error) {
+ perror(mtpt);
+ exit(-1);
+ }
+
+ /* the initial create should succeed */
+ if (verbose)
+ printf("Initial create %s... ", source);
+ fd = open(source, O_CREAT|O_EXCL, 0);
+ if (fd < 0) {
+ /* fd is invalid here, so there is nothing to close */
+ perror(source);
+ printf("Failed to create %s XXX\n", source);
+ exit(-1);
+ }
+ close(fd);
+ if (verbose)
+ printf("Success\n");
+
+ /* a second create should fail */
+ if (verbose)
+ printf("Second create %s (should return EEXIST)... ", NFKD);
+ fd = open(NFKD, O_CREAT|O_EXCL, 0);
+ if (fd >= 0) {
+ /* open() succeeded, so errno is not meaningful here */
+ printf("Test Failed. Was able to create %s XXX\n", NFKD);
+ close(fd);
+ exit(-1);
+ }
+ /* fd is -1 on the expected failure; nothing to close */
+ if (verbose)
+ printf("EEXIST\n");
+
+ error = unlink(NFKD);
+ if (error) {
+ printf("Unlink failed\n");
+ perror(NFKD);
+ exit(-1);
+ }
+}
+
+int
+blacklisted(unsigned int unichar)
+{
+ /* these unichars normalize to characters we don't allow */
+ unsigned int list[] = { 0x2024 /* . */,
+ 0x2025 /* .. */,
+ 0x2100 /* a/c */,
+ 0x2101 /* a/s */,
+ 0x2105 /* c/o */,
+ 0x2106 /* c/u */,
+ 0xFE30 /* .. */,
+ 0xFE52 /* . */,
+ 0xFF0E /* . */,
+ 0xFF0F /* / */};
+ int i;
+
+ for (i = 0; i < (int)(sizeof(list) / sizeof(list[0])); i++) {
+ if (list[i] == unichar)
+ return 1;
+ }
+ return 0;
+}
+
+static void
+normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ char *s;
+ char *t;
+ int ret;
+ int tests = 0;
+ int failures = 0;
+ char source[LINESIZE];
+ char NFKD[LINESIZE];
+ int skip;
+ utf8data_t nfkdi = utf8nfkdi(utf8version);
+
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ source, NFKD);
+ //NFC, NFD, NFKC, NFKD);
+ if (ret != 2 || *line == '#')
+ continue;
+
+ s = source;
+ t = buf2;
+ skip = 0;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ if (blacklisted(unichar))
+ skip++;
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ if (skip)
+ continue;
+
+ s = NFKD;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8key(unichar, t);
+ }
+ *t = '\0';
+
+ /* normalize source */
+ if (normalize_line(nfkdi, buf2, buf4) < 0) {
+ printf("normalize_line for unichar %s Failed\n", buf2);
+ exit(1);
+ }
+ if (verbose)
+ printf("(%s) %s normalized to %s... ",
+ source, buf2, buf4);
+
+ /* does it match NFKD? */
+ tests++;
+ if (strcmp(buf4, buf3)) {
+ if (verbose)
+ printf("Fail!\n");
+ failures++;
+ } else {
+ if (verbose)
+ printf("Correct!\n");
+ }
+
+ /* normalize NFKD */
+ if (normalize_line(nfkdi, buf3, buf5) < 0) {
+ printf("normalize_line for unichar %s Failed\n",
+ buf3);
+ exit(1);
+ }
+ if (verbose)
+ printf("(%s) %s normalized to %s... ",
+ NFKD, buf3, buf5);
+
+ /* does it normalize to itself? */
+ tests++;
+ if (strcmp(buf5, buf3)) {
+ if (verbose)
+ printf("Fail!\n");
+ failures++;
+ } else {
+ if (verbose)
+ printf("Correct!\n");
+ }
+
+ /* XXX ignorables need to be taken into account? */
+ test_key(buf2, NULL, NULL, NULL, buf3);
+ }
+ fclose(file);
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+int
+main(int argc, char *argv[])
+{
+ int opt;
+
+ while ((opt = getopt(argc, argv, "f:t:m:vh")) != -1) {
+ switch (opt) {
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'm':
+ mtpt = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (!test_name || !mtpt) {
+ usage();
+ exit(-1);
+ }
+
+ normalization_test();
+
+ return 0;
+}
--
1.7.12.4
Dave Chinner
2014-09-12 10:02:30 UTC
Post by Ben Myers
Hi,
I'm posting this RFC on Olaf's behalf, as he is busy with other projects.
Ok, but I'd prefer to have Olaf discuss the finer points rather than
have to play chinese whispers through you. :/
Post by Ben Myers
First is a series of kernel patches, then a series of patches for
xfsprogs, and then a test.
Seeing as this is something out of the blue (i.e. nobody has made a
mention of this functionality in the past couple of years), I think
we need to look at design and architecture first before spending any
time commenting on the code.
Post by Ben Myers
Note that I have removed the unicode database files prior to posting due
to their large size. There are instructions on how to download them in
the relevant commit headers.
Which leads to an interesting issue: these files do not have
cryptographically verifiable signatures. How can I trust them? I
can't even access unicode.org via https, so I can't even be certain
that I'm downloading from the site I think I'm downloading from....
Post by Ben Myers
-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS
So we had a customer request proper unicode support...
Design notes.
XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question what criteria a byte string
must meet to be UTF-8. We settled on the following:
- Valid unicode code points are 0..0x10FFFF, except that
- The surrogates 0xD800..0xDFFF are not valid code points, and
- Valid UTF-8 must be a shortest encoding of a valid unicode code point.
In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).
Based on feedback on the earlier patches for unicode/UTF-8 support, we
References, please. I don't recall any serious discussion on this
topic since Barry posted the unicode-CI patches back in 2008, and I
doubt anyone remembers the details of those discussions....
Post by Ben Myers
decided that a filename that does not match the above criteria should be
treated as a binary blob, as opposed to being rejected. To stress: if any
part of the string isn't valid UTF-8, then the entire string is treated
as a binary blob. This matters once normalization is considered.
So we accept invalid unicode in filenames, but only after failing to
parse them? Isn't this a potential vector for exploiting weaknesses
in application filename handling? i.e. unprivileged user writes
specially crafted invalid unicode filename to disk, setuid program
tries to parse it, invalid sequence triggers a buffer overflow bug
in setuid parser?
Post by Ben Myers
When comparing unicode strings for equality, normalization comes into play:
we must compare the normalized forms of strings, not just the raw sequences
of bytes. There are a number of defined normalization forms for unicode.
We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
because calculating NFC requires calculating NFD first, followed by an
additional step. NFKD was chosen over NFD because this makes filenames
that ought to be equal compare as equal.
But are they really equal?

Choosing *compatibility* decomposition over *canonical*
decomposition means that compound characters and formatting
distinctions don't affect the hash. i.e. "of'fi'ce", "o'ffi'ce" and
"office" all hash and compare as the same name, but then they get
stored on disk unnormalised. So they are the "same" in memory, but
very different on disk.
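
To make the ligature case concrete, here is a minimal sketch (not code from the patch set). It assumes that only the five Latin ligatures U+FB00..U+FB04 need expanding, which is a tiny subset of real compatibility decomposition, and the helper names fold_ligatures() and names_match() are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Compatibility decompositions of U+FB00..U+FB04 (ff, fi, fl, ffi, ffl). */
static const char *const lig_expansion[] = { "ff", "fi", "fl", "ffi", "ffl" };

/*
 * Hypothetical helper: copy the NUL-terminated string src into dst,
 * expanding the ligatures above.  Returns the output length, or
 * (size_t)-1 if dst is too small.
 */
static size_t
fold_ligatures(const char *src, char *dst, size_t dstsize)
{
	size_t out = 0;

	if (dstsize == 0)
		return (size_t)-1;
	while (*src) {
		const unsigned char *u = (const unsigned char *)src;
		const char *piece = src;
		size_t piecelen = 1;
		size_t consumed = 1;

		/* U+FB00..U+FB04 encode as EF AC 80 .. EF AC 84. */
		if (u[0] == 0xEF && u[1] == 0xAC &&
		    u[2] >= 0x80 && u[2] <= 0x84) {
			piece = lig_expansion[u[2] - 0x80];
			piecelen = strlen(piece);
			consumed = 3;
		}
		if (out + piecelen >= dstsize)
			return (size_t)-1;
		memcpy(dst + out, piece, piecelen);
		out += piecelen;
		src += consumed;
	}
	dst[out] = '\0';
	return out;
}

/* Compare two names after folding; fall back to a byte compare on overflow. */
static int
names_match(const char *a, const char *b)
{
	char na[256], nb[256];

	if (fold_ligatures(a, na, sizeof(na)) == (size_t)-1 ||
	    fold_ligatures(b, nb, sizeof(nb)) == (size_t)-1)
		return strcmp(a, b) == 0;
	return strcmp(na, nb) == 0;
}
```

With this sketch, names_match("o\xEF\xAC\x83" "ce", "office") is nonzero even though the two byte strings differ, which is exactly the "same in memory, different on disk" situation described above.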

I note that the unicode spec says this for normalised forms
(11.1):

"A normalized string is guaranteed to be stable; that is, once
normalized, a string is normalized according to all future versions
of Unicode."

So if we store normalised strings on disk, they are guaranteed to
be compatible with all future versions of unicode and anything that
goes to use them. So why wouldn't we store normalised forms on disk?

As another point to note and discuss, from the unicode standard:

"Normalization Forms KC and KD must not be blindly applied to
arbitrary text. [...] It is best to think of these Normalization
Forms as being like uppercase or lowercase mappings: useful in
certain contexts for identifying core meanings, but also performing
modifications to the text that may not always be appropriate."

I'd consider file names to be mostly "arbitrary text" - we currently
treat them as opaque blobs and don't try to interpret them (apart
from '/' delimiters) and so they can contain arbitrary text....
Post by Ben Myers
My favorite example is the ways
"office" can be spelled, when "fi" or "ffi" ligatures are used. NFKDI adds
one more step to NFKD, in that it eliminates the code points that have the
Default_Ignorable_Code_Point property from the comparison. These code
points are as a rule invisible, but might (or might not) be pulled in when
you copy/paste a string to be used as a filename. An example of these is
U+00AD SOFT HYPHEN, a code point that only shows up if a word is split
across lines.
This extension does not appear to be specified by the unicode
standard - this seems like a dangerous thing to do when considering
compatibility with future unicode standards - we are not in the
business of extend-and-embrace here. Anyway, what happens if a
user actually wants a filename with a Default_Ignorable_Code_Point
character in it?

IMO, if cut-n-paste modifies the string being cut-n-pasted, then
that's a bug in the cut-n-paste application. I'd much prefer we use
a normalisation type that is defined by the standard than to invent
a new one to work around problems that may not even exist.
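
For readers unfamiliar with what "ignoring" a code point means in practice, a rough sketch follows. It is an illustration only: it handles just U+00AD SOFT HYPHEN (UTF-8 bytes C2 AD) as a stand-in for the whole Default_Ignorable_Code_Point set, and strip_soft_hyphens() is a hypothetical name, not a function from the patches:

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy the NUL-terminated string src into dst,
 * dropping every U+00AD SOFT HYPHEN.  Returns the output length, or
 * (size_t)-1 if dst is too small.
 */
static size_t
strip_soft_hyphens(const char *src, char *dst, size_t dstsize)
{
	size_t out = 0;

	if (dstsize == 0)
		return (size_t)-1;
	while (*src) {
		const unsigned char *u = (const unsigned char *)src;

		/* U+00AD is encoded as the two bytes C2 AD. */
		if (u[0] == 0xC2 && u[1] == 0xAD) {
			src += 2;
			continue;
		}
		if (out + 1 >= dstsize)
			return (size_t)-1;
		dst[out++] = *src++;
	}
	dst[out] = '\0';
	return out;
}
```

Under a scheme like this, a name containing a soft hyphen and the same name without it would compare as equal, so a user could not create both in one directory.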
Post by Ben Myers
If a filename is considered to be binary blob, comparison is based on a
simple binary match. Normalization does not apply to any part of a blob.
See above: if we have unicode enabled, I think that we should reject
invalid unicode in filenames at normalisation time.
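
As a point of reference for what "reject" would have to check, the three criteria from the design notes can be sketched as a standalone validator. This is an assumption-laden illustration, not how the patches do it (they validate via trie lookups); utf8_is_valid() is a hypothetical name, and embedded NUL bytes are not treated specially here:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return true if the len bytes at s are valid UTF-8,
 * i.e. every sequence is the shortest encoding of a code point in
 * 0..0x10FFFF that is not a surrogate (0xD800..0xDFFF).
 */
static bool
utf8_is_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		unsigned int cp, min;
		size_t n, k;

		if (c < 0x80) {			/* 0xxxxxxx */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {	/* 110xxxxx */
			cp = c & 0x1F; n = 2; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {	/* 1110xxxx */
			cp = c & 0x0F; n = 3; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {	/* 11110xxx */
			cp = c & 0x07; n = 4; min = 0x10000;
		} else {
			return false;	/* stray continuation or 5/6-byte lead */
		}
		if (len - i < n)
			return false;	/* truncated sequence */
		for (k = 1; k < n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return false;	/* not 10xxxxxx */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return false;	/* not the shortest encoding */
		if (cp > 0x10FFFF)
			return false;	/* beyond the Unicode range */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return false;	/* surrogate */
		i += n;
	}
	return true;
}
```

For example, the overlong two-byte encoding of '/' (C0 AF) and the encoded surrogate ED A0 80 both fail this check.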
Post by Ben Myers
The code uses ("leverages", in corp-speak) the existing infrastructure for
case-insensitive filenames. Like the CI code, the name used to create a
file is stored on disk, and returned in a lookup. When comparing filenames
the normalized forms of the names being compared are generated on the fly
from the non-normalized forms stored on disk.
Again, why not store normalised forms on disk and avoid the need to
generate normalised forms for dirents being read from disk every
time they must be compared?
Post by Ben Myers
If the borgbit (the bit enabling legacy ASCII-based CI) is set in the
superblock, then case folding is added into the mix. This normalization
form we call NFKDICF. It allows for the creation of case-insensitive
filesystems with UTF-8 support.
Different languages have different case folding rules e.g. the upper
case character might be the same, but the lower case character is
different (or vice versa). Where are the language specific case
folding tables being stored? And speaking of language support, how
does this interact with the kernel NLS subsystem?
Post by Ben Myers
-----------------------------------------------------------------------------
Implementation notes.
Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size. The
trie is not checked in: instead we add the source files from the Unicode
Character Database and a program that creates the header containing the
trie.
This is rather unappealing. Distros would have to take this code
size penalty if they decide one user needs that support. The other
millions of users pay that cost even if they don't want it. And
then there's validation - how are we supposed to validate that a
250k binary blob is correct and free of issues on every compiler and
architecture that the kernel is built on?
Post by Ben Myers
The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.
The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.
The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.
And so back to the stability of normalised forms: if the normalised
forms are stable and the trie encodes the version of codepoints,
then the data in the leaves of the trie itself must be stable. i.e.
even for future versions of the standards, all the leaves that are
there now will be there in the future. What is valid unicode now
will remain valid unicode.

And given that, why do we need to carry the trie around in the
compiled kernel? We have a perfectly good mechanism for storing
large chunks of long-term stable metadata that we can access easily:
in files.

IOWs, the trie is really a property of the filesystem, not the
kernel or userspace tools. If we ever want to update to a new
version of unicode, we can compile a new trie and have mkfs write
that into new filesystems, and maybe add an xfs_repair function that
allows migration to a new trie on an existing filesystem. But if we
carry it in the kernel then there will be interesting issues with
upgrade/downgrade compatibility with new tries. Better to prevent
those simply by having the trie be owned by the filesystem, not the
kernel.

Hence I think the trie should probably be stored on disk in the
filesystem. It gets calculated and written by mkfs into a file
attached to the superblock, and the only code that needs to go into
the kernel is the code needed to read it into memory and walk it.

That means we don't need 3,000 lines of nasty trie generation code
in the kernel, we don't bloat the kernel unnecessarily with a binary
blob, we don't need to build code with data from unverifiable
sources directly into the kernel, we can support different versions
of unicode easily, and so on.
Post by Ben Myers
The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.
Precisely my point - it's nasty, tricky code, and getting it wrong
is a potential security vulnerability. Exactly how are we expected
to review >3,000 lines of unicode/utf-8 minutiae without having to
become unicode encoding experts?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Olaf Weber
2014-09-12 11:55:35 UTC
Post by Dave Chinner
Post by Ben Myers
Hi,
I'm posting this RFC on Olaf's behalf, as he is busy with other projects.
Ok, but I'd prefer to have Olaf discuss the finer points rather than
have to play Chinese whispers through you. :/
I am on this mailing list, and I am trying to follow along, but I do have
other calls on my time.
Post by Dave Chinner
Post by Ben Myers
First is a series of kernel patches, then a series of patches for
xfsprogs, and then a test.
Seeing as this is something out of the blue (i.e. nobody has made a
mention of this functionality in the past couple of years), I think
we need to look at design and architecture first before spending any
time commenting on the code.
Post by Ben Myers
Note that I have removed the unicode database files prior to posting due
to their large size. There are instructions on how to download them in
the relevant commit headers.
Which leads to an interesting issue: these files do not have
cryptographically verifiable signatures. How can I trust them? I
can't even access unicode.org via https, so I can't even be certain
that I'm downloading from the site I think I'm downloading from....
As Ben noted, the reason to not include them in these emails is their size:

$ wc fs/xfs/support/ucd/*
   1273   12288   68009 fs/xfs/support/ucd/CaseFolding-7.0.0.txt
   1470   14166   98263 fs/xfs/support/ucd/DerivedAge-7.0.0.txt
   2368   22320  145072 fs/xfs/support/ucd/DerivedCombiningClass-7.0.0.txt
  10794  123871  899859 fs/xfs/support/ucd/DerivedCoreProperties-7.0.0.txt
     50     318    2040 fs/xfs/support/ucd/NormalizationCorrections-7.0.0.txt
  18635  332441 2457187 fs/xfs/support/ucd/NormalizationTest-7.0.0.txt
     33      86    1364 fs/xfs/support/ucd/README
  27268  120686 1509570 fs/xfs/support/ucd/UnicodeData-7.0.0.txt
  61891  626176 5181364 total

As for your remarks about cryptographic signatures, I'm not sure I see your
point there. Just to be clear: the idea is to check the files in, as opposed
to having to download them from unicode.org prior to compiling XFS.
Post by Dave Chinner
Post by Ben Myers
-----------------------------------------------------------------------------
Unicode/UTF-8 support for XFS
So we had a customer request proper unicode support...
Design notes.
XFS uses byte strings for filenames, so UTF-8 is the expected format for
unicode filenames. This does raise the question what criteria a byte string
must meet to be UTF-8. We settled on the following:
- Valid unicode code points are 0..0x10FFFF, except that
- The surrogates 0xD800..0xDFFF are not valid code points, and
- Valid UTF-8 must be a shortest encoding of a valid unicode code point.
In addition, U+0 (ASCII NUL, '\0') is used to terminate byte strings (and
is itself not part of the string). Moreover strings may be length-limited
in addition to being NUL-terminated (there is no such thing as an embedded
NUL in a length-limited string).
Based on feedback on the earlier patches for unicode/UTF-8 support, we
References, please. I don't recall any series discussion on this
topic since Barry posted the unicode-CI patches back in 2008, and I
doubt anyone remembers the details of those discussions....
I looked up those discussions in the archives. For example, here's
Christoph about rejecting filenames if they're not well-formed unicode.
http://marc.info/?l=linux-fsdevel&m=120876935526856&w=2
And Jamie Lokier making a similar point:
http://oss.sgi.com/archives/xfs/2008-04/msg01263.html
Post by Dave Chinner
Post by Ben Myers
decided that a filename that does not match the above criteria should be
treated as a binary blob, as opposed to being rejected. To stress: if any
part of the string isn't valid UTF-8, then the entire string is treated
as a binary blob. This matters once normalization is considered.
So we accept invalid unicode in filenames, but only after failing to
parse them? Isn't this a potential vector for exploiting weaknesses
in application filename handling? i.e. unprivileged user writes
specially crafted invalid unicode filename to disk, setuid program
tries to parse it, invalid sequence triggers a buffer overflow bug
in setuid parser?
Yes, this means that userspace must be capable of handling filenames that
are not well-formed UTF-8 and a whole slew of other edge cases. Same as
today really.
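
As a rough illustration of the "valid UTF-8 or binary blob" rule quoted above, here is a minimal Python sketch. It is not the patch's code: the function names are made up for this example, Python's strict UTF-8 decoder stands in for the trie-based validation, plain NFKD stands in for NFKDI, and treating an embedded NUL as invalid is an assumption of the sketch.

```python
import unicodedata


def is_wellformed_utf8(name: bytes) -> bool:
    """Return True if name passes the validity criteria from the design notes.

    Python's strict decoder rejects surrogates, code points above 0x10FFFF,
    and non-shortest (overlong) encodings. Rejecting NUL is an assumption
    made for this sketch.
    """
    if b"\x00" in name:
        return False
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def names_match(a: bytes, b: bytes) -> bool:
    """Compare two filenames the way the design notes describe.

    If both names are well-formed UTF-8, compare their normalized forms.
    Otherwise at least one of them is a blob, and only an exact byte match
    counts.
    """
    if is_wellformed_utf8(a) and is_wellformed_utf8(b):
        na = unicodedata.normalize("NFKD", a.decode("utf-8"))
        nb = unicodedata.normalize("NFKD", b.decode("utf-8"))
        return na == nb
    return a == b
```

With this sketch, a name spelled with the U+FB03 "ffi" ligature matches plain "office", while a Latin-1 encoded name only ever matches its own exact bytes.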
Post by Dave Chinner
Post by Ben Myers
When comparing unicode strings for equality, normalization comes into play:
we must compare the normalized forms of strings, not just the raw sequences
of bytes. There are a number of defined normalization forms for unicode.
We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
because calculating NFC requires calculating NFD first, followed by an
additional step. NFKD was chosen over NFD because this makes filenames
that ought to be equal compare as equal.
But are they really equal?
Choosing *compatibility* decomposition over *canonical*
decomposition means that compound characters and formatting
distinctions don't affect the hash. i.e. "of'fi'ce", "o'ffi'ce" and
"office" all hash and compare as the same name, but then they get
stored on disk unnormalised. So they are the "same" in memory, but
very different on disk.
I note that the unicode spec says this for normalised forms
"A normalized string is guaranteed to be stable; that is, once
normalized, a string is normalized according to all future versions
of Unicode."
Provided no unassigned codepoints are present in that string.
Post by Dave Chinner
So if we store normalised strings on disk, they are guaranteed to
be compatible with all future versions of unicode and anything that
goes to use them. So why wouldn't we store normalised forms on disk?
Because, based on what I read around the web, I expect a good deal of
resistance to the idea that a filesystem, on a lookup of a file you just
created, will return a name that is different-but-equivalent.

Think of it as the equivalent of being case-preserving for a
case-insensitive filesystem.

An alternative would be to store each filename twice: both raw and
normalized forms.
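
The "normalization-preserving" behaviour described above can be sketched in a few lines of Python. This is only a toy model, assuming a hypothetical `Directory` class and plain NFKD as the comparison key; the real code compares on-disk dirents, not dictionary entries.

```python
import unicodedata


class Directory:
    """Toy directory: compares by normalized form, returns the name as created."""

    def __init__(self):
        # Maps the normalized key to the raw name supplied at creation time.
        self._entries = {}

    @staticmethod
    def _key(name: str) -> str:
        return unicodedata.normalize("NFKD", name)

    def create(self, name: str) -> bool:
        """Create an entry; fail if an equivalent name already exists."""
        key = self._key(name)
        if key in self._entries:
            return False
        self._entries[key] = name
        return True

    def lookup(self, name: str):
        """Return the stored (raw) name of an equivalent entry, or None."""
        return self._entries.get(self._key(name))
```

Creating a file with the ligature spelling and then looking up "office" returns the ligature spelling, which is the analogue of a case-preserving, case-insensitive filesystem.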
Post by Dave Chinner
"Normalization Forms KC and KD must not be blindly applied to
arbitrary text. [...] It is best to think of these Normalization
Forms as being like uppercase or lowercase mappings: useful in
certain contexts for identifying core meanings, but also performing
modifications to the text that may not always be appropriate."
I'd consider file names to be mostly "arbitrary text" - we currently
treat them as opaque blobs and don't try to interpret them (apart
from '/' delimiters) and so they can contain arbitrary text....
My reading of this part of the unicode standard is that applying a
compatibility normalization results in strings that materially differ from
the originals, and no full equivalent of the original can be reconstructed
from the normalized form. This makes it improper for a word processor to
normalize to NFKC or NFKD before saving a file.

For the same reason, it would not be proper to store the NFKD version of a
filename on disk without some method to retrieve (an equivalent of) the
original.
Post by Dave Chinner
Post by Ben Myers
My favorite example is the ways
"office" can be spelled, when "fi" or "ffi" ligatures are used. NFKDI adds
one more step to NFKD, in that it eliminates the code points that have the
Default_Ignorable_Code_Point property from the comparison. These code
points are as a rule invisible, but might (or might not) be pulled in when
you copy/paste a string to be used as a filename. An example of these is
U+00AD SOFT HYPHEN, a code point that only shows up if a word is split
across lines.
This extension does not appear to be specified by the unicode
standard - this seems like a dangerous thing to do when considering
compatibility with future unicode standards - we are not in the
business of extend-and-embrace here. Anyway, what happens if a
user actually wants a filename with a Default_Ignorable_Code_Point
character in it?
Such a filename can be created, and since the raw form of the name is stored
on disk, when the filename is read back the Default_Ignorable_Code_Point
will still be there. It just doesn't count when comparing names for equality.
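
A small Python sketch of what "doesn't count when comparing" means in practice. Python's `unicodedata` module does not expose the Default_Ignorable_Code_Point property, so the set below is a hand-picked illustrative subset and the function name is hypothetical; the full list lives in DerivedCoreProperties.txt.

```python
import unicodedata

# Illustrative subset of Default_Ignorable_Code_Point, NOT the full property:
# SOFT HYPHEN, ZERO WIDTH SPACE, ZWNJ, ZWJ, WORD JOINER, ZERO WIDTH NO-BREAK SPACE.
IGNORABLE_SAMPLE = {"\u00ad", "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def nfkdi_like_key(name: str) -> str:
    """Build a comparison key: NFKD, then drop the sample ignorable code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch not in IGNORABLE_SAMPLE)


raw = "of\u00adfice"  # contains U+00AD SOFT HYPHEN
# The stored name keeps the soft hyphen; only the comparison key drops it.
same_file = nfkdi_like_key(raw) == nfkdi_like_key("office")
```

The raw string is never modified, so reading the directory back still shows the name with the soft hyphen in it.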
Post by Dave Chinner
IMO, if cut-n-paste modifies the string being cut-n-pasted, then
that's a bug in the cut-n-paste application. I'd much prefer we use
a normalisation type that is defined by the standard than to invent
a new one to work around problems that may not even exist.
Post by Ben Myers
If a filename is considered to be a binary blob, comparison is based on a
simple binary match. Normalization does not apply to any part of a blob.
See above: if we have unicode enabled, I think that we should reject
invalid unicode in filenames at normalisation time.
That was my original intent, which I abandoned based on the emails linked to
above.
Post by Dave Chinner
Post by Ben Myers
The code uses ("leverages", in corp-speak) the existing infrastructure for
case-insensitive filenames. Like the CI code, the name used to create a
file is stored on disk, and returned in a lookup. When comparing filenames
the normalized forms of the names being compared are generated on the fly
from the non-normalized forms stored on disk.
Again, why not store normalised forms on disk and avoid the need to
generate normalised forms for dirents being read from disk every
time they must be compared?
Post by Ben Myers
If the borgbit (the bit enabling legacy ASCII-based CI) is set in the
superblock, then case folding is added into the mix. This normalization
form we call NFKDICF. It allows for the creation of case-insensitive
filesystems with UTF-8 support.
Different languages have different case folding rules e.g. the upper
case character might be the same, but the lower case character is
different (or vice versa). Where are the language specific case
folding tables being stored? And speaking of language support, how
does this interact with the kernel NLS subsystem?
I use a full case fold as per CaseFolding.txt to obtain a result that is
consistent and (in my opinion) good enough.

Since XFS has no NLS mount options, there is no interaction with the NLS
subsystem.
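
To make the "full case fold, no language tailoring" choice concrete, here is a hedged Python sketch. Python's `str.casefold()` applies Unicode full case folding without locale tailoring, which is the same kind of mapping CaseFolding.txt describes; the function name and the normalize/fold/normalize ordering are assumptions of this example, not taken from the patches.

```python
import unicodedata


def nfkd_casefold_key(name: str) -> str:
    """Comparison key: NFKD, full case fold, then NFKD again.

    The second normalization pass is there because folding can produce
    characters that themselves decompose.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    folded = decomposed.casefold()
    return unicodedata.normalize("NFKD", folded)
```

Under this key "Straße" and "STRASSE" compare equal, and "I" always folds to "i" regardless of locale, so language-specific rules such as the Turkish dotless i are deliberately not honoured.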
Post by Dave Chinner
Post by Ben Myers
-----------------------------------------------------------------------------
Implementation notes.
Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size. The
trie is not checked in: instead we add the source files from the Unicode
Character Database and a program that creates the header containing the
trie.
This is rather unappealing. Distros would have to take this code
size penalty if they decide one user needs that support. The other
millions of users pay that cost even if they don't want it. And
then there's validation - how are we supposed to validate that a
250k binary blob is correct and free of issues on every compiler and
architecture that the kernel is built on?
If your concern is that the generator might create bad blobs on some
architectures, then there are ways around that: checksums, checking in a
reference blob, or maybe something else.
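
The checksum option could look roughly like the following Python sketch. The function and parameter names are hypothetical, and SHA-256 is just one possible choice of digest; nothing here is taken from the patch series.

```python
import hashlib


def blob_matches_reference(blob: bytes, reference_sha256: str) -> bool:
    """Check a generated trie blob against a checked-in reference digest."""
    return hashlib.sha256(blob).hexdigest() == reference_sha256
```

A build step would run the generator, hash its output, and fail if the digest differs from the one recorded alongside the sources, which catches a generator that produces different output on some architecture or compiler.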

As for size in general, looking at the NLS support I do not consider it to
be excessively big (as in, it is a bit less than 2 times the size of the
largest NLS module). Obviously opinions can differ on this.
Post by Dave Chinner
Post by Ben Myers
The key for a lookup in the trie is a UTF-8 sequence. Each valid UTF-8
sequence leads to a leaf. No invalid sequence does. This means that trie
lookups can be used to validate UTF-8 sequences, which is why there is no
specialized code for the same purpose.
The trie contains information for the version of unicode in which each
code point was defined. This matters because non-normalized strings are
stored on disk, and newer versions of unicode may introduce new normalized
forms. Ideally, the version of unicode used by the filesystem is stored in
the filesystem.
The trie also accounts for corrections made in the past to normalizations.
This has little value today, because any newly created filesystem would be
using unicode version 7.0.0. It is included in order to show, not tell,
that such corrections can be handled if they are added in future revisions.
And so back to the stability of normalised forms: if the normalised
forms are stable and the trie encodes the version of codepoints,
then the data in the leaves of the trie itself must be stable. i.e.
even for future versions of the standards, all the leaves that are
there now will be there in the future. What is valid unicode now
will remain valid unicode.
The set of valid unicode code points is known and stable: 0..0x10FFFF minus
0xD800..0xDFFF. However, the set of assigned code points grows with each
revision of the unicode standard. Note that there is an explicit limitation
on the stability of normalized strings: they are stable if, and only if, no
unassigned codepoints are present in the string.
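
The stability caveat can be checked mechanically. The sketch below uses Python's `unicodedata`, whose answer depends on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`), much as the trie's answer depends on the version it was generated from. The function name is hypothetical, and general category "Cn" covers both unassigned code points and noncharacters.

```python
import unicodedata


def has_unassigned(name: str) -> bool:
    """True if any code point has general category Cn in this Unicode version."""
    return any(unicodedata.category(ch) == "Cn" for ch in name)
```

A name for which this returns True has no stability guarantee: a later Unicode version may assign the code point and give it a decomposition, changing the name's normalized form.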
Post by Dave Chinner
And given that, why do we need to carry the trie around in the
compiled kernel? We have a perfectly good mechanism for storing
large chunks of long-term stable metadata that we can access easily:
in files.
IOWs, the trie is really a property of the filesystem, not the
kernel or userspace tools. If we ever want to update to a new
version of unicode, we can compile a new trie and have mkfs write
that into new filesystems, and maybe add an xfs_repair function that
allows migration to a new trie on an existing filesystem. But if we
carry it in the kernel then there will be interesting issues with
upgrade/downgrade compatibility with new tries. Better to prevent
those simply by having the trie be owned by the filesystem, not the
kernel.
Hence I think the trie should probably be stored on disk in the
filesystem. It gets calculated and written by mkfs into a file
attached to the superblock, and the only code that needs to go into
the kernel is the code needed to read it into memory and walk it.
That means we don't need 3,000 lines of nasty trie generation code
in the kernel, we don't bloat the kernel unnecessarily with a binary
blob, we don't need to build code with data from unverifiable
sources directly into the kernel, we can support different versions
of unicode easily, and so on.
Storing the trie in the filesystem is certainly an option, as is making XFS
UTF-8 support a config option.
Post by Dave Chinner
Post by Ben Myers
The algorithm used to calculate the sequences of bytes for the normalized
form of a UTF-8 string is tricky. The core is found in utf8byte(), with an
explanation in the preceding comment.
Precisely my point - it's nasty, tricky code, and getting it wrong
is a potential security vulnerability. Exactly how are we expected
to review >3,000 lines of unicode/utf-8 minutiae without having to
become unicode encoding experts?
The bits and pieces that are specific to unicode are smaller than that; much
of the complication of the generator is due to the work required to reduce
the size of the trie. The generator is included because we felt that
offering a large binary blob for checkin would also run into resistance.

Olaf
--
Olaf Weber SGI Phone: +31(0)30-6696796
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-6796
Storage Software The Netherlands Email: ***@sgi.com
Christoph Hellwig
2014-09-12 20:55:28 UTC
Post by Olaf Weber
I looked up those discussions in the archives. For example, here's
Christoph about rejecting filenames if they're not well-formed unicode.
http://marc.info/?l=linux-fsdevel&m=120876935526856&w=2
And Jamie Lokier making a similar point:
http://oss.sgi.com/archives/xfs/2008-04/msg01263.html
And I might now disagree with my past self. While non-utf8 characters
are perfectly valid unix filenames, and I think everyone's life is easier
if we generally stay out of the utf8 business, it seems that for this
particular use case (shared filesystem with Windows, right) just
accepting utf8 should be fine. ZFS is doing so, MacOS X apparently is,
and NFSv4 requires it, although as far as I know most implementations
ignore that requirement.
Olaf Weber
2014-09-15 07:16:24 UTC
Post by Christoph Hellwig
Post by Olaf Weber
I looked up those discussions in the archives. For example, here's
Christoph about rejecting filenames if they're not well-formed unicode.
http://marc.info/?l=linux-fsdevel&m=120876935526856&w=2
And Jamie Lokier making a similar point:
http://oss.sgi.com/archives/xfs/2008-04/msg01263.html
And I might now disagree with my past self. While non-utf8 characters
are perfectly valid unix filenames, and I think everyone's life is easier
if we generally stay out of the utf8 business, it seems that for this
particular use case (shared filesystem with Windows, right) just
accepting utf8 should be fine. ZFS is doing so, MacOS X apparently is,
and NFSv4 requires it, although as far as I know most implementations
ignore that requirement.
One issue is working in environments that are not UTF-8 clean. For example,
unpacking a tarball with non-UTF-8 filenames in it. The names would have to
be transcoded, which is only really possible if you know the original
character set. And if the filesystem flat out rejects non-UTF-8 filenames,
then you'd be unable to unpack the tarball at all.
--
Olaf Weber SGI Phone: +31(0)30-6696796
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-6796
Storage Software The Netherlands Email: ***@sgi.com
Dave Chinner
2014-09-16 20:54:06 UTC
Post by Olaf Weber
Post by Christoph Hellwig
Post by Olaf Weber
I looked up those discussions in the archives. For example, here's
Christoph about rejecting filenames if they're not well-formed unicode.
http://marc.info/?l=linux-fsdevel&m=120876935526856&w=2
And Jamie Lokier making a similar point:
http://oss.sgi.com/archives/xfs/2008-04/msg01263.html
And I might now disagree with my past self. While non-utf8 characters
are perfectly valid unix filenames, and I think everyone's life is easier
if we generally stay out of the utf8 business, it seems that for this
particular use case (shared filesystem with Windows, right) just
accepting utf8 should be fine. ZFS is doing so, MacOS X apparently is,
and NFSv4 requires it, although as far as I know most implementations
ignore that requirement.
One issue is working in environments that are not UTF-8 clean. For
example, unpacking a tarball with non-UTF-8 filenames in it. The
names would have to be transcoded, which is only really possible if
you know the original character set. And if the filesystem flat out
rejects non-UTF-8 filenames, then you'd be unable to unpack the
tarball at all.
So how do existing utf8/unicode enabled filesystems handle this?

I think we should be consistent with ZFS, MacOS and others that
already deal with this problem if at all possible. However, this
really is a wider policy decision for the kernel/VFS as we want
consistent behaviour across all linux filesystems, hence this
patchset really needs to be discussed at the lkml/-fsdevel level...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2014-09-16 21:02:35 UTC
Post by Dave Chinner
So how do existing utf8/unicode enabled filesystems handle this?
I think we should be consistent with ZFS, MacOS and others that
already deal with this problem if at all possible. However, this
really is a wider policy decision for the kernel/VFS as we want
consistent behaviour across all linux filesystems, hence this
patchset really needs to be discussed at the lkml/-fsdevel level...
Absolutely. I've also talked to a few Samba folks at SDC, and one
thing they would love to see is conditional case insensitive lookups,
e.g.:

- we hash case insensitive with collisions, but perform normal case
sensitive lookups.
- with a new AT_CASE_INSENSTIVE flag to the various *at calls that
gets passed down to the dcache we enable CI lookups.
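
The two points above can be modelled with a small Python sketch. Everything in it is hypothetical: the `Dir` class, the `case_insensitive` keyword standing in for the proposed per-call flag, and the choice to return the first stored candidate on a case-insensitive hit. It only illustrates the idea of hashing on the folded name while comparing exactly by default.

```python
from collections import defaultdict


class Dir:
    """Toy directory hashed on the case-folded name, compared exactly by default."""

    def __init__(self):
        # Names that differ only in case share one bucket (a deliberate collision).
        self._buckets = defaultdict(list)

    def create(self, name: str) -> bool:
        bucket = self._buckets[name.casefold()]
        if name in bucket:
            return False
        bucket.append(name)
        return True

    def lookup(self, name: str, case_insensitive: bool = False):
        bucket = self._buckets.get(name.casefold(), [])
        # Normal lookups stay case-sensitive: only an exact match counts.
        for stored in bucket:
            if stored == name:
                return stored
        # With the flag set, any entry in the same bucket is acceptable.
        if case_insensitive and bucket:
            return bucket[0]
        return None
```

Because "README" and "readme" land in the same bucket, a case-insensitive caller can find either one without a directory scan, while ordinary callers see unchanged case-sensitive semantics.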
Ben Myers
2014-09-16 21:42:50 UTC
Hey Gents,
Post by Christoph Hellwig
Post by Dave Chinner
So how do existing utf8/unicode enabled filesystems handle this?
I think we should be consistent with ZFS, MacOS and others that
already deal with this problem if at all possible.
Here's a data point from man(zfs):

The following three properties cannot be changed after the file system
is created, and therefore, should be set when the file system is
created. If the properties are not set with the "zfs create" or "zpool
create" commands, these properties are inherited from the parent
dataset. If the parent dataset lacks these properties due to having
been created prior to these features being supported, the new file
system will have the default values for these properties.

casesensitivity = sensitive | insensitive | mixed

Indicates whether the file name matching algorithm used by the file
system should be case-sensitive, case-insensitive, or allow a
combination of both styles of matching. The default value for the
"casesensitivity" property is "sensitive." Traditionally, UNIX and
POSIX file systems have case-sensitive file names.

The "mixed" value for the "casesensitivity" property indicates that
the file system can support requests for both case-sensitive and
case-insensitive matching behavior. Currently, case-insensitive
matching behavior on a file system that supports mixed behavior is
limited to the Solaris CIFS server product. For more information
about the "mixed" value behavior, see the ZFS Administration Guide.

normalization = none | formD | formKCf

Indicates whether the file system should perform a unicode
normalization of file names whenever two file names are compared, and
which normalization algorithm should be used. File names are always
stored unmodified, names are normalized as part of any comparison
process. If this property is set to a legal value other than
"none," and the "utf8only" property was left unspecified, the
"utf8only" property is automatically set to "on." The default value
of the "normalization" property is "none." This property cannot be
changed after the file system is created.

utf8only = on | off

Indicates whether the file system should reject file names that
include characters that are not present in the UTF-8 character code
set. If this property is explicitly set to "off," the normalization
property must either not be explicitly set or be set to "none." The
default value for the "utf8only" property is "off." This property
cannot be changed after the file system is created.

The "casesensitivity," "normalization," and "utf8only" properties are
also new permissions that can be assigned to non-privileged users by
using the ZFS delegated administration feature.

The original link:
https://www.freebsd.org/cgi/man.cgi?query=zfs&apropos=0&sektion=0&manpath=FreeBSD+8.1-RELEASE&format=html
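
The interaction between "normalization" and "utf8only" described in the excerpt can be written down as a small Python model. This is not ZFS code: the function name is invented, the accepted values are only those listed in the excerpt above, and raising `ValueError` is just this sketch's way of signalling a combination the man page disallows.

```python
def resolve_utf8only(normalization: str = "none", utf8only=None) -> str:
    """Return the effective utf8only setting per the man page excerpt.

    utf8only=None means the property was left unspecified.
    """
    if normalization not in ("none", "formD", "formKCf"):
        raise ValueError("unknown normalization value: " + normalization)
    if utf8only not in (None, "on", "off"):
        raise ValueError("unknown utf8only value: " + str(utf8only))
    # Explicit utf8only=off is only allowed without normalization.
    if utf8only == "off" and normalization != "none":
        raise ValueError("utf8only=off requires normalization=none")
    if utf8only is None:
        # Unspecified: normalization implies utf8only=on, otherwise default off.
        return "on" if normalization != "none" else "off"
    return utf8only
```

The point of interest for this thread is the first rule: once normalization is enabled, ZFS rejects names that are not UTF-8 rather than falling back to a binary comparison.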
Post by Christoph Hellwig
Post by Dave Chinner
However, this
really is a wider policy decision for the kernel/VFS as we want
consistent behaviour across all linux filesystems, hence this
patchset really needs to be discussed at the lkml/-fsdevel level...
Absolutely. I've also talked to a few Samba folks at SDC, and one
thing they would love to see is conditional case insensitive lookups,
e.g.:
- we hash case insensitive with collisions, but perform normal case
sensitive lookups.
- with a new AT_CASE_INSENSTIVE flag to the various *at calls that
gets passed down to the dcache we enable CI lookups.
I'm working on addressing some of the initial feedback and will be in a
position to post for a wider audience later in the week.

Thanks,
Ben

Josef 'Jeff' Sipek
2014-09-12 17:45:39 UTC
...
Post by Dave Chinner
Post by Ben Myers
When comparing unicode strings for equality, normalization comes into play:
we must compare the normalized forms of strings, not just the raw sequences
of bytes. There are a number of defined normalization forms for unicode.
We decided on a variant of NFKD we call NFKDI. NFD was chosen over NFC,
because calculating NFC requires calculating NFD first, followed by an
additional step. NFKD was chosen over NFD because this makes filenames
that ought to be equal compare as equal.
But are they really equal?
Choosing *compatibility* decomposition over *canonical*
decomposition means that compound characters and formatting
distinctions don't affect the hash. i.e. "of'fi'ce", "o'ffi'ce" and
"office" all hash and compare as the same name, but then they get
stored on disk unnormalised. So they are the "same" in memory, but
very different on disk.
I note that the unicode spec says this for normalised forms
"A normalized string is guaranteed to be stable; that is, once
normalized, a string is normalized according to all future versions
of Unicode."
So if we store normalised strings on disk, they are guaranteed to
be compatible with all future versions of unicode and anything that
goes to use them. So why wouldn't we store normalised forms on disk?
I've had a very similar discussion about normalization in ZFS. Sadly, I
can't find where it happened so I can't point you to it. One interesting
point that I remember is that storing the original form may be less
surprising to an application. Specifically, the name it reads back is the
same one it supplied during creation. (Granted, if the file already exists,
the application will read back the new form.)

Just FWIW.

Jeff.
--
Only two things are infinite, the universe and human stupidity, and I'm not
sure about the former.
- Albert Einstein
Christoph Hellwig
2014-09-12 20:53:11 UTC
Post by Dave Chinner
Post by Ben Myers
Implementation notes.
Strings are normalized using a trie that stores the relevant information.
The trie itself is part of the XFS module, and about 250kB in size. The
trie is not checked in: instead we add the source files from the Unicode
Character Database and a program that creates the header containing the
trie.
This is rather unappealing. Distros would have to take this code
size penalty if they decide one user needs that support. The other
millions of users pay that cost even if they don't want it. And
then there's validation - how are we supposed to validate that a
250k binary blob is correct and free of issues on every compiler and
architecture that the kernel is built on?
The way this needs to be done is to have a separate module for the
tables, which XFS or other users then can symbol_get if and only if
a mount requires it. The unicode tables should definitely be outside
of fs/xfs.

And please run this past lkml or -fsdevel, as people who actually
understand unicode and related issues are much more likely to be found
there than on the XFS list.