Author Topic: How to find and merge "near duplicate" strings (Read 2134 times)

Coleman Watts · November 13, 2014, 08:12:51 pm

I keep coming across strings in the code which are "almost" exactly the same, but just different enough to make translators' lives a pain. Some examples:
"Online Contribution" and "Online Contribution:"
"External Id" and "External ID"
"Middle Name" and "Middle-Name"
"Continue >>" and "Continue"

So not being a translator myself, my questions are:

How much better would life be for translators if we solved this problem?
How do we solve this problem?

mathieu · November 14, 2014, 07:35:08 am

This is a tough question!

* Sometimes we do want both "Online Contribution" and "Online Contribution:". By all means, never put the ':' separately. These must be two different strings. In French and some other languages, there is a half-space before the colon. It also helps to identify the string's context (in a poor way, but still). Some languages may use a different declension depending on whether it is a column label, a button, etc.

* "Continue >>" and "Continue" is another tough one. I would argue that ">>" should never be in a string. Random ascii strings should never be used as art. It's not good for accessibility either (imagine a screen reader saying "continue bigger than bigger than").

For the rest, I absolutely agree. I would normally encourage enforcing typographical rules, but I'm not sure that is really the source of the problem here (except for strings such as "External ID"). Some strings are clearly mistakes ("Middle-Name"), but they are not always easy to spot.

I wrote a quick and inefficient script to find such strings:
https://github.com/civicrm/l10n/blob/master/bin/find-similar-strings.php

You can test it by cloning the l10n repository, and running "echo po/pot/*.pot | php find-similar-strings.php". It will take a while, so for testing you can also just run it on a specific file, such as contribute.pot. On my computer (fairly fast), it would take about 15-20 mins to run on all 16k strings.

If it's useful, I would improve it to add things such as the printing of the file/line in the source code where the string was found.

Ideally, though, it would be nice if there was a simple way to add some strings to an "ignore" list. Then we could add it to jenkins and have it run regularly, or at least it could be run before strings are sent to Transifex.
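For illustration, here is a minimal Python sketch of the general idea (pairwise fuzzy comparison plus an "ignore" list). It is not the PHP script linked above and makes no claim about how that script works. The function name find_similar, the 0.9 threshold, and the shape of the ignore set are all assumptions made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar(strings, threshold=0.9, ignore=frozenset()):
    """Return (a, b, ratio) for each pair of distinct strings that look alike.

    `ignore` is a hypothetical allow-list: a set of frozenset({a, b}) pairs
    that are known to be intentionally different and should not be reported.
    """
    unique = sorted(set(strings))
    pairs = []
    # Every pair is compared once, so the cost grows quadratically
    # with the number of strings.
    for a, b in combinations(unique, 2):
        if frozenset((a, b)) in ignore:
            continue
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs


if __name__ == "__main__":
    sample = [
        "Online Contribution", "Online Contribution:",
        "External Id", "External ID",
        "Middle Name", "Middle-Name",
    ]
    # Pairs that are deliberately kept as two separate strings.
    keep = {frozenset(("Online Contribution", "Online Contribution:"))}
    for a, b, ratio in find_similar(sample, ignore=keep):
        print(f"{ratio:.0%}  {a!r} vs {b!r}")
```

Comparing every pair is simple but slow on a large string set, so a real run over thousands of strings would likely want some pre-grouping (for example by length or first word) before the fuzzy comparison.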

Coleman Watts · November 14, 2014, 11:31:45 am

When your computer finishes churning through it all could you share the output?

Coleman Watts · November 14, 2014, 11:33:27 am

Also, are those strings just from the 4.5 branch? No point including older strings from <= 4.4 - some of those I've already gone through and cleaned up recently so the 4.5 branch is probably a bit better than the others.

mathieu · November 14, 2014, 12:27:09 pm

Aha, right, those strings include many versions of CiviCRM (4.3-4.5), and it should run only on the latest version.

I'll try to extract the strings for 4.5 and will share the output. Thanks for the fixes & for raising the issue.

mathieu · November 14, 2014, 01:03:19 pm

Here is the output:
https://www.bidon.ca/files/tmp/civicrm-similar-strings.txt

Coleman Watts · November 14, 2014, 01:37:44 pm

Wow, that's a long list!
A couple questions:

What is the significance of the % after the first string? I think I'd find it more helpful to get a total count of both strings.
It looks like there are some duplicates in the list. E.g. I find "Contact" vs "Contacts" several times.
This seems useful as a "fuzzy" dedupe rule, and it seems like if we want to run automated tests for it we also need some "strict" rules. E.g. I think our test suite ought to report a failure if there are 2 strings that are exactly the same except for punctuation or case.
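A "strict" rule of this kind could be sketched roughly as follows in Python. This is only an illustration: the helper names normalize and strict_duplicates are invented here, and only ASCII punctuation is considered. Pairs that are intentionally distinct (such as a label with and without a trailing colon) would need some kind of allow-list on top of this.

```python
import string
from collections import defaultdict

# Map every ASCII punctuation character to a space, so "Middle-Name"
# and "Middle Name" end up with the same key.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Case-fold, replace ASCII punctuation with spaces, collapse whitespace."""
    return " ".join(text.casefold().translate(_PUNCT_TO_SPACE).split())


def strict_duplicates(strings):
    """Group strings that are identical except for punctuation or case."""
    groups = defaultdict(set)
    for s in strings:
        groups[normalize(s)].add(s)
    # Only groups with more than one spelling are worth reporting.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = ["External Id", "External ID", "Continue >>", "Continue", "Contact"]
    for group in strict_duplicates(sample):
        print(" | ".join(group))
```

A test suite could then fail whenever strict_duplicates returns a non-empty list for the extracted strings.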

Coleman Watts · November 16, 2014, 08:27:58 pm

Whew, I've just spent the better part of a day going through that massive list and fixing the obvious typos, mistakes, and inconsistencies. I've eliminated a few hundred unnecessary strings.

https://github.com/civicrm/civicrm-core/pull/4568

I also noticed that our very old upgrade code contains a ton of unused strings, and there's an easy fix to that:
https://github.com/civicrm/civicrm-core/pull/4565

A few lessons learned from this little exercise:

There are a ton of strings.
There is not a ton of consistency. Capitalization, hyphenation, punctuation, and terminology are all over the place.
My cleanup only scratched the surface of the larger "style conventions" we would need to be truly consistent.
For developers, it's a lot easier to introduce a new string than to reuse an existing one. For the former, all you have to do is type whatever you want. For the latter, you need to search the existing code and be at least a little thoughtful about style and consistency.

Some lingering questions:

Is there anything we can do to make consistency easier for developers? I know that Joomla has the radically different approach of using constants rather than a function like ts(), which enforces a lot of consistency and raises the barrier for introducing a new string. I doubt we'll adopt that approach, but what else could we do to make our language more standardized?
The work I've just done is intended to make translators' lives a bit less awful, but when will my changes take effect? I'm committing the changes to 4.6, but if we continue the practice of translating multiple versions of Civi at once, does that mean they will still see all the crufty old strings from 4.5 and below for years to come?

totten · November 17, 2014, 01:44:32 pm

Perhaps we could recruit someone at each sprint to act as "editor" of the English language strings? The main prerequisites are a good grasp on English grammar and some basic technical skill (edit HTML, run CLI commands) -- someone at the sprint could probably help them through the rest (e.g. getting the latest strings with Mathieu's script; installing git).

FWIW, Joomla's approach (using symbolic constants instead of English-language phrases) is nice in that it avoids unnecessary re-translation and reduces ambiguity for translators, but even if we went down that path... we'd still need an English-language editor to provide consistent phrasing in English.

A CI approach is interesting -- but really one can imagine a few alternative business-processes running through CI. For example:

1. Recruit an ongoing English-language editor. Set up Jenkins to produce a monthly, weekly, or per-PR report on strings that were added and removed. Send the reports to the English-language editor.
2. Write a bunch of heuristics for recognizing good/bad strings -- and use them as tests (e.g. if one of the heuristics fails, then flag the PR as red). I'm a little skeptical about this -- I have trouble imagining good rules, and I can't find much precedent (the closest -- http://stackoverflow.com/questions/42557/best-way-to-incorporate-spell-checkers-with-a-build-process ). But then again... I haven't been digging through the strings; maybe there really are useful patterns?
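As a rough illustration of option 1, a report of added and removed strings can be built from a set difference over two .pot files. The sketch below is an assumption-laden example, not an existing Jenkins job: msgids and string_report are made-up names, and the regex only picks up single-line msgid entries (multi-line entries and escape sequences are not handled).

```python
import re

# Matches lines such as: msgid "External Id"
# msgid_plural lines are not matched because a space must follow "msgid".
MSGID = re.compile(r'^msgid "(.*)"[ \t]*$', re.MULTILINE)


def msgids(pot_text):
    """Collect the non-empty single-line msgid values found in .pot text."""
    return {value for value in MSGID.findall(pot_text) if value}


def string_report(old_pot, new_pot):
    """Compare two .pot contents and list the strings added and removed."""
    old, new = msgids(old_pot), msgids(new_pot)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


if __name__ == "__main__":
    old = 'msgid "External Id"\nmsgstr ""\n\nmsgid "Contact"\nmsgstr ""\n'
    new = 'msgid "External ID"\nmsgstr ""\n\nmsgid "Contact"\nmsgstr ""\n'
    report = string_report(old, new)
    print("Added:  ", report["added"])
    print("Removed:", report["removed"])
```

The same report could feed either process: mailed to an editor, or used to limit heuristic checks to newly added strings.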

mathieu · November 17, 2014, 02:17:19 pm

Wow, thanks Coleman!!

I used to review the strings before pushing them to Transifex, but often I run out of time, and only review the most obvious mistakes. Sometimes reviewing strings opens a big can of usability or programming worms, but I guess that most of the typos etc would be a quick fix.

We could imagine a 1 week period (before a push to Transifex) where we post the list of strings and invite people to review before sending to Transifex?

Michael McAndrew · November 18, 2014, 09:48:05 am

Why not create some text style conventions for the interface - there are quite a few points in here that I think would be useful already.

I can't help at this point mentioning my love for Sentence case. Only problem is that I think dgg is equally enamoured with Title Case. What was that George Bernard Shaw quote again? "England and America are two countries divided by a Capitalisation Schema" - something like that anyway. Of course Drupal uses Sentence case but when have we ever followed their lead?!

Coleman Watts · November 18, 2014, 10:08:41 am

With regard to CI and Totten's questions, I think

Most discernment about whether a string is in the right format will have to be evaluated by a human.
But there are a few obvious things we could check automatically that we don't currently have unit tests for:
- No concatenated variables inside ts()
- No divs or other complex html tags (links are okay but imo the attributes including href should be a placeholder, <br> tags are iffy, <p> tags I think should always be avoided)
- As bgm pointed out, we should disallow ascii art such as "Continue >>" (some of our current strings would fail this test)

Does anyone have an idea about how the mechanics of this would work? How would the unit test extract the strings? Is it possible to have a PR unit test scan only the code that's changed?
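To make the three checks above more concrete, here is a hedged Python sketch using plain regular expressions. Everything in it is an assumption for illustration: the tag list (div, p, table), the ASCII-art pattern, and the function names are invented, and the concatenation check only looks at PHP source of the form ts('literal' . $something). It does not cover Smarty templates and does not address how a test run would be limited to changed code.

```python
import re

# Hypothetical list of tags treated as "complex"; links (<a>) are allowed.
HTML_BLOCK_TAG = re.compile(r"</?(?:div|p|table)\b", re.IGNORECASE)

# Hypothetical definition of ASCII art: doubled angle brackets.
ASCII_ART = re.compile(r">>|<<")

# A quoted literal as the first ts() argument, immediately followed by the
# PHP concatenation operator. Escaped quotes inside the literal are not handled.
CONCAT_IN_TS = re.compile(r"""\bts\(\s*(['"])(?:(?!\1).)*\1\s*\.""")


def check_string(text):
    """Return a list of problems found in one translatable string."""
    problems = []
    if HTML_BLOCK_TAG.search(text):
        problems.append("block-level HTML tag")
    if ASCII_ART.search(text):
        problems.append("ASCII art")
    return problems


def check_php_source(source):
    """Return the ts() call openings in PHP source that concatenate a value."""
    return [match.group(0) for match in CONCAT_IN_TS.finditer(source)]


if __name__ == "__main__":
    for text in ["Continue >>", "<p>Thank you</p>", "Click <a %1>here</a>"]:
        print(repr(text), check_string(text))
    print(check_php_source("echo ts('Hello ' . $name);"))
```

Regex checks like these are heuristics and will produce both false positives and misses; a proper PHP parser would be more reliable for the concatenation rule.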

Coleman Watts · November 27, 2014, 08:06:13 am

@MichaelMcandrew would you like to draft a "style guide"?

Coleman Watts · November 27, 2014, 08:07:26 am

@mathieu I'm still wondering about when the changes I've made will take hold. I'm happy to have eliminated so many unnecessary strings, but since I committed to "master," when will they actually disappear for our translators?

Dave Greenberg · December 01, 2014, 10:35:38 am

Quote from: Michael McAndrew on November 18, 2014, 09:48:05 am

I can't help at this point mentioning my love for Sentence case. Only problem is that I think dgg is equally enamoured with Title Case. What was that George Bernard Shaw quote again? "England and America are two countries divided by a Capitalisation Schema" - something like that anyway. Of course Drupal uses Sentence case but when have we ever followed their lead?!

I'm happy to give up on "Title Case" if folks think sentence case is more manageable (especially for translators).

CiviCRM Community Forums (archive)
