SaveFile / LoadFile fails if file contains ANSI chars > x'7F'

SyntaxEditor for Windows Forms Forum

Posted 4 years ago by Mike Dempsey - Sr. Developer, Teradata Corp
Version: 14.1.0323

Your 'default' LoadFile / SaveFile functions (where encoding is not specified) say they save/load UTF8 format.

However, if I use SaveFile to save a file that contains European accented characters (in the x'80' to x'FF' range), it does indeed save those characters using UTF8 encoding, but it does not include the UTF8 BOM.

If I open the file in Notepad everything works fine, but if I open it using your default LoadFile() method the high range characters are loaded as though they were 2 'low range' ASCII characters. i.e. it screws up the data.
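
To illustrate the symptom described above, here is a minimal sketch of what happens when BOM-less UTF8 bytes are decoded with a single-byte code page. The module name is hypothetical, and code page 28591 (ISO-8859-1) is only a stand-in for "the current ANSI code page"; the actual code page on a given machine may differ.

```vbnet
Imports System
Imports System.Text

Module MisreadDemo
    Sub Main()
        ' U+00E9 (e with acute) is built with ChrW so the source file's own encoding does not matter.
        Dim original As String = "caf" & ChrW(&HE9)

        ' UTF-8 without a BOM encodes U+00E9 as the two bytes C3 A9.
        Dim utf8Bytes As Byte() = New UTF8Encoding(False).GetBytes(original)
        Console.WriteLine(BitConverter.ToString(utf8Bytes))   ' 63-61-66-C3-A9

        ' Assumption: ISO-8859-1 stands in for the machine's ANSI code page.
        Dim singleByte As Encoding = Encoding.GetEncoding(28591)

        ' A single-byte decoder produces one character per byte,
        ' so the accented character comes back as two characters.
        Dim misread As String = singleByte.GetString(utf8Bytes)
        Console.WriteLine(original.Length)   ' 4
        Console.WriteLine(misread.Length)    ' 5
    End Sub
End Module
```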

Since I work with a lot of European customers this is a major issue.

I originally got around the issue by always specifying a UTF8 encoding - which does write the BOM on the file.

However, customers want to be able to open the files on non-Windows systems that do not understand a Windows BOM, so they want the files to be saved as ANSI (i.e. the current code page) unless those files contain characters that are not in the current code page, in which case we would save as UTF8, with a BOM.

This is exactly how the default SaveFile() function in my previous tool worked... and appears to be how Notepad works.

You do not appear to have a version of SaveFile that can do this, so my real question is whether you have a method that can examine the text and tell me if it contains any characters that are not in the current code page, so that I can do this myself.
(I would also ask that you consider changing your default SaveFile() function, or adding an additional overload, so that it works in this 'standard' fashion.)

Comments (4)

Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Mike,

Our SaveFile and LoadFile methods are very simple and pretty much follow the methods that .NET makes available to us for working with files.

The basic SaveFile just does "StreamWriter writer = File.CreateText(path)" and then outputs the text to that writer.  The other overload calls "File.Create(path)" to create the file and then makes a new StreamWriter that specifies the Encoding you pass, forcing it to use that.

Similarly the basic LoadFile uses File.OpenText and reads the string to the end.  The other overload uses File.OpenRead(path) and creates a StreamReader with the Encoding you pass, forcing it to use that.
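
For readers following along, the four code paths described above can be sketched roughly as follows. This is a reconstruction from the description in this post, not Actipro's actual source; the module, method, and parameter names are hypothetical, and line terminator handling is left out.

```vbnet
Imports System.IO
Imports System.Text

' Hypothetical helper module reconstructed from the description; not Actipro's code.
Module FileHelpersSketch
    ' Basic save: File.CreateText returns a StreamWriter for the path.
    Sub SaveText(filePath As String, text As String)
        Using writer As StreamWriter = File.CreateText(filePath)
            writer.Write(text)
        End Using
    End Sub

    ' Save with a caller-supplied encoding.
    Sub SaveText(filePath As String, text As String, enc As Encoding)
        Using stream As FileStream = File.Create(filePath)
            Using writer As New StreamWriter(stream, enc)
                writer.Write(text)
            End Using
        End Using
    End Sub

    ' Basic load: File.OpenText returns a StreamReader for the path.
    Function LoadText(filePath As String) As String
        Using reader As StreamReader = File.OpenText(filePath)
            Return reader.ReadToEnd()
        End Using
    End Function

    ' Load with a caller-supplied encoding.
    Function LoadText(filePath As String, enc As Encoding) As String
        Using stream As FileStream = File.OpenRead(filePath)
            Using reader As New StreamReader(stream, enc)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```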

If you are looking for more complicated logic, you could create it yourself since our logic is just about 3 lines long for each of those methods.  We don't have anything that scans text for adhering to particular encodings since we are only calling the core I/O methods described above for input/output.  I would be interested in seeing whatever logic you come up with for that, as I'm sure other customers could use it.


Actipro Software Support

Posted 4 years ago by Mike Dempsey - Sr. Developer, Teradata Corp

In that case you should change your documentation for the LoadFile() method, since it doesn't really open a UTF8 file (which is what SaveFile creates). It actually EXPECTS to load a UTF8 file, but if it doesn't find a UTF8 BOM it will load the file as though it were an ANSI file, effectively corrupting the data if the file was not saved with ANSI encoding.

The way to handle this is to use the following:

        If IsInCodePage() Then
            Document.SaveFile(fileName, Encoding.Default, LineTerminator.CarriageReturnNewline)
        Else
            Document.SaveFile(fileName, Encoding.UTF8, LineTerminator.CarriageReturnNewline)
        End If

    Private Function IsInCodePage() As Boolean
        'Check to see if the text is all within the current code page.
        'The exception fallback makes GetBytes throw on any character the code page cannot represent.
        Dim strictEncoding As Encoding = Encoding.GetEncoding(Encoding.Default.WindowsCodePage,
                                                              New EncoderExceptionFallback(),
                                                              New DecoderExceptionFallback())
        Try
            strictEncoding.GetBytes(Document.Text)
            Return True
        Catch ex As EncoderFallbackException
            Return False
        End Try
    End Function

This is rather inefficient though if the document is large.

GetBytes() actually performs the encoding, so if the text is entirely within the code page the encoding gets performed twice: once in my code and a second time in your SaveFile method.
That is why this logic would normally be performed by the control's 'Save' method itself.

I realize I could simply open the output file myself and then write the byte array that was returned, but that would mean that the logic used to save the file in this specific case would look very different from when I save it with a user specified encoding. 
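
A minimal sketch of that single-pass alternative, for comparison. The module and method names are hypothetical, and the sketch writes the text as-is, so it does not apply the line terminator conversion that SaveFile performs.

```vbnet
Imports System.IO
Imports System.Text

Module SinglePassSaveSketch
    ' Hypothetical helper: encodes the text once and writes the resulting bytes.
    Sub SaveAnsiOrUtf8(filePath As String, text As String)
        ' Strict variant of the current code page: GetBytes throws instead of substituting "?".
        Dim strictAnsi As Encoding = Encoding.GetEncoding(Encoding.Default.WindowsCodePage,
                                                         New EncoderExceptionFallback(),
                                                         New DecoderExceptionFallback())
        Try
            ' GetBytes runs before WriteAllBytes, so nothing is written if encoding fails.
            File.WriteAllBytes(filePath, strictAnsi.GetBytes(text))
        Catch ex As EncoderFallbackException
            ' Some character is outside the code page: fall back to UTF-8 with a BOM.
            File.WriteAllText(filePath, text, New UTF8Encoding(True))
        End Try
    End Sub
End Module
```

The ANSI case encodes only once here, at the cost of bypassing SaveFile entirely, which is the inconsistency mentioned above.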

Posted 4 years ago by Actipro Software Support - Cleveland, OH, USA

Hi Mike,

The MSDN documentation on File.CreateText(path), which is what we use in the one overload, says:  "Creates or opens a file for writing UTF-8 encoded text."  Based on that, I would have assumed that it would write no UTF8 BOM if only chars 0-127 are used, but would write a BOM if any other characters were used.

Similarly, the MSDN docs say for File.OpenText(path):  "Opens an existing UTF-8 encoded text file for reading." 

From doing more searching on the web and looking at the StreamWriter source in Reflector, it does appear that the File.CreateText(path) documentation is misleading as it only seems to output UTF8 without a BOM.

As for your custom logic above, why not just wrap it up as an extension method for the Document class?  Then it wouldn't look any different to the callers.
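
One way such an extension method might look, as a sketch only. The module and method names are hypothetical; the code relies solely on the Document.Text property and the SaveFile(fileName, encoding, lineTerminator) overload already used earlier in this thread, and it needs an Imports line for whichever Actipro namespace defines Document and LineTerminator in your project.

```vbnet
Imports System.Runtime.CompilerServices
Imports System.Text
' Assumption: also import the Actipro namespace that defines Document and LineTerminator.

Public Module DocumentSaveExtensions
    ' Hypothetical extension: saves as ANSI when possible, otherwise as UTF-8.
    <Extension()>
    Public Sub SaveFileAnsiOrUtf8(doc As Document, fileName As String)
        ' Strict variant of the current code page: GetBytes throws on unrepresentable characters.
        Dim strictAnsi As Encoding = Encoding.GetEncoding(Encoding.Default.WindowsCodePage,
                                                         New EncoderExceptionFallback(),
                                                         New DecoderExceptionFallback())
        Dim target As Encoding = Encoding.Default
        Try
            strictAnsi.GetBytes(doc.Text)
        Catch ex As EncoderFallbackException
            ' Encoding.UTF8 supplies a BOM preamble to the writer.
            target = Encoding.UTF8
        End Try
        doc.SaveFile(fileName, target, LineTerminator.CarriageReturnNewline)
    End Sub
End Module
```

Callers would then write something like `Document.SaveFileAnsiOrUtf8(fileName)`, which looks the same as any other save call.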


Actipro Software Support

Posted 4 years ago by Mike Dempsey - Sr. Developer, Teradata Corp

You are correct. The MSDN info is also misleading.

As you say, CreateText() writes UTF8, but without a BOM,
and OpenText() reads UTF8, but only if it starts with a UTF8 BOM.

That means that OpenText() cannot [always] correctly open a file that was saved using CreateText().
Which I think is a bug ... but getting Microsoft to fix something is like banging your head against a brick wall.
(I know this from trying for years to get them to fix some of the many bugs in their Data Provider for ODBC )
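
Anyone who wants to check this behavior on their own .NET version could run a small round-trip test along these lines. The module name and sample text are hypothetical, and no expected output is claimed here since the result is exactly what is in question.

```vbnet
Imports System
Imports System.IO

Module RoundTripCheck
    Sub Main()
        Dim filePath As String = Path.GetTempFileName()
        ' U+00E9 is built with ChrW so the source file's own encoding does not matter.
        Dim original As String = "caf" & ChrW(&HE9)
        Try
            ' Write with the same call the default SaveFile is said to use.
            Using writer As StreamWriter = File.CreateText(filePath)
                writer.Write(original)
            End Using

            ' Inspect the raw bytes for the UTF-8 BOM (EF BB BF).
            Dim raw As Byte() = File.ReadAllBytes(filePath)
            Dim hasBom As Boolean = raw.Length >= 3 AndAlso
                                    raw(0) = &HEF AndAlso raw(1) = &HBB AndAlso raw(2) = &HBF
            Console.WriteLine("UTF-8 BOM present: " & hasBom.ToString())

            ' Read back with the same call the default LoadFile is said to use.
            Dim loaded As String
            Using reader As StreamReader = File.OpenText(filePath)
                loaded = reader.ReadToEnd()
            End Using
            Console.WriteLine("Round trip intact: " & (loaded = original).ToString())
        Finally
            File.Delete(filePath)
        End Try
    End Sub
End Module
```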

It would be helpful if you could change your docs to describe what really happens rather than what Microsoft implies in their own docs.

I do use an extension class already so I guess I could look at adding my own method to take care of the issue.

But if you sell your app to a lot of folks in Canada or Europe I suspect many of them will hit this issue also, so it would be very useful if you changed the way that specific SaveFile() method works so that it will work the way people expect a Save/Load pair to work.

The latest build of this product (v2018.1 build 0341) was released 6 months ago, which was after the last post in this thread.
