Text file

Source: Wikipedia, the free encyclopedia.
Text file
Filename extension
.txt
Internet media type
text/plain
Generic container format

A text file (sometimes spelled textfile; an old alternative name is flatfile) is a kind of

delimiters
and will primarily store text files with lines separated as fixed or variable length records.

"Text file" refers to a type of container, while plain text refers to a type of content.

At a generic level of description, there are two kinds of computer files: text files and binary files.[1]

Data storage

A stylized iconic depiction of a CSV-formatted text file

Because of their simplicity, text files are commonly used for storage of information. They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents. A disadvantage of text files is that they usually have a low entropy, meaning that the information occupies more storage than is strictly necessary.

A simple text file may need no additional

character set) to assist the reader in interpretation. A text file may contain no data at all, which is a case of zero-byte file
.

Encoding

The

ISO-8859-16) for European languages and wide character
encodings for Asian languages.

Because encodings necessarily have only a limited repertoire of characters, often very small, many are only usable to represent text in a limited subset of human languages. Unicode is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is UTF-8, which has the advantage of being backwards-compatible with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning. UTF-8 also has the advantage that it is easily auto-detectable. Thus, a common operating mode of UTF-8 capable software, when opening files of unknown encoding, is to try UTF-8 first and fall back to a locale dependent legacy encoding when it definitely is not UTF-8.

Formats

On most operating systems, the name text file refers to a file format that allows only

text terminals or in simple text editors. Text files usually have the MIME
type text/plain, usually with additional information indicating an encoding.

Microsoft Windows text files

DOS and

Notepad
) do not automatically insert one on the last line.

On Microsoft Windows operating systems, a file is regarded as a text file if the suffix of the name of the file (the "filename extension") is .txt. However, many other suffixes are used for text files with specific purposes. For example, source code for computer programs is usually kept in text files that have file name suffixes indicating the programming language in which the source is written.

Most Microsoft Windows text files use ANSI, OEM, Unicode or UTF-8 encoding. What Microsoft Windows terminology calls "ANSI encodings" are usually single-byte

line-drawing characters common in DOS applications. "Unicode"-encoded Microsoft Windows text files contain text in UTF-16 Unicode Transformation Format. Such files normally begin with byte order mark (BOM), which communicates the endianness of the file content. Although UTF-8 does not suffer from endianness problems, many Microsoft Windows programs (i.e. Notepad) prepend the contents of UTF-8-encoded files with BOM,[2] to differentiate UTF-8 encoding from other 8-bit encodings.[3]

Unix text files

On Unix-like operating systems, text files format is precisely described: POSIX defines a text file as a file that contains characters organized into zero or more lines,[4] where lines are sequences of zero or more non-newline characters plus a terminating newline character,[5] normally LF.

Additionally, POSIX defines a printable file as a text file whose characters are printable or space or backspace according to regional rules. This excludes most control characters, which are not printable.[6]

Apple Macintosh text files

Prior to the advent of macOS, the classic Mac OS system regarded the content of a file (the data fork) to be a text file when its resource fork indicated that the type of the file was "TEXT".[7] Lines of classic Mac OS text files are terminated with CR characters.[8]

Being a Unix-like system, macOS uses Unix format for text files.[8] Uniform Type Identifier (UTI) used for text files in macOS is "public.plain-text"; additional, more specific UTIs are: "public.utf8-plain-text" for utf-8-encoded text, "public.utf16-external-plain-text" and "public.utf16-plain-text" for utf-16-encoded text and "com.apple.traditional-mac-plain-text" for classic Mac OS text files.[7]

Rendering

When opened by a text editor, human-readable content is presented to the user. This often consists of the file's plain text visible to the user. Depending on the application, control codes may be rendered either as literal instructions acted upon by the editor, or as visible escape characters that can be edited as plain text. Though there may be plain text in a text file, control characters within the file (especially the end-of-file character) can render the plain text unseen by a particular method.

See also

Notes and references

  1. .
  2. ^ "Using Byte Order Marks". Internationalization for Windows Applications. Microsoft. Jan 7, 2021. Archived from the original on Feb 21, 2023. Retrieved 2022-04-21.
  3. ^ Freytag, Asmus (2015-12-18). "FAQ – UTF-8, UTF-16, UTF-32 & BOM". The Unicode Consortium. Retrieved 2016-05-30. Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" of at the beginning of Unix shell scripts.
  4. IEEE Computer Society
    . Retrieved 2019-03-01.
  5. IEEE Computer Society
    . Retrieved 2015-12-15.
  6. IEEE Computer Society
    . Retrieved 2015-12-15.
  7. ^ a b "System-Declared Uniform Type Identifiers". Guides and Sample Code. Apple Inc. 2009-11-17. Retrieved 2016-09-12.
  8. ^ a b "Designing Scripts for Cross-Platform Deployment". Mac Developer Library. Apple Inc. 2014-03-10. Retrieved 2016-09-12.

External links