Mohammedz.com

For Linux and Shell scripting.

How to remove non-printable/control characters from a file?

Files sometimes contain non-printable (control) characters that cause problems. You can see such characters if you open the file in an editor like vi. Even though “cat” usually shows nothing visible for them on the console, it still passes them through unchanged, so you can’t remove them by redirecting “cat” output to a different file.
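
A quick way to check this is sketched below, assuming a throwaway sample file (sample.txt and copy.txt are made-up names, not from the original post). The copy produced by redirecting cat is byte-for-byte identical to the original, control characters included.

```shell
# Create a small sample file containing a form feed (\f) and a bell (\a).
# The directory and file names are invented for this illustration.
workdir=$(mktemp -d)
printf 'page one\fpage two\nring\abell\n' > "$workdir/sample.txt"

# Redirecting cat copies every byte, control characters included.
cat "$workdir/sample.txt" > "$workdir/copy.txt"

# cmp -s exits with status 0 when the two files are identical.
if cmp -s "$workdir/sample.txt" "$workdir/copy.txt"; then
    echo "copy is identical: the control characters are still there"
fi

# od -c prints each byte, which makes the \f and \a visible.
od -c "$workdir/copy.txt"
```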

Here is a way to remove non-printable characters with a combination of sed and tr commands.

Step 1:
Use sed with the “l” (lower case L) command to print each line in a “visually unambiguous” form. In the sed output, find the notation of the character that needs to be removed (a form feed, for example, appears as \f).

# sed -n 'l' filename.txt
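
As an illustration of what that output looks like, here is a minimal sketch that assumes a temporary sample file created on the spot (it is not part of the original post). sed writes the form feed as \f and marks the end of each line with $.

```shell
# Build a sample file with one form feed in the first line.
sample=$(mktemp)
printf 'page one\fpage two\nplain line\n' > "$sample"

# -n suppresses the normal output; the l command prints each line unambiguously.
listing=$(sed -n 'l' "$sample")
printf '%s\n' "$listing"
# page one\fpage two$
# plain line$
```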

Step 2:
Remove the control characters that you found in the sed output using “tr” or “sed”.

Suppose you want to remove every line that contains the form feed character “\f” from filename.txt. Use the command given below (the \f escape inside a pattern is understood by GNU sed).

# sed '/\f/d' filename.txt

You can use any of the commands given below if you want to remove only the control characters, but not the entire line containing them. Note that tr reads standard input, so the file is passed with “<”.

# tr -d '\f' < filename.txt
or
# sed 's/\f//g' filename.txt

To replace each form feed with a space instead of deleting it:

# tr '\f' ' ' < filename.txt
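
Both tr and sed write their result to standard output and leave the input file untouched. The sketch below saves the cleaned text to new files, assuming a sample input built on the spot (sample.txt, cleaned.txt and without_lines.txt are hypothetical names). It passes a literal form feed to sed, on the assumption that not every sed understands \f inside a pattern.

```shell
workdir=$(mktemp -d)
printf 'keep me\nform\ffeed here\nkeep me too\n' > "$workdir/sample.txt"

# Delete only the form feed characters; tr reads standard input, hence the "<".
tr -d '\f' < "$workdir/sample.txt" > "$workdir/cleaned.txt"

# Drop every line that contains a form feed.
# ff holds a literal form feed character, which matches itself in the pattern.
ff=$(printf '\f')
sed "/$ff/d" "$workdir/sample.txt" > "$workdir/without_lines.txt"

cat "$workdir/cleaned.txt"
cat "$workdir/without_lines.txt"
```

Writing to a separate file and checking it first avoids overwriting the only copy of the data.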

The control characters in ASCII still in common use include:

* 0 (null, \0, ^@), originally intended to be an ignored character, but now used by many programming languages to mark the end of a string.
* 7 (bell, \a, ^G), which may cause the device receiving it to emit a warning of some kind (usually audible).
* 8 (backspace, \b, ^H), used either to erase the last character printed or to overprint it.
* 9 (horizontal tab, \t, ^I), moves the printing position some spaces to the right.
* 10 (line feed, \n, ^J), used as the end-of-line marker in most UNIX systems and variants.
* 12 (form feed, \f, ^L), to cause a printer to eject paper to the top of the next page, or a video terminal to clear the screen.
* 13 (carriage return, \r, ^M), used as the end-of-line marker in classic Mac OS, OS-9, FLEX (and variants). A carriage return/line feed pair is used by CP/M-80 and its derivatives including DOS and Windows, and by Application Layer protocols such as HTTP.
* 27 (escape, \e [GCC only], ^[).
* 127 (delete, ^?), originally intended to be an ignored character, but now used to erase a character (especially the one to the right of the cursor).
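
Given the codes above, one possible way to strip every ASCII control character except tab (9) and line feed (10) in a single pass is to hand tr the octal ranges. This is only a sketch and not a command from the original post; the sample data and file names are made up.

```shell
workdir=$(mktemp -d)
# Sample line containing bell (7), backspace (8), tab (9), form feed (12),
# carriage return (13), escape (27) and delete (127), written as octal escapes.
printf 'a\007b\010c\td\014e\015f\033g\177h\n' > "$workdir/dirty.txt"

# Octal ranges: 000-010 covers null through backspace, 013-037 covers
# vertical tab through code 31, and 177 is delete.
# Tab (011) and line feed (012) are deliberately left out so they survive.
tr -d '\000-\010\013-\037\177' < "$workdir/dirty.txt" > "$workdir/clean.txt"

cat "$workdir/clean.txt"
```

Adjust the ranges if you also need to keep carriage returns, for example in files with DOS line endings.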

Read more about control characters at Wikipedia.

~mohammed

3 thoughts on “How to remove non-printable/control characters from a file?”

  1. Helpful and very nicely written.

  2. Brilliant piece of information. Gave everything I needed to know in a very succinct form. Thanks.

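  3. To filter out the lines of a text file that contain control characters, use this:
    grep -v '[[:cntrl:]]' somefile
    or, to strip only the control characters and keep the rest of each line:
    sed 's/[[:cntrl:]]//g' somefile
    Note that [[:cntrl:]] also matches tabs.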
