[personal profile] mapsedge
Is there a regular expression that would allow me to replace any character at ASCII 128 or above in a string with its HTML entity equivalent, as in:

replace in {string}:  ascii-X with &#X; where X between 128 and 255

It would save a nested loop if I could figure out how to do it in one statement.
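
For what it's worth, a regex substitution with a callback can do this in one statement. Here is a minimal sketch in Python. The question doesn't name a language, so the language and the function name `high_chars_to_entities` are assumptions, and it assumes the input is an already-decoded string in which each character's code point is the number wanted in the entity:

```python
import re


def high_chars_to_entities(text: str) -> str:
    """Replace each character with code point 128-255 by a numeric HTML entity."""
    # The character class matches exactly one character in the range 128-255;
    # the callback builds "&#N;" from that character's code point.
    return re.sub(r"[\x80-\xff]", lambda m: f"&#{ord(m.group())};", text)


if __name__ == "__main__":
    print(high_chars_to_entities("caf\xe9 \xa9 2008"))  # caf&#233; &#169; 2008
```

The regex engine still visits every character internally, but the explicit nested loop disappears from the calling code.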

Date: 2008-08-14 15:48 (UTC)
From: [identity profile] akaashben.livejournal.com
I don't see how to do it without a loop, since you have to examine each character in turn to determine its ASCII value.

Even if you could convert the entire string to its component ASCII values before looking at it, you would still have to break it up into 3-digit chunks and examine each one before converting it back. Running a regex search on 3-digit chunks of that would very likely mess it all up.

Date: 2008-08-14 16:02 (UTC)
From: [identity profile] joegoda.livejournal.com
Here's where I show my ignorance... but are you looking for something like this?

ascii_to_entities(str [string])

Converts ASCII to Entities

Class: Regex (REGX)

Description: Returns str after converting higher ASCII values into HTML entities where possible. Only use when the auto_convert_high_ascii config file preference is set to yes (i.e. $PREFS->ini('auto_convert_high_ascii') == 'y').

Date: 2008-08-14 16:32 (UTC)
From: [identity profile] rowangolightly.livejournal.com
Y'all are speakin' that foreign language again....

Date: 2008-08-15 16:55 (UTC)
From: [identity profile] billthetailor.livejournal.com
Well, as with any foreign language, it's best to start with a few basic phrases, such as:

010010000110010101101100011011000110111100101110

translation: Hello.

01010111011010000110010101110010011001010010000001101001011100110010000001110100011010000110010100100000011000100110000101110100011010000111001001101111011011110110110100111111

translation: Where is the bathroom?

0101011101101000011001010111001001100101001000000110100101110011001000000111010001101000011001010010000001101110011001010110000101110010011001010111001101110100001000000111001001100101011100110111010001100001011101010111001001100001011011100111010000111111

translation: where is the nearest restaurant?

I hope this handy guide will help you the next time you visit, and please enjoy your stay! :)

Date: 2008-08-15 19:50 (UTC)
From: [identity profile] rowangolightly.livejournal.com
::runs away to hide in my sewing room where it's safe::

Date: 2008-08-14 18:00 (UTC)
From: [identity profile] jehosefatz.livejournal.com
Dunno. I'd probably shoot for something like:

$source =~ tr/[\xxx]/rvalue/g;

(in Java or Perl) Where...

tr = translate command
\xxx = the octal value for the thing you're looking for
rvalue = the replacement string
g = do it globally in the source string

Technically not a regex, but similar.

The downsides are that you have to know the octal for what you're looking for and you have to have one of those translations for each potential target.
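
One way around the "one translation per target" problem is to build the table programmatically. Perl's tr works character-for-character, so the sketch below uses Python's `str.translate` instead, which accepts a table mapping code points to replacement strings. The names `ENTITY_TABLE` and `translate_to_entities` are made up for illustration, and this is a sketch of the idea rather than a drop-in for the line above:

```python
# Hypothetical lookup table: each code point 128-255 maps to its numeric entity.
ENTITY_TABLE = {cp: f"&#{cp};" for cp in range(128, 256)}


def translate_to_entities(text: str) -> str:
    """Apply the table in a single pass; characters not in the table pass through."""
    return text.translate(ENTITY_TABLE)


if __name__ == "__main__":
    print(translate_to_entities("na\xefve"))  # na&#239;ve
```

Like the tr idea, this is not a regex, and nobody has to look up octal values by hand.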

The other downside is that high-order ASCII (128-255) is OS/hardware implementation dependent, and several things that are handled in some implementations are actually only universally available in Unicode (UTF-8, UTF-16, and the like). The various ISO charsets (ISO-8859-1 Latin, for example) are spotty in their implementations. An example would be the trademark character (the superscript TM) -- it's a 2-byte character in UTF-16, code point 8482 decimal (U+2122). In that case, character scanning would only work if your OS/language/source material was Unicode encoded so that each "character" would be a multi-byte entity.

- Jeho
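
The trademark example can be checked directly. The sketch below, in Python, assumes the text has already been decoded to a Unicode string; `TM` and `non_ascii_to_entities` are illustrative names, and Windows-1252 is used only as one example of a single-byte charset that happens to include the character:

```python
import re

TM = "\u2122"  # the trademark sign

# Code point: 8482 in decimal, 0x2122 in hex.
print(ord(TM), hex(ord(TM)))

# The byte count depends on the encoding: three bytes in UTF-8, two in UTF-16.
print(TM.encode("utf-8"))
print(TM.encode("utf-16-be"))

# In Windows-1252 the same character is the single byte 0x99, so scanning raw
# bytes for values 128-255 would emit &#153; instead of &#8482;.
print(b"\x99".decode("cp1252") == TM)


def non_ascii_to_entities(text: str) -> str:
    """Replace every character above code point 127, not only 128-255."""
    return re.sub(r"[^\x00-\x7f]", lambda m: f"&#{ord(m.group())};", text)


print(non_ascii_to_entities("Acme" + TM))  # Acme&#8482;
```

Decoding the bytes with the right charset first, then matching on code points, sidesteps the question of which characters live where in the 128-255 range.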
