Python Forum
unicode and utf-8
#1
Hey guys!
I don't have much experience with Python and I need some clarifications.

What I want to do?
I want to keep some data in Unicode format. The data is sent to a third party and stored there.

From what I've read "Python often defaults to using it" (https://docs.python.org/3/howto/unicode.html)

Given the use case:
I receive the data from multiple databases; some of it arrives as bytes and some as strings (str).
This is what I do:
data = received_data.decode("utf-8")
Is there anything to do to convert the received data to Unicode?

Thanks a lot!
Sorin
#2
(Jan-13-2026, 04:44 PM)sorynturda Wrote: Is there anything to do to convert the received data to Unicode?
The line that you wrote converts the data to a Python Unicode string (str), provided the data was previously encoded with the UTF-8 encoding in the database. Unicode is not a format; Unicode is a universal alphabet that contains all the letters of all the languages in the world and more. A Unicode string is a sequence of symbols from this gigantic alphabet. If you want to store Unicode data in a file, it must be encoded in some way.
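A minimal sketch of that round trip may help. The bytes below are a made-up value standing in for something read from a database, and the name received_data is only borrowed from the question:

```python
# Hypothetical bytes, standing in for a value read from a database.
received_data = b"caf\xc3\xa9"  # the word "café" encoded as UTF-8

# bytes -> str: the result is a sequence of Unicode code points.
text = received_data.decode("utf-8")
print(type(text).__name__, len(text))

# str -> bytes: needed again before the text is stored or sent.
round_trip = text.encode("utf-8")

# Decoding with the wrong codec does not always raise an error.
# Here it silently produces different (wrong) text.
mojibake = received_data.decode("iso-8859-1")
print(repr(text), repr(mojibake))
```

The last two lines show why the "provided the data was encoded with UTF-8" condition matters: a wrong guess can go unnoticed.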

So what does your question mean exactly?

I must add that in order to send unicode data through a network, it must also be encoded, but this is often hidden from the client code by the software components that implement network communications.
« We can solve any problem by introducing an extra level of indirection »
#3
I get the data from MongoDB using pymongo. In the code it is then converted to a dict of strings, so every value has the str data type ONLY.
After that, the data is processed and sent to an external third party, and the data sent must be in unicode format.

And here I have doubts. Does the data need to be processed in order to be in unicode format, or is it already in unicode format?
The same goes for the other data, which comes from LDAP as bytes. I just decode it in order to process it (add some elements to a list, remove some, et cetera).

Quote:I must add that in order to send unicode data through a network, it must also be encoded, but this is often hidden from the client code by the software components that implement network communications.
Yes, basically this is what happens in the code. Data is sent via HTTP or TCP (LDAP) to third parties.
#4
(Jan-14-2026, 07:39 AM)sorynturda Wrote: Does the data need to be processed in order to be in unicode format, or is it already in unicode format?
Again, UNICODE IS NOT A FORMAT, there is no way to format anything in "unicode format". UTF-8 or UTF-16 are formats to encode unicode characters. ISO-8859-1 or euc_jisx0213 are encoding formats that encode some ranges of unicode characters.
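As a small illustration of the difference between the alphabet and its encodings, here is one character turned into bytes by three codecs. The variable names are only for this sketch:

```python
char = "é"  # a single code point, U+00E9

# The same abstract character becomes different bytes under each encoding.
encodings = ["utf-8", "utf-16-le", "iso-8859-1"]
encoded = {name: char.encode(name) for name in encodings}
for name, data in encoded.items():
    print(name, data)

# ISO-8859-1 only covers code points U+0000 to U+00FF, so a character
# outside that range cannot be encoded with it.
try:
    "日".encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print("ISO-8859-1 can encode U+65E5:", latin1_ok)
```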

Starting with Python 3.0, the Python type str is an implementation of abstract unicode strings. Conceptually, a str instance is a sequence of unicode code points. If you want to write it to a file or send it through a socket, you need to choose an encoding because files and sockets don't accept abstract unicode characters, they accept only bytes (but again this may be hidden by software components).
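For the use case described in this thread (str values from one source, bytes from another), one possible approach is to normalize everything to str on input and encode once on output. This is only a sketch: to_text is a hypothetical helper, the record is invented, and it assumes both the incoming bytes and the third party use UTF-8.

```python
import json


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper; the UTF-8 default is an assumption about the source.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    if isinstance(value, str):
        return value  # already a sequence of Unicode code points
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


# Invented values: a str as a database driver might return it,
# and bytes as a directory lookup might return them.
record = {
    "name": to_text("Sorin"),
    "city": to_text(b"Bra\xc8\x99ov"),
}

# Encode only at the boundary, when the data leaves the program.
payload = json.dumps(record, ensure_ascii=False).encode("utf-8")
print(record)
print(payload)
```

Inside the program everything is str, so nothing more has to be done to make it "Unicode". The encoding choice only appears where bytes come in or go out, and many HTTP or LDAP libraries make that last step for you.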
« We can solve any problem by introducing an extra level of indirection »
#5
Watch some YouTube videos about Unicode and utf8.

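Actually, CPython stores str internally with 1, 2, or 4 bytes per code point, depending on the widest character in the string (PEP 393, since Python 3.3), and can cache a UTF-8 copy on demand, but this is an implementation detail.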


If you work with text files, the encoding is mostly UTF-8. Sometimes there are old text files with a completely different encoding.
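A rough sketch of dealing with such an old file: try UTF-8 first and fall back to a legacy codec. The function read_text_with_fallback and the choice of cp1252 as the fallback are assumptions for this example. Guessing an encoding is a heuristic; knowing the real encoding of the file is always better.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Decode a file with the first codec in `encodings` that succeeds.

    Hypothetical helper; returns the text and the name of the codec used.
    """
    data = Path(path).read_bytes()
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    # Simulate an old file: these bytes are valid cp1252 but invalid UTF-8.
    legacy.write_bytes("café".encode("cp1252"))
    text, used = read_text_with_fallback(legacy)

print(repr(text), used)
```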
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!