Original Reddit post

This report organizes only the facts observable by the user about the process presented as "image editing" in the ChatGPT application. The conclusion is clear: this process does not perform localized edits on the original image the user uploaded. The process actually invoked is image_gen.text2im. On the return side, DALL-E generation metadata is displayed, and even when edit_op: "inpainting" appears, the output is not a localized edit but a full-frame regeneration. Moreover, at an earlier stage, the original image file itself is never transmitted, retained, or referenced in its original form. The "image editing" observed in this chat is therefore not editing of the original image; it is text-to-image full-frame regeneration that uses a reduced, converted derivative image as reference input.

In summary:

- The original image file uploaded by the user is not processed as-is. At the upload stage, ChatGPT handles a reduced, converted derivative image distinct from the original.
- The tool invoked during image processing is image_gen.text2im.
- Every returned result displays DALL-E generation metadata.
- Even when edit_op: "inpainting" is displayed, the actual output is not a localized edit but a full-frame regeneration.
- Even when the correction area is explicitly specified, the process proceeds on the premise of a mask, and "inpainting" is displayed, the entire image still changes at the pixel level, including areas outside the specified region.
- The hash of the output image is entirely different from that of the original.

So this is not "image editing," nor is it editing based on the original image. It is image_gen.text2im / T2I full-frame regeneration using a reduced, converted derivative image as input.

The original image file itself is not transmitted as-is. The upload feature is described as accepting images of up to 20 MB, but actual network monitoring showed that even when a large image was selected and uploaded, only about 300 KB of data was transferred. This is decisive: if a 20 MB-class, or even a several-megabyte, original file were being sent to the server as-is, a corresponding amount of network traffic would have to occur. Since only about 300 KB is transmitted, the original file is not being sent as-is. At this point the premise that "the original image is uploaded as-is and that original is then edited" collapses: the original image and the image handled on ChatGPT's side are different objects.

The original image on the user's side was:

- Filename: 1000045047_x4_drawing.png
- Format: PNG
- Resolution: 2048 × 2048
- Size: 5.58 MB
- SHA-1: 69ba09b9718bc43947e0f6510bab65319e3e0a42
- SHA-256: 2d6a15d7deb517c5e8885512ec73d79bd2535d5d5311a8e76a793fed391ec114

By contrast, the image accessible to the assistant within this conversation was:

- Format: JPEG
- Resolution: 1536 × 1536
- Size: 420,655 bytes
- SHA-1: deff635b673de90cbadf603ce81c548cb2a805a9
- SHA-256: 0239d63859547149e61e5c987897291713593da222a63f7f0635e3bc0bce4d53

The format, resolution, file size, and hashes all fail to match. What the assistant and the image-processing side reference is not the user's original image file; it is a reduced, converted derivative created at the upload stage or during internal expansion. The explanation that the image is "temporarily compressed for transmission and later restored to the original" is untenable.
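For anyone who wants to reproduce the file-level comparison, it takes only a few lines. A minimal sketch in Python (assuming Pillow is installed; the second path is a placeholder for however the chat-side copy of the image is exported):

```python
import hashlib

from PIL import Image  # assumes Pillow is installed


def describe(path):
    """Print the container format, pixel dimensions, byte size, and hashes of one file."""
    with open(path, "rb") as f:
        data = f.read()
    img = Image.open(path)
    print(path)
    print("  format:    ", img.format)
    print("  resolution:", f"{img.width} x {img.height}")
    print("  size:      ", f"{len(data):,} bytes")
    print("  SHA-1:     ", hashlib.sha1(data).hexdigest())
    print("  SHA-256:   ", hashlib.sha256(data).hexdigest())


describe("1000045047_x4_drawing.png")  # the original upload
describe("chat_side_copy.jpeg")        # placeholder: however you export the chat-side image
```

Any claim that the original is "restored" at some later stage stands or falls on exactly this kind of comparison: a restored file would have to reproduce the original's hashes.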
It is not credible that an image of 20 MB, or even several megabytes, is reduced to roughly 300 KB for transmission and later perfectly restored for use as the original. For that explanation to hold, all of the following would be necessary:

- The original image must be losslessly recoverable from the transmitted data.
- The restored image must contain pixels identical to those of the original.
- Its hashes must match those of the original image.

In reality, the image accessible to the assistant matches the original in none of format, resolution, file size, or hash. So this is not "temporary compression." The original is neither sent as-is nor restored later; a derivative image is created, and that derivative becomes the object of processing.

There is also no indication that the original image file is reacquired or re-expanded during image editing. One might argue that even if only a lightweight derivative is sent at upload time, the system later retrieves the original file, or equivalent original-quality data, during the editing operation and processes it at high quality. That argument fails as well. When image editing was actually executed:

- The tool invoked was image_gen.text2im.
- The returned image was approximately two megapixels.
- No increase in network traffic corresponding to an image file of that size was observed before or after the operation; only lightweight control or text-output traffic appeared.
- The image downloaded after generation was likewise approximately two megapixels.

If the original file were being reacquired or re-expanded during editing, network traffic corresponding to its size should have appeared. It did not. The original image file is therefore not used even at the editing stage; what is used is the derivative image handled within the chat.

The invoked tool is image_gen.text2im, not an image-editing tool. Although the feature is used as image editing, the tool the assistant actually invoked was image_gen.text2im, the name of a text-to-image process. At least according to the execution information observable by the user, the invoked process is not "image editing" but "text-to-image." This point is critical: if the operation were localized editing or inpainting, the process name or process structure should correspond to that function. In reality, the invoked process is text2im.

Every returned result displays DALL-E generation metadata. Examining the images returned as generation results in this chat, DALL-E generation metadata was displayed in all 16 of the 16 confirmed cases. Although the feature is used in the context of GPT Images / ChatGPT Images 2.0 image editing within the ChatGPT application, the returned metadata is always DALL-E generation metadata. The point is not speculation about whether DALL·E is truly operating internally; the observable fact is that the metadata visible to the user is consistently DALL-E generation metadata. The displayed context and the returned metadata are not aligned.

The process is invoked as text2im, returned as inpainting, and produces full-frame regeneration. In some returned metadata, edit_op: "inpainting" was displayed, yet the tool actually invoked was image_gen.text2im.
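As an aside on how this kind of metadata can be read back: where generation metadata ends up depends on the export path, but two generic ways to dump whatever a downloaded file carries are Pillow's info dictionary (PNG text chunks) and the external exiftool utility (XMP / provenance-style fields). A minimal sketch with a placeholder filename; nothing here presumes what the fields mean, it only shows how a user inspects what the file itself contains:

```python
import json
import subprocess

from PIL import Image

path = "downloaded_output.png"  # placeholder: an image returned in the chat

# Text chunks and other container info that Pillow exposes directly.
img = Image.open(path)
print(json.dumps({k: str(v)[:200] for k, v in img.info.items()}, indent=2))

# If exiftool is installed, dump every embedded tag with its group name;
# XMP / provenance-style fields typically show up here.
result = subprocess.run(["exiftool", "-a", "-G1", path], capture_output=True, text=True)
print(result.stdout)
```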
Putting these observations together, the correspondence is:

- Invoked process name: image_gen.text2im
- Returned metadata: edit_op: "inpainting"
- Actual output: full-frame regeneration

This is fundamentally inconsistent. A process invoked as text-to-image is labeled on return as inpainting, while the output is not a localized edit but an image whose entire frame has changed at the pixel level. The process name, the returned metadata, and the actual result do not agree. At least in this observation, this is not inpainting in the sense the user expects.

The correction area was explicitly specified. The problem is not that "the user gave vague instructions." Across multiple attempts, the user clearly specified:

- which area should be corrected and which areas should be preserved
- only the lower body; only from the waist downward
- preserve the face, hair, upper body, and background
- preserve the clothing
- do not alter anything outside the specified area
- use a mask; proceed on the premise of inpainting

The target area was not ambiguous, and the premise of localized editing and inpainting was stated clearly. Even so, the results changed regions far beyond the specified area. This problem did not occur because the correction area had not been specified.

The entire image, including unspecified regions, changes at the pixel level. This is the most serious practical harm. When the original and output images are compared, not only the specified region but the entire frame has changed at the pixel level, including areas outside the specified region. Among the elements that changed: background, hair, face, outfit, contours, coloring, ornaments, the shape of shadows, composition, legs, and shoes. This is not slight bleed-over around the edited area; the entire image has been reconstructed. In localized editing, the majority of the unspecified regions should preserve the original pixels, or at least a structure very close to them. That is not what occurred here, so this is not localized editing.

The hash of the output image is also entirely different. The original and output images differ not only visually; they lack continuity as files, and the output hash is completely different from that of the original. This matters because, if localized editing were replacing only a portion of the image while preserving most of the original, one would expect at least some continuity with the original as the basis of the edit. In reality, all three of the following hold: the entire image changes at the pixel level, unspecified regions change comprehensively, and the output hash is entirely different. So this is not "the result of partially editing the original image"; it is a newly generated image created with reference to the original.

The resolution is not consistent either. The original was uploaded at 2048 × 2048, yet the processed and returned images are handled at around two megapixels, or after conversion to some other resolution. The resolution of the input image does not match the resolution of the processing target or of the returned image. This is not the behavior of localized editing: rather than using the original itself as the base for partial edits, the system appears to transfer the image into a different resolution regime and reconstruct it there.
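The claim that unspecified regions change is also quantifiable rather than a visual impression. A minimal sketch (Python with Pillow and NumPy; the file names are placeholders, and the mask is a hypothetical black-and-white image marking the region the user asked to edit) that measures what fraction of pixels changed outside that region:

```python
import numpy as np
from PIL import Image

orig = Image.open("1000045047_x4_drawing.png").convert("RGB")
out = Image.open("returned_output.png").convert("RGB")  # placeholder filename

# The returned image is at a different resolution, so it has to be resampled
# just to make a per-pixel comparison possible at all.
out = out.resize(orig.size)

a = np.asarray(orig, dtype=np.int16)
b = np.asarray(out, dtype=np.int16)
changed = np.abs(a - b).max(axis=2) > 8  # per-pixel change above a small tolerance

# Hypothetical mask marking the region the user asked to edit (white = editable).
mask_img = Image.open("edit_region_mask.png").convert("L").resize(orig.size)
mask = np.asarray(mask_img) > 127

outside = changed & ~mask
print(f"pixels changed overall:      {100 * changed.mean():.1f}%")
print(f"pixels changed outside mask: {100 * outside.mean():.1f}%")
```

Note that the comparison is only possible after resampling, because, as described above, the returned image is not even at the original resolution.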
At minimum, therefore, this process is not "editing the original image itself."

Aspect-ratio and canvas specifications do not function as independent factors. Ordinarily, the conditions passed to an image engine should include structured parameters handled separately from the prompt text. At minimum, the following should be treated as independent factors (for what such structured inputs look like in a public API, see the sketch further below):

- aspect ratio
- canvas size
- reference image
- image to be edited
- mask or target editing area
- style-preservation conditions

In practice, the conditions specified by the user do not operate rigorously as independent factors. Aspect-ratio specifications are not reliably obeyed, canvas conditions are not passed through as-is, and the editing area is not fixed. This is because conditions that ought to be handled as independent control factors are instead folded into the prompt text, and even that text is summarized or compressed. As a result, size, ratio, editing range, preservation conditions, and style conditions are dropped, weakened, or entangled. This input design is broken.

The user input, the assistant-created prompt, the tool call, and the prompt in the returned metadata do not match. Even when the user explicitly sends text and states "treat this as the prompt," that text is not necessarily used as the actual input to the image engine. The assistant translates it into English, adds supplementary details, appends conditions, and sends a different text to the tool. In some cases, the returned metadata even shows prompt: "" as an empty field. At least within the range observable by the user, the following do not match:

- the user's input text
- the prompt text created by the assistant
- the prompt used in the image-tool call
- the prompt shown in the returned metadata

Under these conditions, the user cannot verify what was actually supplied to the image engine. Reproducibility and transparency are not achieved.

The actual result is not "correction" but a full reinterpretation each time. Even when localized corrections are requested for fingers, the face, the lower body, or similar elements, parts that were not specified are reinterpreted on every attempt. Typically affected were the direction the face points, hair color, ribbons, clothing, background density, the structure of the painted planes, leg structure, and shoe shape. The workflow is not "preserve what has already been fixed, then correct only what remains." Instead, the entire image is reinterpreted each time, and even previously corrected parts regress. This is not image editing; it is the behavior of regeneration.

Fragmented, mosaic-like coloring arises not as a failure of localized editing but as a side effect of full-frame regeneration. The outputs repeatedly showed breakdowns in coloring: small fragmentary shadows, mosaic-like patches, speckled highlights, clusters of tiny paint fragments, a glaring, glittering texture, and unnaturally high density. Even after repeatedly specifying "flat coloring," "no mosaic-like coloring," "organize into large planes," and "do not subdivide," the problem did not stop. The system is not editing the specified local area; it is regenerating the entire frame, so neither preservation of the coloring nor localized retention functions, and the overall coloring style is reconstructed every time.
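On the earlier point about independent factors: for contrast, OpenAI's public Images API accepts the image to be edited, the mask, the prompt, and the output size as separate, structured parameters. The sketch below uses the openai Python package; whether the ChatGPT-internal image_gen.text2im call resembles this in any way is exactly what cannot be observed from the user's side.

```python
from openai import OpenAI  # the public openai Python package

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# In the public Images edit endpoint, the image, the mask, the prompt, and the
# output size are separate, structured parameters, not pieces of one prompt string.
result = client.images.edit(
    image=open("original.png", "rb"),         # the image to be edited
    mask=open("edit_region_mask.png", "rb"),  # transparent where edits are allowed
    prompt="redraw only the lower body; keep everything else unchanged",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```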
Even at the chat-thumbnail stage, the original image data is not handled as-is. From the moment the image is displayed in the chat, it is already no longer the original image itself; what is displayed is a thumbnail or otherwise processed derivative. After that, even when the image engine is invoked, no network traffic corresponding to the image size occurs. In other words, the image data visible in the chat is itself used as the processing target, and the original image file is not fetched again. The image ultimately downloaded is a separately generated image. The entire flow is consistent not with "editing the original image" but with "regeneration using a derivative image as reference."

Although presented as image editing, the actual process is image_gen.text2im / T2I full-frame regeneration. Summarizing the observed facts, the processing structure is consistent:

- The original image file itself is not sent.
- The original image file itself is not retained or reacquired.
- What is referenced is a reduced, converted derivative image.
- The invoked tool is image_gen.text2im.
- The returned metadata is DALL-E generation metadata.
- Even with edit_op: "inpainting", localized editing is not achieved.
- The entire frame, including unspecified areas, changes at the pixel level.
- The hash becomes entirely different.

The process observed in this chat is therefore not image editing. It is image_gen.text2im / T2I full-frame regeneration using a reduced, converted derivative image as input.

In voice input, fixed text not spoken by the user is transmitted. Separate from the image issues, there was also a serious anomaly in input processing. During voice input, the UI displays a waveform and appears to be processing audio. In reality, the spoken content is not transmitted; instead, fixed text such as the following is sent:

- "This transcript may contain references to ChatGPT, OpenAI, DALL·E, GPT-3, GPT-4."
- "This transcript may include references to ChatGPT, OpenAI, DALL·E, GPT-3, GPT-4."

This is not the user's speech, nor a mere speech-recognition mistranscription; an internal boilerplate sentence or notice is being transmitted as user input. So not only in the image-generation system but also in input processing, the state shown in the UI and the content actually transmitted do not match.

This is not a mere quality issue, nor simply a matter of "a bad prompt," "overly complex instructions," or "the editing area expanding." The essence of the problem is:

- The original image itself is not sent, retained, or reacquired.
- A reduced, converted derivative image becomes the processing target.
- The invoked process is image_gen.text2im, and the returned data is DALL-E generation metadata.
- Even when inpainting is displayed, the result is not localized editing; the entire image, including unspecified areas, changes at the pixel level, and the hash becomes entirely different.
- Nevertheless, in the UI context, the operation is treated as "image editing."

This is a problem in which the description "image editing" does not match the actual processing performed. It is a transparency problem, an input-design problem, and a discrepancy between functional labeling and real behavior. Accordingly, the following points should be disclosed and explained:

- Clearly state whether the original image file itself is actually transmitted, retained, and referenced.
- If the image is converted into a derivative image after upload, clearly disclose that specification.
- Clearly explain why the invoked tool is image_gen.text2im.
- Clearly explain why DALL-E generation metadata is returned.
- Clearly explain under what conditions edit_op: "inpainting" is displayed, and what it actually means.
- Clearly state whether the process is localized editing or full-frame regeneration.
- Clearly explain how masks and target editing areas are actually handled.
- Clearly explain how independent factors such as aspect ratio, size, and style-preservation conditions are passed to the engine.
- Clearly explain the relationship among the user input, the assistant-generated prompt, the actual engine input, and the prompt shown in the returned metadata.
- Explain the input anomaly in which internal boilerplate text is inserted during voice input.

The process observed in this chat is not editing of the original image. It is image_gen.text2im / T2I full-frame regeneration using a reduced, converted derivative image as reference. Moreover, it was observed in the following form: it is invoked as image_gen.text2im, it returns DALL-E generation metadata, and it may even be displayed as inpainting, yet in reality it is not localized editing; the entire frame, including unspecified regions, changes at the pixel level, and the hash becomes entirely different.

Under these conditions, presenting the feature as "image editing" is inaccurate. Allowing users to treat it as image editing without clearly disclosing the actual processing gives rise to misunderstanding. This report shows that this conclusion rests on facts observable by the user.

Originally posted by u/lucidity3K on r/ArtificialInteligence